Open tomtong2015 opened 2 months ago
Hi, not sure exactly what the issue is (maybe @ElieHammou or @comane can help more), but when you do:
n3fit tom_01.yml 1000
you are not generating 1000 replicas; you are generating replica number 1000.
To generate 1000 replicas you need to launch a separate job for each replica (better) or run them in a loop.
If you run with fixed_pdf_fit: False, does it work?
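For the loop option, here is a minimal sketch in Python. It only assumes the command form already shown above (n3fit, then the runcard, then a single replica index); the helper names replica_commands and run_replicas are made up for illustration and are not part of n3fit or SIMUnet.

```python
import subprocess


def replica_commands(runcard, first, last):
    """Build one `n3fit <runcard> <replica>` command per replica index (inclusive range)."""
    return [["n3fit", runcard, str(index)] for index in range(first, last + 1)]


def run_replicas(runcard, first, last):
    """Run the replicas one after another; stop at the first command that exits with an error."""
    for command in replica_commands(runcard, first, last):
        subprocess.run(command, check=True)


# Example (hypothetical call, runs n3fit 1000 times in sequence):
# run_replicas("tom_01.yml", 1, 1000)
```

Running sequentially like this is the slow option; on a cluster the same list of commands could instead be submitted as separate jobs, which is the approach recommended above.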
Hi Tom, I was discussing with @FrancescoMerlotti and he reminded me that there may be an issue with fixed-PDF fits with only one coefficient on Mac systems. Could you try turning on another coefficient for 1 replica to see if the error goes away? Also, if you have access to a Linux system, it could be useful to test the runcard there, as I think this issue does not come up on Linux.
Hi Luca, thank you very much! I will launch different jobs for more replicas.
Yes, it seems to work. I switched True to False while keeping everything else the same:
############################################################
# Uncomment to perform fixed-PDF fit
fixed_pdf_fit: False
load_weights_from_fit: 221103-jmm-no_top_1000_iterated
Below is the last part of the output:
[INFO]: At epoch 9800/30000, total chi2: 1.5794931972728057
HERACOMB: 1.841, CMS: 0.978, ATLAS: 0.465, LEP: 1.974, ATLAS-CMS: 0.866, total: 1.579
Validation chi2 at this point: 2.702366352081299
[INFO]: ['8.47e-03']
[INFO]: At epoch 9900/30000, total chi2: 1.5831192203596527
HERACOMB: 1.848, CMS: 0.974, ATLAS: 0.470, LEP: 1.974, ATLAS-CMS: 0.866, total: 1.583
Validation chi2 at this point: 2.711261510848999
[INFO]: ['9.96e-03']
[INFO]: At epoch 10000/30000, total chi2: 1.580635497149299
HERACOMB: 1.844, CMS: 0.980, ATLAS: 0.452, LEP: 1.974, ATLAS-CMS: 0.865, total: 1.581
Validation chi2 at this point: 2.707155227661133
[INFO]: ['1.20e-02']
[INFO]: At epoch 10100/30000, total chi2: 1.573520331289254
HERACOMB: 1.841, CMS: 0.947, ATLAS: 0.454, LEP: 1.974, ATLAS-CMS: 0.865, total: 1.574
Validation chi2 at this point: 2.71159029006958
[INFO]: ['1.05e-02']
[INFO]: At epoch 10200/30000, total chi2: 1.5766352560005936
HERACOMB: 1.845, CMS: 0.944, ATLAS: 0.474, LEP: 1.974, ATLAS-CMS: 0.865, total: 1.577
Validation chi2 at this point: 2.7088916301727295
[INFO]: ['3.16e-03']
[INFO]: Stopped at epoch=10223
1/1 [==============================] - 0s 291ms/step
1/1 [==============================] - 0s 309ms/step
1/1 [==============================] - 0s 16ms/step
1/1 [==============================] - 0s 16ms/step
1/1 [==============================] - 0s 17ms/step
[INFO]: Best fit for replica #123, chi2=1.021 (tr=1.635, vl=2.492)
[INFO]: > Saving the weights for future in /Users/tomtong/Desktop/SIMUnet/SIMUnet_runs/tom_02/nnfit/replica_123/weights.h5
(simunet) tomtong@Toms-Air SIMUnet_runs %
Is it running properly?
Hi Elie, many thanks to both of you! I truly appreciate the help!
As Luca pointed out, fixed-PDF seems to be the issue. Of course, I'm in no position to make such conclusions. You guys are the experts.
I also tried a fixed-PDF fit with 3 Wilson coefficients turned on. It seems that the problem remains. Below is the last part of the output:
==================================================================================================
Total params: 27,554
Trainable params: 3
Non-trainable params: 27,551
__________________________________________________________________________________________________
[INFO]: Using weights from fit: 221103-jmm-no_top_1000_iterated
[INFO]: Loading weights from path: /opt/anaconda3/envs/simunet/share/NNPDF/results/221103-jmm-no_top_1000_iterated/nnfit/replica_456/weights.h5
[WARNING]: > NaN found, stopping activated
[INFO]: Stopped at epoch=1
1/1 [==============================] - 0s 287ms/step
1/1 [==============================] - 0s 300ms/step
1/1 [==============================] - 0s 15ms/step
1/1 [==============================] - 0s 15ms/step
1/1 [==============================] - 0s 16ms/step
[INFO]: Best fit for replica #456, chi2=nan (tr=nan, vl=1.928)
[INFO]: > Saving the weights for future in /Users/tomtong/Desktop/SIMUnet/SIMUnet_runs/tom_03/nnfit/replica_456/weights.h5
(simunet) tomtong@Toms-Air SIMUnet_runs %
Well, it also could be an issue with the infamous Apple silicon and the translation. I can try it on a Linux system after I get our admin's approval.
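When many replicas are launched, it may help to scan the saved output for fits that ended in NaN rather than reading each log by hand. The sketch below is only an illustration: it assumes the "Best fit for replica" line has exactly the format shown in the outputs above, and the function name failed_replicas is hypothetical.

```python
import math
import re

# Matches lines such as:
# [INFO]: Best fit for replica #456, chi2=nan (tr=nan, vl=1.928)
BEST_FIT = re.compile(
    r"Best fit for replica #(?P<replica>\d+), "
    r"chi2=(?P<chi2>\S+) \(tr=(?P<tr>[^,]+), vl=(?P<vl>[^)]+)\)"
)


def failed_replicas(log_text):
    """Return the replica numbers whose reported chi2, tr or vl value is NaN."""
    failed = []
    for match in BEST_FIT.finditer(log_text):
        values = [float(match.group(key)) for key in ("chi2", "tr", "vl")]
        if any(math.isnan(value) for value in values):
            failed.append(int(match.group("replica")))
    return failed
```

Applied to the two outputs in this thread, it would flag replica 456 (chi2=nan) and not replica 123.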
Hi Tom, I think it is indeed an issue with Apple Silicon; it might be related to some version of TensorFlow for Mac. I have to find it! As Elie said, a Linux machine should work just fine, and the runcard should be right as well.
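To help narrow down the Apple Silicon and TensorFlow version question, it could be useful to attach a short environment report to the issue. This is a hedged sketch using only the Python standard library; the function name environment_report is made up for illustration, and the package list is a guess at what might be installed (tensorflow-macos and tensorflow-metal are separate PyPI packages that may or may not be present in the simunet environment).

```python
import platform
from importlib import metadata


def environment_report(packages=("tensorflow", "tensorflow-macos", "tensorflow-metal", "numpy")):
    """Collect platform details and the installed version of each package (None if absent)."""
    report = {
        "system": platform.system(),
        # On a Mac, 'arm64' indicates a native Apple Silicon Python,
        # while 'x86_64' indicates an Intel build (possibly running under Rosetta translation).
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Run inside the activated conda environment, this prints one line per entry, which can be pasted alongside the n3fit output.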
Hi Francesco, thank you very much! I'll try a Linux machine as soon as possible and come back to you for further guidance.
System:
MacBook Air with M1 chip
Memory 8GB
macOS 12.6
Environment:
Latest Python + Anaconda
Created a conda environment, simunet, according to your tutorial
All dependencies installed successfully, and the environment activated.
SIMUnet has been downloaded, compiled under the environment, and installed successfully.
Runcard:
The runcard is based on your example
Following is my full runcard:
Modifications in the runcard:
Since I'd like to try a fixed-PDF fit, the following flag has been uncommented:
As a test run, all SMEFT operators have been commented out except for one, OtG, which has been turned on:
The theory id has been set to 270. Thank you, Elie!
Full output messages:
I was trying to make 1000 replicas.
The process appears to have stopped too early, indicated by
and
Thank you very much in advance!