NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

No agreement between parallel gpu and sequential cpu fits #2118

Open · RoyStegeman opened this issue 3 weeks ago

RoyStegeman commented 3 weeks ago

I performed fits using both the parallel GPU and sequential CPU settings: https://vp.nnpdf.science/HrLhzWqgSEyMVbKRWuSmOA==/, but unfortunately the results do not agree. Aron previously found agreement between the two, so I would like to understand why I do not find the same.

Possible reasons are a problem with my environment or hardware (or the interplay between them), or a bug introduced in the code over the past months.
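
A quick sanity check (not part of the NNPDF code base) is to print the TensorFlow/Keras versions and the devices TensorFlow actually sees on each machine, since an environment or hardware mismatch would show up there first:

```python
# Minimal environment check, not an NNPDF tool: report the TensorFlow/Keras
# versions and the devices TensorFlow can see, which is the first thing to
# compare when two machines give different fit results.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Keras:", tf.keras.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))
print("CPUs:", tf.config.list_physical_devices("CPU"))
```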

Cmurilochem commented 3 weeks ago

Hi @RoyStegeman and @Radonirinaunimi.

I ran the provided runcard using the branch I have been using for the hyperopt experiments. I added parallel_models: true to it and ran with 120 replicas (so as to get 100 later in postfit). Apparently, the fit just stops at epoch 1065; see part of the output below:

[INFO]: Epoch 100/17000: loss: 346479296.0000000
Validation loss after training step: 6054287.0000000.
Validation chi2s:
[INFO]: Epoch 200/17000: loss: 2397694464.0000000
Validation loss after training step: 898655.7500000.
Validation chi2s:
[INFO]: Epoch 300/17000: loss: 210753856.0000000
Validation loss after training step: 709461.3750000.
Validation chi2s:
[INFO]: Epoch 400/17000: loss: 109637336.0000000
Validation loss after training step: 1609623.7500000.
Validation chi2s:
[INFO]: Epoch 500/17000: loss: 142071760.0000000
Validation loss after training step: 2423371.0000000.
Validation chi2s:
[INFO]: Epoch 600/17000: loss: 237521776.0000000
Validation loss after training step: 356872.8125000.
Validation chi2s:
[INFO]: Epoch 700/17000: loss: 81304864.0000000
Validation loss after training step: 4219186.0000000.
Validation chi2s:
[INFO]: Epoch 800/17000: loss: 125517096.0000000
Validation loss after training step: 5745484.5000000.
Validation chi2s:
[INFO]: Epoch 900/17000: loss: 65846024.0000000
Validation loss after training step: 1957665.2500000.
Validation chi2s:
[INFO]: Epoch 1000/17000: loss: 49473056.0000000
Validation loss after training step: 1433478.5000000.
Validation chi2s:
[WARNING]:  > NaN found, stopping activated
[INFO]: Stopped at epoch=1065
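
For context, a conceptual sketch of why a single bad replica is enough to trigger that warning (this is not the actual n3fit stopping logic, and it assumes the per-replica losses of a parallel fit are combined into a single training loss):

```python
# Conceptual sketch, not the n3fit implementation: if the per-replica losses
# of a parallel fit are reduced to one training loss, a NaN in any one of the
# replicas makes the combined loss NaN and trips the stopping check.
import numpy as np

per_replica_loss = np.array([49473056.0, 65846024.0, np.nan, 81304864.0])
total_loss = per_replica_loss.sum()  # the NaN propagates through the reduction

if not np.isfinite(total_loss):
    print("NaN found, stopping activated")  # mirrors the warning in the log above
```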

Just in case, I also attach here the run_fit.slurm script I am using on Snellius and the output.

I am also trying to run it with parallel_models: false, but it is taking much longer than I expected.

Radonirinaunimi commented 3 weeks ago

@Cmurilochem This seems really strange and indeed points to something being wrong somewhere. But now I am wondering how the hyperopts are even working? Which commit exactly are you using?

Cmurilochem commented 3 weeks ago

Hi @Radonirinaunimi. The last commit of my local master is 613bfbc651c4439ba6b555e6179e5a7b797dcdcf (Add 'average_best' and non-default 'penalties_in_loss' in hyperopt_studies runcards).

I am also interested in checking the results from before https://github.com/NNPDF/nnpdf/pull/2014, so I will run a test after checking out fb3f384495f1d0d1daa845cdb784e66e47558c66 (Merge pull request #2065 from NNPDF/add_hyperopt_limited_space_runcard).

goord commented 1 week ago

Testing the pull request faadba05dc with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Radonirinaunimi commented 1 week ago

> Testing the pull request faadba05dc with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for checking @goord. So we now indeed have consensus that it is not working, even for the old commits.

goord commented 1 week ago

Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...
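
If it helps, one quick way to see which distribution is actually installed in a given environment (a hypothetical helper, not an NNPDF tool):

```python
# Hypothetical check: `tensorflow` and the deprecated `tensorflow-gpu` wheel
# are separate PyPI packages, so query the installed metadata for both.
from importlib import metadata

for pkg in ("tensorflow", "tensorflow-gpu"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```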

Radonirinaunimi commented 1 week ago

> Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

Yes, I agree. Do you have some ideas on how to proceed?

> I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...

I don't think this is the case though. I tried it yesterday with the exact same environment as in this report and still got the wrong results. If the package were the cause, using the same environment should have reproduced the results.

goord commented 1 week ago

Hi @Radonirinaunimi, but the report only mentions a code version, right? Not whether the GPU results were obtained with https://pypi.org/project/tensorflow-gpu/2.10.0/ rather than the official TF 2.10.

In any case, my plan is to take the current master and compare the parallel and sequential losses for the first few epochs, probably using the simple runcard. If those are equivalent, the difference could be in the handling of the stopping conditions or some other post-processing step.
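
As a rough sketch of that comparison, assuming the log format shown earlier in this thread and using placeholder file names:

```python
# Rough sketch with hypothetical file names: parse the "Epoch N/...: loss: X"
# lines from two n3fit logs and compare the training losses epoch by epoch.
import re

PATTERN = re.compile(r"Epoch (\d+)/\d+: loss: ([0-9.]+)")

def losses(logfile):
    """Return {epoch: training loss} extracted from an n3fit log file."""
    with open(logfile) as stream:
        return {int(m.group(1)): float(m.group(2)) for m in map(PATTERN.search, stream) if m}

sequential = losses("sequential_cpu.log")  # placeholder paths
parallel = losses("parallel_gpu.log")
for epoch in sorted(set(sequential) & set(parallel)):
    rel = abs(sequential[epoch] - parallel[epoch]) / max(abs(sequential[epoch]), 1e-12)
    print(f"epoch {epoch:5d}: seq={sequential[epoch]:.6g}  par={parallel[epoch]:.6g}  rel. diff={rel:.2e}")
```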

RoyStegeman commented 1 week ago

> Testing the pull request https://github.com/NNPDF/nnpdf/commit/faadba05dc1128107553032c0ba619cc21ef67e5 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for doing the tests. At that point there was some incompatibility between n3fit and TensorFlow 2.16, because Aron's implementation of the MultiDense layers used the TensorFlow-internal object Dense.kernel, while at the moment we access Dense._kernel. Looking at the Keras code you can see that it's not a public attribute, so it's something that can break MultiDense without warning: https://github.com/keras-team/keras/blob/v3.3.3/keras/src/layers/core/dense.py#L15-L603

The PR that made n3fit compatible with tf2.16: https://github.com/NNPDF/nnpdf/pull/1975
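
To illustrate how fragile this is, here is a minimal sketch (not the actual n3fit code) of a defensive accessor that works whether Keras exposes the weights through the public kernel property or only stores them as the private _kernel attribute:

```python
# Minimal sketch, not the n3fit implementation: read a Dense layer's weight
# matrix whether Keras provides the public `kernel` property or only the
# private `_kernel` attribute, which can change between releases without warning.
import tensorflow as tf

def dense_kernel(layer):
    """Return the kernel variable of a built Dense layer."""
    for attr in ("kernel", "_kernel"):
        kernel = getattr(layer, attr, None)
        if kernel is not None:
            return kernel
    raise AttributeError("Dense layer exposes neither `kernel` nor `_kernel`")

layer = tf.keras.layers.Dense(4)
layer.build((None, 3))            # create the weights so the kernel exists
print(dense_kernel(layer).shape)  # (3, 4)
```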

scarlehoff commented 1 week ago

I can have a look at this, but looking through the PR I'm no longer sure Aron ever found agreement with the full current version (multidense, trvl, etc.).

I'll do it in the context of #1939

goord commented 4 days ago

@scarlehoff do you find agreement between sequential and parallel fits without the multidense layer type? I recall that @Radonirinaunimi reverted to the trvl-mask-layer branch and still did not find agreement even without multidense layers...

scarlehoff commented 4 days ago

Are you sure that the latest version of the trvl-mask-layer branch was not rebased on top of multidense?

In any case, maybe there was another bug in there in the meantime. I don't think that branch existed by itself for long (it was constantly being merged/rebased on top of other branches).

goord commented 4 days ago

Yes, it had been rebased on top of master when multidense was already in there, I realize now. @Radonirinaunimi this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers and then debug the multidense with masking layers.

Radonirinaunimi commented 4 days ago

> Yes, it had been rebased on top of master when multidense was already in there, I realize now. @Radonirinaunimi this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers and then debug the multidense with masking layers.

Yes, that's right! It's a shame that I did not notice it before. Now with https://github.com/NNPDF/nnpdf/pull/2127 I can also get agreement (with a V100).

goord commented 4 days ago

@APJansen did perform a regression test https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/ but it's not clear to me whether it uses parallel replicas (with the same training/validation mask) or sequential ones.