NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

Short roadmap for a first fit with theory uncertainties #292

Closed: scarrazza closed this issue 5 years ago

scarrazza commented 6 years ago

Let's summarize here the steps discussed in Milan in order to obtain a first prototype fit with theory uncertainties:

scarrazza commented 6 years ago

So far, we have started implementing all these points. @rabah-khalek could you paste here a histogram with the MC replica generation time we discussed last week?

wilsonmr commented 6 years ago

So the 3.1 fit with one experiment is running in Edinburgh, and has been since Thursday. The problem is that asking for 8 GB of virtual memory and enough hours for it to finish makes the queue time very long; I still have 100 out of 150 replicas waiting in the queue.

To run these fits feasibly we either need machines with priority time, or perhaps to consider checkpointing so that we could do shorter runs, because 150 replicas look like they are going to take over a week here.

Zaharid commented 6 years ago

@wilsonmr how does that compare with a normal fit? By my measurements, performance wasn't a concern. But does the memory increase very much?

wilsonmr commented 6 years ago

I'm fairly sure we were running with 4 GB vmem previously; I think asking for more than 4 GB of vmem contributes a lot to the increased queue time.

That being said, I think this queue-time issue would apply to any global 3.1 NNLO fit at Edinburgh.

rabah-khalek commented 6 years ago

@scarrazza, sorry, I wasn't very responsive last week. I am generating the histogram of time consumption and will attach it here as soon as possible.

rabah-khalek commented 6 years ago

The correlation plot is there to check for any inconsistencies (changes in CPU usage); the arithmetic average time is 52 seconds per replica: time_consumption.pdf

For reference, this was for 4295 data points in total after kinematic cuts, using this runcard: test_genrep.txt

scarrazza commented 6 years ago

@rabah-khalek great, thanks. It would be great if you could generate a similar plot with PR #279 and check whether the outputs from both codes are similar.

scarrazza commented 6 years ago

Here is the 3.1 fit using a single experiment: https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/ It seems acceptable in terms of performance (it takes the usual 30-40 h per replica) and it is statistically equivalent to the official 3.1.

Zaharid commented 6 years ago

Note that many actions load data from the fit directly and so are comparing exactly the same thing. The PDF plots should be fine and do look consistent. However, these are worrisome:

https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/#training-validation

Zaharid commented 6 years ago

You can see that we are getting less controlled uncertainties, which spills into the positivity plots:

https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/figures/matched_positivity_from_dataspecs4_dataspecs0_plot_positivity.pdf

and the replica plots.

https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/figures/pdf_report_basespecs0_pdfscalespecs0_plot_pdfreplicas_u.pdf (see the lazy one going through the middle)

scarrazza commented 6 years ago

This is a full fit with 30k iterations and 100 replicas, so there is not much to say, except that the preprocessing and t0 may not be perfectly optimized.

Concerning the vp2 report, I also noticed that the chi2 is always identical to that of the base fit ID. I have specified a custom PDF set for the reference fit, so I would expect that:

but the last point is not working; is that expected?

Zaharid commented 6 years ago

I guess the various actions in the report are fits_chi2_table and such, which take fits and load the PDF, theory and experiments based on them; that is what you want for vp-comparefits.

For more general comparisons there is e.g. dataspecs_chi2_table (see `validphys --help dataspecs_chi2_table`), which will resolve the symbols within "dataspecs" however you tell it. See:

https://data.nnpdf.science/validphys-docs/guide.html#general-data-specification-the-dataspec-api
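
To make that concrete, here is a minimal sketch of the kind of "dataspecs" runcard that dataspecs_chi2_table consumes, written as a Python dict dumped to YAML. The keys follow the linked guide, but the speclabels are invented and the second fit ID is a placeholder, so treat it as illustrative rather than a tested runcard:

```python
# Hedged sketch of a dataspecs runcard; speclabels are invented and the
# second fit ID is a placeholder, not a real fit from this thread.
import yaml

runcard = {
    "dataspecs": [
        {
            "speclabel": "3.1 master",       # invented label
            "fit": "181023-001-sc",           # fit ID mentioned in this thread
            "pdf": {"from_": "fit"},
            "experiments": {"from_": "fit"},
            "theory": {"from_": "fit"},
            "theoryid": {"from_": "theory"},
        },
        {
            "speclabel": "big experiment",    # invented label
            "fit": "XXXXXX-XXX-xx",           # placeholder for the bigexp fit ID
            "pdf": {"from_": "fit"},
            "experiments": {"from_": "fit"},
            "theory": {"from_": "fit"},
            "theoryid": {"from_": "theory"},
        },
    ],
    "actions_": ["dataspecs_chi2_table"],
}

print(yaml.safe_dump(runcard, sort_keys=False))
```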

scarrazza commented 6 years ago

Comparison to the 1000 replicas from 3.1 NNLO: https://vp.nnpdf.science/pygXMvLRS7y4FnkZX8xzAQ==

Zaharid commented 6 years ago

It seems a little surprising that a single systematic can do so much damage... Note how most of the outliers in the replica plots are concentrated in the bigexp fit, despite it having far fewer replicas.

scarrazza commented 6 years ago

If I understand correctly, there are two important differences when loading a single experiment:

Zaharid commented 6 years ago

What are CORR systematics? Didn't we check that we don't have any of these across different experiments?

scarrazza commented 6 years ago

I have just performed another pass, more carefully, and I think the current code is fine.

Let's try to remember the mechanism in Experiment::PullData: https://github.com/NNPDF/nnpdf/blob/9da35f35b704683d8d0c884f61bb0ea3860994bd/libnnpdf/src/experiments.cc#L313

After this step, we move on to Experiment::GenCovMat: https://github.com/NNPDF/nnpdf/blob/9da35f35b704683d8d0c884f61bb0ea3860994bd/libnnpdf/src/experiments.cc#L435

Could you please confirm this interpretation?
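
For the record, here is a small numpy paraphrase of my reading of those two steps (an illustration of the mechanism, not the libnnpdf code): PullData collects the statistical, uncorrelated and correlated systematic errors for all points pulled into the experiment, and GenCovMat then builds the covariance matrix, with the correlated systematics acting as rank-one updates shared across every point they touch; that shared block is exactly the part that grows when all datasets are merged into one big experiment.

```python
# A numpy paraphrase (illustration only, not the libnnpdf code) of my reading
# of GenCovMat, after PullData has collected the errors for the experiment.
import numpy as np

def gen_cov_mat(stat, uncorr, corr):
    """stat, uncorr: (ndata,) absolute statistical and uncorrelated errors;
    corr: (ndata, nsys) correlated systematics, already made absolute
    (additive as-is, multiplicative scaled by the t0/central value,
    which is the PullData step)."""
    cov = np.diag(stat**2 + uncorr**2)
    # each correlated systematic is fully correlated across every point it
    # touches; merging all datasets into one experiment enlarges this block
    cov += corr @ corr.T
    return cov

# toy check: three points sharing one correlated systematic
stat = np.array([1.0, 1.0, 1.0])
uncorr = np.array([0.5, 0.5, 0.5])
corr = np.array([[0.3], [0.3], [0.3]])
print(gen_cov_mat(stat, uncorr, corr))
```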

scarrazza commented 6 years ago

Here are the comparisons between:

Conclusion: more evidence that the bigexp is potentially bugged.

wilsonmr commented 6 years ago

I'm not sure where to document this, but today I made a start on understanding the issues with the one big experiment.

I was trying to run vp-setupfit on both runcards, edited so that they generate pseudodata. My hope was that, if the problem is in the sampling of the covariance matrix, I could see it by examining the per-replica chi2 using the fit common data. Having said that, vp-setupfit is really hanging on the big-experiment runcard, so I'm not sure it will finish. I will continue looking at this tomorrow.

wilsonmr commented 6 years ago

OK, so I did what I said I would do; however, I then realised that it wasn't very useful, since I essentially have a single replica and therefore no statistics, so I have no idea whether the difference in chi2 is meaningful or not:

- chi2 between predictions from 181023-001-sc and level-1 data generated for many experiments: 0.9963246694831946
- chi2 between predictions from 181023-001-sc and level-1 data generated for one experiment: 1.0404018331018645

Both chi2s should probably be of order 1. The disparity between the one-experiment chi2 and 1 is an order of magnitude greater, but I would need many more replicas to draw a conclusion. Sampling of the covmat could therefore well be the source of error, but I think it could be easier to find the bug in the code than to construct a more elaborate way of testing it.
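
For reference, a self-contained toy version of this check (my own conventions, not the validphys action): level-1 data sampled from a covariance matrix C should give chi2/ndata close to 1 against the underlying predictions, fluctuating by about sqrt(2/ndata).

```python
# Toy version of the per-replica chi2 check described above (my own
# construction, not the validphys code).
import numpy as np

def chi2_per_point(data, theory, covmat):
    diff = data - theory
    # solve C x = diff via Cholesky instead of forming C^{-1} explicitly
    chol = np.linalg.cholesky(covmat)
    x = np.linalg.solve(chol, diff)
    return float(x @ x) / len(data)

rng = np.random.default_rng(0)
ndata = 500
theory = np.ones(ndata)
# toy covariance: diagonal noise plus a few correlated systematics
beta = 0.05 * rng.normal(size=(ndata, 5))
cov = np.diag(np.full(ndata, 0.01)) + beta @ beta.T
# sample one level-1 pseudodata replica from cov
data = theory + np.linalg.cholesky(cov) @ rng.normal(size=ndata)
print(chi2_per_point(data, theory, cov))  # ~1.0, fluctuating by ~sqrt(2/500)
```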

wilsonmr commented 6 years ago

@scarrazza I also have a breakdown by dataset here

Click to expand
```
                          one exp              master
                          ndata  chi2/ndata    ndata  chi2/ndata
ATLAS1JET11                31.0    0.969544     31.0    0.614544
ATLASLOMASSDY11EXT          6.0    0.649722      6.0    1.095470
ATLASR04JETS2P76TEV        59.0    6.452978     59.0    1.329281
ATLASR04JETS36PB           90.0    1.302735     90.0    0.784377
ATLASTOPDIFF8TEVTRAPNORM   10.0    1.123830     10.0    1.494451
ATLASTTBARTOT               3.0    0.627229      3.0    0.340363
ATLASWZRAP11               34.0    0.688400     34.0    0.997867
ATLASWZRAP36PB             30.0    1.234040     30.0    1.013619
ATLASZHIGHMASS49FB          5.0    1.914251      5.0    0.728672
ATLASZPT8TEVMDIST          44.0    0.822210     44.0    0.967914
ATLASZPT8TEVYDIST          48.0    0.403290     48.0    0.377827
BCDMSD                    248.0    0.956049    248.0    0.868745
BCDMSP                    333.0    1.031702    333.0    1.015092
CDFR2KT                    76.0    1.396881     76.0    0.980050
CDFZRAP                    29.0    1.141392     29.0    1.200180
CHORUSNB                  416.0    0.807629    416.0    0.890862
CHORUSNU                  416.0    0.926462    416.0    0.914832
CMS1JET276TEV              81.0    0.965929     81.0    0.874193
CMSDY2D11                 110.0    0.920325    110.0    0.924618
CMSJETS11                 133.0    1.065505    133.0    0.793712
CMSTOPDIFF8TEVTTRAPNORM    10.0    0.600950     10.0    0.687690
CMSTTBARTOT                 3.0    0.064822      3.0    0.804884
CMSWEASY840PB              11.0    0.848673     11.0    1.426723
CMSWMASY47FB               11.0    1.368294     11.0    1.289498
CMSWMU8TEV                 22.0    0.802183     22.0    2.158299
CMSZDIFF12                 28.0    0.247326     28.0    0.650090
D0WEASY                     8.0    1.183201      8.0    0.857144
D0WMASY                     9.0    0.749494      9.0    0.633287
D0ZRAP                     28.0    1.183911     28.0    1.374727
DYE605                     85.0    1.262706     85.0    1.028439
DYE886P                    89.0    1.208059     89.0    1.089166
DYE886R                    15.0    0.983193     15.0    0.946225
H1HERAF2B                  12.0    0.669507     12.0    0.492801
HERACOMBCCEM               42.0    0.967536     42.0    1.044062
HERACOMBCCEP               39.0    1.009988     39.0    0.683417
HERACOMBNCEM              159.0    0.865649    159.0    1.167475
HERACOMBNCEP460           204.0    1.028335    204.0    1.186964
HERACOMBNCEP575           254.0    0.907054    254.0    1.080998
HERACOMBNCEP820            70.0    0.986784     70.0    0.843086
HERACOMBNCEP920           377.0    0.935256    377.0    1.181384
HERAF2CHARM                37.0    1.194203     37.0    1.099026
LHCBWZMU7TEV               29.0    0.917800     29.0    0.926044
LHCBWZMU8TEV               30.0    0.654508     30.0    0.753092
LHCBZ940PB                  9.0    2.029541      9.0    1.685865
LHCBZEE2FB                 17.0    0.641049     17.0    0.681717
NMC                       204.0    1.003102    204.0    1.065402
NMCPD                     121.0    0.900942    121.0    1.162778
NTVNBDMN                   37.0    1.206507     37.0    0.956581
NTVNUDMN                   39.0    1.214189     39.0    0.959539
SLACD                      34.0    1.136617     34.0    0.997252
SLACP                      33.0    0.534122     33.0    0.890408
ZEUSHERAF2B                17.0    1.287272     17.0    0.546934
```

In particular, there is a chi2 of about 6.5 for ATLASR04JETS2P76TEV when comparing to the one-experiment level-1 data; it's unclear whether this is evidence of the bug or just what we could expect statistically.

scarrazza commented 6 years ago

Thanks, so you have computed the breakdown using both of my fits (bigexp and 3.1master), right? Yeah, this jet dataset contains the special TH* systematics, so maybe that's the problem... in which case we probably have a big problem in the artificial data generation.

wilsonmr commented 6 years ago

Sorry, the headings are quite complicated; I will try to explain. I took the runcards for both of your fits and changed them minimally to be level-1 closure-test runcards, then I ran vp-setupfit on these two new runcards (so essentially generated a single replica of pseudodata each). I then take the PDF from your 3.1master fit (although I'm now thinking I should actually use the fakepdf, NNPDF31_nnlo_as_0118) and calculate the chi2 for both of my fake closure tests using the fit common data. We would expect the total chi2 to be 1 if the data were generated correctly. I then also provide the same information broken down per dataset.

Does this make sense? It's a bit convoluted, but the point is that the data generated for a single replica of ATLASR04JETS2P76TEV have a much higher chi2 than one would expect (a chi2 per data point of ~6 for 59 data points).
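
To put a rough number on "much higher than one would expect" (a Gaussian approximation, and assuming the same covmat is used both to generate and to evaluate the data):

```python
# Back-of-the-envelope significance: a chi2 with N degrees of freedom has
# mean N and variance 2N, so chi2/N ~ 1 +- sqrt(2/N) for correct sampling.
import math

ndata, observed = 59, 6.45   # ATLASR04JETS2P76TEV vs the one-exp level-1 data
sigma = math.sqrt(2 / ndata)  # ~0.18
print((observed - 1) / sigma)  # ~30, i.e. far too large for a fluctuation
```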

scarrazza commented 6 years ago

OK, I see; yes, that sounds like an interesting test.

wilsonmr commented 6 years ago

OK, so I have slightly refined the process of generating replicas and comparing to the underlying PDF. The replicas take a long time to generate (especially for the big experiment), but I will generate 100 replicas' worth of data and produce the same information as above, now with statistics; since the issue is quite subtle, I think the difference won't be apparent without a decent number of replicas. I should have results this afternoon.

scarrazza commented 6 years ago

I have performed two DIS-only fits (3.1 NNLO setup):

Here is the report: https://vp.nnpdf.science/UNUUK4CjRvmIlps-dXV14Q==/ It looks like the differences are much smaller than for a fit to the full 3.1 dataset.

Zaharid commented 6 years ago

It doesn't seem like the same thing:

https://vp.nnpdf.science/UNUUK4CjRvmIlps-dXV14Q==/figures/matched_positivity_from_dataspecs4_dataspecs0_plot_positivity.pdf

wilsonmr commented 6 years ago

On Friday I made the table which I said I would do; however, it occurred to me that comparing chi2 per dataset is not the most sensible thing: datasets where the systematics are an important contribution to the covariance matrix have a poor chi2 for both of the original two fits. I think generating the data as per each fit and then calculating the chi2 by experiment, with the experiments defined in the standard way, would give a more reliable comparison, since it's not clear to me that comparing the chi2s makes any sense at all if we drop inter-dataset systematics (see the toy sketch below).
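
As a toy illustration of that last worry (my own construction, not the production code): if pseudodata are generated with the full covariance matrix but the chi2 is evaluated with the cross-dataset block dropped, the mean chi2 per point stays at 1 (the cross blocks contribute nothing to the trace), and only the spread changes, and here only slightly; which is one reason per-dataset chi2 tables are a weak probe of a sampling mismatch.

```python
# Generate shifts with the full covmat, evaluate chi2 with and without the
# cross-dataset correlations (toy numbers, my own construction).
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 30, 30
n = n1 + n2
beta = np.full((n, 1), 0.1)  # one systematic correlated across both datasets
full_cov = np.diag(np.full(n, 0.04)) + beta @ beta.T
block_cov = full_cov.copy()
block_cov[:n1, n1:] = 0.0    # drop the cross-dataset block
block_cov[n1:, :n1] = 0.0

def chi2_per_point(diff, cov):
    return float(diff @ np.linalg.solve(cov, diff)) / len(diff)

# 1000 pseudodata shifts sampled from the *full* covariance matrix
shifts = np.linalg.cholesky(full_cov) @ rng.normal(size=(n, 1000))
for label, cov in (("full covmat", full_cov), ("cross block dropped", block_cov)):
    vals = [chi2_per_point(d, cov) for d in shifts.T]
    print(label, round(float(np.mean(vals)), 3), round(float(np.std(vals)), 3))
# both means come out ~1.0; only the spread differs, and only slightly here
```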

I also think it's curious that the positivity sets are very comparable between the fits if one uses robust statistics; report to follow.

Are we sure there is actually a problem? I was speaking to Luigi about it, and he was curious whether, if we had a 1000-replica fit with one experiment, we would see the same proportion of outliers in the E_valid vs E_train plot, or whether the outliers we see in the 100-replica fit are just unfortunate for that run.

wilsonmr commented 6 years ago

So actually, when I use full quantile statistics, the positivity plots look incredibly similar:

https://vp.nnpdf.science/XOVZ4BYmTCSq8ZPcyO9TFA==
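
A toy illustration (my own numbers, not the validphys code) of why quantile statistics tame a single bad replica: one runaway replica drags mean-and-standard-deviation bands around, but barely moves the central 68% interval.

```python
# One outlier replica distorts mean +- std, while the 16-84 percentile
# interval stays essentially unchanged.
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(1.0, 0.1, size=100)  # stand-in for a positivity observable
values[0] = 10.0                         # one runaway "lazy" replica

print(values.mean(), values.std())       # distorted by the outlier
print(np.percentile(values, [16, 84]))   # essentially unchanged
```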

Zaharid commented 6 years ago

@wilsonmr you are right that it is better to compare by experiment, but I am still curious to see the table per dataset. It is a bad idea to read too much into the per-dataset chi2 (the alphas paper is a good proof of that), but it is still interesting to see whether we can find some difference when sampling. Note that the assumption is that the two things are exactly statistically equivalent, so it doesn't really matter if the thing we are looking at is a bit crazy (also, in practice we do get chi2 ~ 1 for the dataset chi2 at level 1). This is a long way of saying that the table will be interesting.

As for the fits, I take your point that the DIS one may be bad luck, but the hadronic ones are clearly different: there are more outliers among the ~100 single-experiment replicas than in the 1000 original ones.

wilsonmr commented 6 years ago

Well, here is the table; I can make a neater version.

Click to expand
```
                          master31              oneexp
                          mean      std dev     mean      std dev
NMCPD                     1.023462  0.017986    1.001799  0.020072
NMC                       0.996017  0.012748    0.994399  0.016455
SLACP                     0.997704  0.033720    1.020341  0.033962
SLACD                     0.996347  0.030855    0.986775  0.030508
BCDMSP                    1.015253  0.014836    1.021481  0.014140
BCDMSD                    0.998369  0.014540    1.006625  0.015904
CHORUSNU                  1.021312  0.022834    1.035280  0.023504
CHORUSNB                  0.999729  0.020384    1.009661  0.025169
NTVNUDMN                  1.057533  0.030362    1.013755  0.034507
NTVNBDMN                  0.998571  0.031407    1.035660  0.035983
HERACOMBNCEM              0.986811  0.014309    0.992071  0.017127
HERACOMBNCEP460           0.965997  0.014546    0.963482  0.013856
HERACOMBNCEP575           0.972384  0.012758    0.992599  0.011634
HERACOMBNCEP820           0.970765  0.021411    0.969164  0.021406
HERACOMBNCEP920           0.962127  0.010004    0.981842  0.008610
HERACOMBCCEM              1.002665  0.035660    0.963569  0.026441
HERACOMBCCEP              1.000500  0.033141    1.032224  0.031751
HERAF2CHARM               0.992811  0.029782    1.022785  0.033396
H1HERAF2B                 1.185254  0.114555    1.186588  0.180852
ZEUSHERAF2B               1.062480  0.069183    1.021370  0.059660
DYE886R                   0.988104  0.050917    1.023311  0.049690
DYE886P                   1.045993  0.029969    1.029622  0.024639
DYE605                    1.072305  0.062079    1.026494  0.045192
CDFZRAP                   0.991138  0.044065    0.999777  0.043211
CDFR2KT                   1.030743  0.056045    1.047724  0.063110
D0ZRAP                    0.977952  0.035166    0.983970  0.037317
D0WEASY                   0.949220  0.069425    0.894239  0.059862
D0WMASY                   1.021667  0.069359    1.050197  0.069756
ATLASWZRAP36PB            1.050495  0.040940    1.070430  0.039945
ATLASZHIGHMASS49FB        1.054085  0.081315    0.826218  0.074451
ATLASLOMASSDY11EXT        0.929185  0.079650    1.019348  0.088354
ATLASWZRAP11              1.038140  0.037403    0.969782  0.035628
ATLASR04JETS36PB          1.615481  0.107966    1.563697  0.107855
ATLASR04JETS2P76TEV       1.854520  0.287730    2.018156  0.423038
ATLAS1JET11               0.986339  0.038739    0.920001  0.036310
ATLASZPT8TEVMDIST         0.864450  0.028814    0.839013  0.023975
ATLASZPT8TEVYDIST         0.373459  0.013964    0.377980  0.012422
ATLASTTBARTOT             0.962451  0.111066    0.975158  0.113739
ATLASTOPDIFF8TEVTRAPNORM  0.972642  0.060197    1.067689  0.066461
CMSWEASY840PB             0.996085  0.055115    1.009131  0.062305
CMSWMASY47FB              1.063474  0.064959    1.020560  0.060954
CMSDY2D11                 0.961270  0.017661    0.940930  0.018464
CMSWMU8TEV                1.014193  0.045810    1.038297  0.038482
CMSJETS11                 1.034817  0.030633    1.037965  0.032541
CMS1JET276TEV             1.195315  0.206902    1.025630  0.042650
CMSZDIFF12                0.516135  0.023477    0.515497  0.020205
CMSTTBARTOT               1.103902  0.112736    1.071359  0.117339
CMSTOPDIFF8TEVTTRAPNORM   0.993586  0.062438    0.991173  0.056748
LHCBZ940PB                0.890035  0.056323    0.867724  0.060933
LHCBZEE2FB                0.959198  0.041672    1.008486  0.043602
LHCBWZMU7TEV              0.954790  0.034337    0.984984  0.035422
LHCBWZMU8TEV              0.998025  0.034601    1.032321  0.041020
```

They appear consistent within error.

Zaharid commented 6 years ago

I am going to bet that

ATLASR04JETS2P76TEV       1.854520           0.287730  2.018156           0.423038

is not a fluke. In any case, it seems an incredible fluke that you got such a bad chi² for the one replica.
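
For what it's worth, a rough significance for that row, under the assumption that the quoted standard deviations are the spread over the ~100 replicas (so the uncertainty on each mean is roughly std/sqrt(100)), and treating the replicas as independent:

```python
# Rough compatibility check for the ATLASR04JETS2P76TEV row of the table;
# mean_diff_significance is my own helper name, not a validphys function.
import math

def mean_diff_significance(m1, s1, m2, s2, nrep=100):
    err = math.sqrt((s1**2 + s2**2) / nrep)
    return (m1 - m2) / err

# master31 vs oneexp means for ATLASR04JETS2P76TEV
print(mean_diff_significance(1.854520, 0.287730, 2.018156, 0.423038))  # ~ -3.2
```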

Zaharid commented 6 years ago

@wilsonmr What is the status here?

wilsonmr commented 6 years ago

Well, I agree that those things aren't statistically the same; I'm not really sure where to go from here. Do we want more tests of some description? Apologies, I just had teaching and meetings today, so progress was limited at best.

wilsonmr commented 6 years ago

I made some plots of the actual distributions of the chi2 per dataset, as histograms and KDE plots depending on preference: https://vp.nnpdf.science/bEWJsYXdROyKmG4ZUV1tMQ==

ATLASR04JETS2P76TEV 1.854520 0.287730 2.018156 0.423038

seems to be explained by a single outlier among the big-experiment replicas, for example.
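
For anyone wanting to reproduce these, here is a sketch of the kind of plot in that report (assumed inputs and an invented helper name, not the actual report code):

```python
# Histogram plus Gaussian-KDE overlay of per-replica chi2/ndata for each fit.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_chi2_distributions(samples, dataset):
    """samples: mapping of fit label -> 1D array of per-replica chi2/ndata."""
    lo = min(s.min() for s in samples.values())
    hi = max(s.max() for s in samples.values())
    grid = np.linspace(lo, hi, 200)
    for label, sample in samples.items():
        plt.hist(sample, bins=20, density=True, alpha=0.4, label=label)
        plt.plot(grid, gaussian_kde(sample)(grid))
    plt.xlabel("chi2/ndata")
    plt.title(dataset)
    plt.legend()
    plt.show()

# toy usage with made-up numbers matching the table's means and spreads
rng = np.random.default_rng(2)
plot_chi2_distributions(
    {"master31": rng.normal(1.85, 0.29, 100), "oneexp": rng.normal(2.02, 0.42, 100)},
    "ATLASR04JETS2P76TEV",
)
```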

wilsonmr commented 6 years ago

@scarrazza did you manage to run the big experiment fit with more replicas?

scarrazza commented 6 years ago

Yes, this morning I uploaded 380 replicas (fit 181115-001-sc); I am waiting for the vp2 report to complete...

scarrazza commented 6 years ago

Here is the report: https://vp.nnpdf.science/UwRHr5VCR7yI-SXHPt_sSQ==

I think we can now seriously consider the idea that the code is fine.

wilsonmr commented 6 years ago

Great!

Zaharid commented 6 years ago

Indeed, I wouldn't have said there is a problem based on this report. So maybe it was a bad fluke after all. I can't help noticing there is still a weird outlier in the down quark, but that by itself should not indicate a bug. Could we see a report comparing the two bigexp things?

scarrazza commented 5 years ago

Here is the report comparing both bigexp fits: https://vp.nnpdf.science/KTzrle5FQGuuBdcigkDKnQ==/

Zaharid commented 5 years ago

I think this can now be closed!