So far, we have started the implementation of all these points. @rabah-khalek could you paste here the histogram of the MC replica generation time we discussed last week?
So the 3.1 fit with one experiment is running in Edinburgh, and has been since Thursday. The problem is that asking for 8 GB of virtual memory and enough hours for the job to finish makes the queue time very long; I still have 100 out of 150 replicas waiting in the queue.
To feasibly run these fits we either need machines with priority time, or perhaps we should consider checkpointing so that we could do shorter runs, because 150 replicas looks like it is going to take over a week here.
@wilsonmr how does that compare with a normal fit? By my measures, performance wasn't a concern. But does memory increase very much?
I'm sure we were running with 4 GB of vmem previously; I think asking for more than 4 GB of vmem contributes a lot to the increased queue time.
That being said, I think this queue-time issue would apply to any global 3.1 NNLO fit at Edinburgh.
@scarrazza, sorry, I wasn't very responsive last week. I am generating the histogram of time consumption and will attach it here ASAP.
The correlation plot is there to check for any inconsistencies (changes in CPU usage); the arithmetic average time is 52 seconds per replica: time_consumption.pdf
For reference, this was for 4295 data points in total after kinematic cuts, using this run card: test_genrep.txt
@rabah-khalek great, thanks. It would be great if you could generate a similar plot with PR #279 and check whether the output from both codes is similar.
Here is the 3.1 fit using a single experiment: https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/ It seems acceptable in terms of performance (it takes the usual 30-40 h per replica) and it is statistically equivalent to the official 3.1.
Note that many actions load data from the fit directly and so are comparing exactly the same thing. The PDF plots should be fine and do look consistent. However, these are worrisome:
https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/#training-validation
You can see that we are getting less controlled uncertainties, which spills into the positivity plots:
and the replica plots.
https://vp.nnpdf.science/IX86w9irQA-fHWtWWFSrgQ==/figures/pdf_report_basespecs0_pdfscalespecs0_plot_pdfreplicas_u.pdf (see the lazy one going through the middle)
This is a full fit with 30k iterations and 100 replicas, so nothing much to say except that preprocessing and t0 may not be perfectly optimized.
Concerning the vp2 report, I also noticed that the chi2 is always identical to that of the base fit. I have specified a custom PDF set for the reference fit, so I would expect that:
but the last point is not working. Is that expected?
I guess the various actions in the report are `fits_chi2_table` and such, which take fits and load the PDF, theory and experiments based on them; that is what you want for `vp-comparefits`.
For more general comparisons there is e.g. `dataspecs_chi2_table` (see `validphys --help dataspecs_chi2_table`), which will resolve the symbols within "dataspecs" however you tell it. See:
https://data.nnpdf.science/validphys-docs/guide.html#general-data-specification-the-dataspec-api
Comparison to the 1000 replicas from 3.1 NNLO: https://vp.nnpdf.science/pygXMvLRS7y4FnkZX8xzAQ==
Seems a little surprising that a single systematic can do so much damage... Note how most of the outliers in the replica plots are concentrated on the bigexp fit, despite it having far fewer replicas.
If I understand correctly, there are 2 important differences when loading a single experiment:
1. `CORR` systematics across all datasets, for the covmat and the sampling.
2. `THEORYCORR` across the following datasets, just for the covmat:
SYSTYPE_ATLAS1JET11_10.dat:405 MULT THEORYCORR
SYSTYPE_ATLASR04JETS2P76TEV_10.dat:91 MULT THEORYCORR
SYSTYPE_ATLASR04JETS36PB_10.dat:92 MULT THEORYCORR
SYSTYPE_CDFR2KT_10.dat:26 MULT THEORYCORR
SYSTYPE_CMS1JET276TEV_10.dat:107 MULT THEORYCORR
SYSTYPE_CMSJETS11_10.dat:25 MULT THEORYCORR
I would guess that the gluon PDF changes are due to this second point, while the first point may explain the larger differences. What do you think?
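To make the first point concrete, here is a toy numpy sketch (made-up numbers, nothing to do with the actual libnnpdf code) of the difference between correlating a MULT CORR systematic only within each dataset and correlating it across all datasets, as the single big experiment would:

```python
# Toy sketch (made-up numbers, not the libnnpdf code): covariance for two datasets,
# each carrying one MULT CORR systematic, built first with CORR correlated only
# within its own dataset and then correlated across both, as in the big experiment.
import numpy as np

stat = np.array([1.0, 1.0, 1.0, 1.0])        # statistical errors, two points per dataset
sys_pct = np.array([5.0, 5.0, 5.0, 5.0])     # MULT CORR systematic, in percent
pred = np.array([100.0, 120.0, 80.0, 90.0])  # predictions used to convert percent -> absolute
sys_abs = sys_pct * pred * 1e-2              # absolute correlated uncertainty per point

def covmat(groups):
    """Add one fully correlated systematic per group of point indices."""
    cov = np.diag(stat ** 2)
    for idx in groups:
        beta = np.zeros_like(stat)
        beta[idx] = sys_abs[idx]
        cov += np.outer(beta, beta)
    return cov

cov_per_dataset = covmat([[0, 1], [2, 3]])   # CORR tied only within each dataset
cov_big_exp = covmat([[0, 1, 2, 3]])         # CORR tied across all datasets
print(cov_big_exp - cov_per_dataset)         # non-zero cross-dataset blocks appear
```

The second version picks up off-diagonal blocks between the two datasets, and the same correlation pattern would enter the pseudo-data sampling.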
What are `CORR` systematics? Didn't we check that we don't have any of these across different experiments?
I have just performed another pass, more carefully, and I think the current code is fine.
Let's try to remember the mechanism in `Experiment::PullData`:
https://github.com/NNPDF/nnpdf/blob/9da35f35b704683d8d0c884f61bb0ea3860994bd/libnnpdf/src/experiments.cc#L313
1. It builds the full list/map of systematics (`total_systematics` and `fSetSysMap`), including custom types, i.e. LUMI. If the systype is among `CORR`, `UNCORR`, `THEORYUNCORR`, `THEORYCORR` and `SKIP`, it is treated as a datapoint-specific uncorrelated systype.
2. It fills `fSys` for each datapoint using the previous list/map (we set a systype to 0 if a datapoint doesn't hold the particular type).

After this step, we move on to `Experiment::GenCovMat`:
https://github.com/NNPDF/nnpdf/blob/9da35f35b704683d8d0c884f61bb0ea3860994bd/libnnpdf/src/experiments.cc#L435
1. The diagonal starts from the statistical uncertainty: `fCovMat(i, i) = fStat[i] * fStat[i];`
2. For `systype != SKIP` we update the diagonal with `fCovMat(i,i) += diagsig + diagsignor * fT0Pred[i] * fT0Pred[i] * 1e-4;` (here we are introducing all systype contributions for a given datapoint).
3. For `systype != UNCORR && systype != THEORYUNCORR` we update the off-diagonal terms with all the `CORR`, `THEORYCORR` and custom systypes.

Could you please confirm this interpretation?
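If it helps to pin down the wording, here is a minimal numpy sketch of the covariance construction as read from the description above (names and data layout are illustrative only, not the actual `Experiment::GenCovMat` code):

```python
# Minimal sketch of the covariance construction described above; illustrative only.
import numpy as np

def gen_covmat(stat, systematics, t0_pred):
    """stat: (ndata,) statistical errors.
    systematics: list of (systype, add, mult), where add (ndata,) are absolute additive
      uncertainties and mult (ndata,) multiplicative uncertainties in percent; a point
      that does not carry a given systematic has 0 there.
    t0_pred: (ndata,) t0 predictions used to normalise the MULT part."""
    stat = np.asarray(stat, dtype=float)
    t0_pred = np.asarray(t0_pred, dtype=float)
    cov = np.diag(stat ** 2)                               # fCovMat(i,i) = fStat[i]^2
    for systype, add, mult in systematics:
        if systype == "SKIP":
            continue
        add = np.asarray(add, dtype=float)
        mult = np.asarray(mult, dtype=float)
        # diagonal: every non-SKIP systype contributes for its own datapoint
        cov += np.diag(add ** 2 + mult ** 2 * t0_pred ** 2 * 1e-4)
        if systype not in ("UNCORR", "THEORYUNCORR"):
            # off-diagonal: only CORR, THEORYCORR and custom types correlate points
            offdiag = np.outer(add, add) + np.outer(mult, mult) * np.outer(t0_pred, t0_pred) * 1e-4
            cov += offdiag - np.diag(np.diag(offdiag))     # diagonal already added above
    return cov
```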
Here are the comparisons between:
Conclusion: more evidence that the bigexp is potentially bugged.
Not sure where to document this, but today I tried to make a start on understanding the issues with one big experiment.
I was trying to run `vp-setupfit` on both runcards, edited so that they generate pseudo data. My hope was that if the problem is in the sampling of the covmat, then after generating some pseudo data I could see it by examining the per-replica chi2 using fit common data. Having said that, `vp-setupfit` is really hanging on the big-experiment one, so I'm not sure if it will finish. I will continue looking at this tomorrow.
Ok, so I did what I said I would do; however, I then realised that it wasn't very useful, since I essentially have a single replica and therefore no statistics, so I have no idea whether the difference in chi2 is meaningful or not:
chi2 between predictions from `181023-001-sc` and level-1 data generated for many experiments: 0.9963246694831946
chi2 between predictions from `181023-001-sc` and level-1 data generated for one experiment: 1.0404018331018645
Both chi2s should probably be of order 1. The deviation from 1 is an order of magnitude larger for the one-experiment data, but I would need many more replicas to draw a conclusion.
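As a rough yardstick (assuming the same ~4295 points after cuts as above, and that the chi2 uses the same covariance as the sampling), the per-point chi2 of a single replica fluctuates as

$$ \sigma\!\left(\chi^2/N\right) \simeq \sqrt{2/N} \approx \sqrt{2/4295} \approx 0.022, $$

so 0.9963 is well within one sigma of 1, while 1.0404 sits roughly two sigma high: suggestive, but indeed not conclusive from a single replica.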
Sampling of the covmat definitely could be a source of error then, but I think it might be easier to find the bug in the code than to construct a more elaborate way of testing it.
@scarrazza I also have a breakdown by dataset here.
In particular there is a chi2 of about 6.5 for `ATLASR04JETS2P76TEV` when comparing to the one-experiment level-1 data; it's unclear whether this is evidence of the bug or just what we could expect statistically.
Thanks, so you have computed the breakdown using both of my fits, right? (bigexp and 3.1master)
Yeah, this jet dataset contains the special `TH*` systematics, so maybe that's the problem... in which case we probably have a big problem in the artificial data generation.
Sorry, the headings are quite complicated; I will try to explain. I took the runcards for both of your fits and changed them minimally into level-1 closure-test runcards, then ran `vp-setupfit` on these two new runcards (so essentially generating a single replica of pseudo data). Then I take the PDF from your 3.1master fit (although I'm thinking that actually I should use the fakepdf, NNPDF31_nnlo_as_0118) and calculate the chi2 for both of my fake closure tests using fit common data. We would expect the total chi2 to be 1 if the data was generated correctly. I also provide the same information broken down per dataset.
Does this make sense? It's a bit convoluted, but the point is that the data generated for a single replica of `ATLASR04JETS2P76TEV` has a much higher chi2 than one would expect (chi2 per data point of ~6 for 59 data points).
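For reference, the check described above amounts to something like the following standalone numpy sketch (illustrative only, not the validphys machinery; `theory` would be the central predictions from the PDF used to generate the data and `cov` the experimental covariance used for the sampling):

```python
# Sketch of the level-1 check described above (illustrative, not the validphys code):
# draw level-1 pseudo data from the experimental covariance and compute the chi2 of
# the generating predictions against it; with consistent covariances it should be ~1.
import numpy as np

def level1_chi2(theory, cov, nreps=100, seed=0):
    """theory: (ndata,) central predictions; cov: (ndata, ndata) experimental covmat."""
    rng = np.random.default_rng(seed)
    ndata = len(theory)
    inv_cov = np.linalg.inv(cov)
    chi2s = np.empty(nreps)
    for k in range(nreps):
        # level-1 data: the "true" predictions shifted by correlated experimental noise
        data = rng.multivariate_normal(theory, cov)
        diff = data - theory
        chi2s[k] = diff @ inv_cov @ diff / ndata
    # mean ~ 1 with per-replica spread ~ sqrt(2/ndata) if everything is consistent
    return chi2s.mean(), chi2s.std(ddof=1) / np.sqrt(nreps)
```

If the covariance used for sampling and the one used in the chi2 agree, this comes out at ~1 per point; a chi2 per point of ~6 for 59 points would then sit far outside the expected √(2/59) ≈ 0.18 spread.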
Ok, I see, yes it sounds like an interesting test.
Ok, so I have slightly refined the process of generating replicas and comparing to the underlying PDF. The replicas take a long time to generate (especially for the big experiment), but I will generate 100 replicas' worth of data and produce the same information as above, this time with statistics; since the issue is quite subtle, I think the difference won't be apparent without a decent number of replicas. I should have results this afternoon.
I have performed 2 DIS-only fits (3.1 NNLO setup):
Here is the report: https://vp.nnpdf.science/UNUUK4CjRvmIlps-dXV14Q==/ The differences look much smaller than for a full 3.1-dataset fit.
It doesn't seem like the same thing:
On Friday I made the table which I said I would do; however, it occurred to me that comparing chi2 per dataset is not the most sensible thing: datasets where the systematics are an important contribution to the covariance matrix have a poor chi2 for both of the original two fits. I think generating the data as per each fit and then calculating the chi2 by experiment, where experiments are defined in the standard way, would give a more reliable comparison, since it's not clear to me whether comparing the chi2s makes any sense if we drop the inter-dataset systematics.
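To spell out the worry in formulas (nothing specific to our code, just the standard decomposition): writing $r_d$ for the data−theory differences restricted to dataset $d$,

$$
\chi^2_{\rm exp} \;=\; \sum_{d,d'} r_d^{\,T}\,\big(C^{-1}\big)_{dd'}\, r_{d'},
\qquad
\chi^2_{d} \;=\; r_d^{\,T}\,\big(C_{dd}\big)^{-1} r_d ,
$$

and $\sum_d \chi^2_d \neq \chi^2_{\rm exp}$ as soon as the off-diagonal blocks $C_{dd'}$ are non-zero, so per-dataset numbers computed from the diagonal blocks alone are blind to exactly the cross-dataset correlations that differ between the two setups.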
I also think it's curious that the positivity sets are very comparable between the fits if one uses robust statistics - report to follow.
Are we sure there is actually a problem? I was speaking to Luigi about it and he was curious whether, if there were a 1000-replica fit with one experiment, we would see the same proportion of outliers in the E_valid vs E_train plot, or whether the outliers we see in the 100-replica fit are just unfortunate for that run.
So actually, when I use full quantile stats, the positivity plots look incredibly similar:
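A minimal sketch of the kind of comparison meant here, assuming "full quantile stats" refers to a 68% band taken from the replica quantiles rather than mean ± std (illustrative, not the validphys plotting code):

```python
# Compare the Gaussian mean +/- std band with the 68% band taken directly from
# the replica quantiles, which is robust against a few outlier replicas.
import numpy as np

def bands(replica_values):
    """replica_values: (nrep, npoints) array of some quantity evaluated per replica."""
    mean = replica_values.mean(axis=0)
    std = replica_values.std(axis=0)
    lo, hi = np.percentile(replica_values, [16, 84], axis=0)  # central 68% band
    return (mean - std, mean + std), (lo, hi)
```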
@wilsonmr you are right that it is better to compare by experiment, but I am still curious to see the table per dataset. It is a bad idea to read too much into the per-dataset chi2 (the alphas paper is a good proof of that), but still it is interesting to see if we can find some difference when sampling. Note that the assumption is that the two things are exactly statistically equivalent and it doesn't really matter if the thing we are looking at is a bit crazy (also in practice we do get chi2~1 for the dataset chi2 in level1). So this is a long way to say that the table will be interesting.
As for the fits, I take your point that the DIS one may be bad luck, but the hadronic ones are clearly different: there are more outliers in the ~100 single exp replicas than in the 1000 original ones.
Well, here is the table; I can make a neater version.
They appear consistent within error
I am going to bet that
`ATLASR04JETS2P76TEV 1.854520 0.287730 2.018156 0.423038`
is not a fluke. In any case, it seems an incredible fluke that you got such a bad chi² for the one replica.
@wilsonmr What is the status here?
Well, I agree that those things aren't statistically the same; I'm not really sure where to go from here. Do we want more tests of some description? Apologies, I just had teaching and meetings today, so progress was limited at best.
I made some plots of the actual distributions of chi2 per dataset, with histograms and KDE plots depending on preference: https://vp.nnpdf.science/bEWJsYXdROyKmG4ZUV1tMQ==
`ATLASR04JETS2P76TEV 1.854520 0.287730 2.018156 0.423038`, for example, seems to be explained by a single outlier among the big-experiment replicas.
@scarrazza did you manage to run the big experiment fit with more replicas?
Yes, this morning I uploaded 380 replicas (fit 181115-001-sc); I am waiting for the vp2 report to complete...
Here is the report: https://vp.nnpdf.science/UwRHr5VCR7yI-SXHPt_sSQ==
I think we can seriously consider the idea that the code is fine.
Great!
Indeed I wouldn't have said there is a problem based on this report. So maybe it was a bad fluke after all. I can't help noticing there is still a weird outlier in the down quark, but that should not indicate a bug by itself. Could we see a report with the two bigexp things?
Here is the report comparing both bigexp fits: https://vp.nnpdf.science/KTzrle5FQGuuBdcigkDKnQ==/
I think this can now be closed!
Let's summarize here the steps discussed in Milan in order to obtain a first prototype fit with theory uncertainties:
- `vp-setupfit` (#244 & #284)