NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

Experimental covmats are singular or not positive definite #523

Closed wilsonmr closed 6 months ago

wilsonmr commented 5 years ago

Since I am currently tasked with producing this estimator for phi, I have been looking at the eigenvalues of the covariance matrices of various experiments.

It has come to my attention that the covariance matrices for ~ATLAS~ and CMS have negative eigenvalues; as such, I cannot diagonalise them.

EDIT: Just CMS

the offending collection of datasets is:

CMSWEASY840PB
CMSWMASY47FB
CMSDY2D11
CMSWMU8TEV
CMSJETS11
CMSZDIFF12
CMSTTBARTOT
CMSTOPDIFF8TEVTTRAPNORM

which in the fit runcard are:

  - {dataset: CMSWEASY840PB, frac: 1.0}
  - {dataset: CMSWMASY47FB, frac: 1.0}
  - {dataset: CMSDY2D11, frac: 0.5}
  - {dataset: CMSWMU8TEV, frac: 1.0}
  - {dataset: CMSJETS11, frac: 0.5}
  - {dataset: CMSZDIFF12, frac: 1.0, cfac: [NRM]}
  - {dataset: CMSTTBARTOT, frac: 1.0}
  - {dataset: CMSTOPDIFF8TEVTTRAPNORM, frac: 1.0}

I suppose I could break this down by dataset, although with correlations between datasets I'm not sure whether this would identify a specific dataset or be much help.

Is this a known problem, @Zaharid @enocera? Note that this happens both with and without the cuts used in the fits with theory covmat, and persists even with the t0 covmat AFAICT.

wilsonmr commented 5 years ago

So to narrow it down I did:

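import logging

import numpy as np
from reportengine import collect

log = logging.getLogger(__name__)

# Group by experiment; otherwise we just have BIGEXP for the fit I'm looking at.
# experiment_covariance_matrix is assumed to be in scope (e.g. same validphys module).
exp_covs = collect(
    experiment_covariance_matrix,
    ('fit_context_groupby_experiment', 'experiments'))
exp_test = collect('experiments', ('fit_context_groupby_experiment',))  # likewise

def test_covmats(exp_covs, exp_test):
    """Log which experiments have a covmat with at least one negative eigenvalue."""
    exp_test = exp_test[0]
    for cov_tup, exp in zip(exp_covs, exp_test):
        cov, _ = cov_tup
        eig, _ = np.linalg.eigh(cov)
        if np.any(eig < 0):
            log.error(f"negative eig: {exp.name}")
        else:
            log.info(f"{exp.name} OK!")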

with runcard

fit: 190315_ern_nlo_central_163_global
use_cuts: fromfit
actions_:
 - test_covmats

giving

[INFO]: All requirements processed and checked successfully. Executing actions.
[INFO]: NMC OK!
[INFO]: SLAC OK!
[INFO]: BCDMS OK!
[INFO]: CHORUS OK!
[INFO]: NTVDMN OK!
[INFO]: HERACOMB OK!
[INFO]: HERAF2CHARM OK!
[INFO]: CDF OK!
[INFO]: D0 OK!
[INFO]: ATLAS OK!
[ERROR]: negative eig: CMS
[INFO]: LHCb OK!

wilsonmr commented 5 years ago

as expected if I do the same exercise by dataset then there isn't a problem:

[INFO]: CMSWEASY840PB OK!
[INFO]: CMSWMASY47FB OK!
[INFO]: CMSDY2D11 OK!
[INFO]: CMSWMU8TEV OK!
[INFO]: CMSJETS11 OK!
[INFO]: CMSZDIFF12 OK!
[INFO]: CMSTTBARTOT OK!
[INFO]: CMSTOPDIFF8TEVTTRAPNORM OK!

so it's clearly related to correlations between datasets which makes the experiment covmat singular

enocera commented 5 years ago

And the only thing which is correlated across datasets is the luminosity uncertainty (i.e. all CMS 7 TeV data share the same lumi, and all CMS 8 TeV data share the same lumi).

wilsonmr commented 5 years ago

Ok so despite eigh giving negative values, np.linalg.inv and np.linalg.cholesky both succeed so it's just a numerical stability issue within eigh. It's quite annoying for this test that I want to run though
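
The observation above suggests testing positive definiteness through a Cholesky factorisation rather than through the sign of the eigenvalues returned by `eigh`. A minimal sketch, assuming a symmetric input; the helper name `is_positive_definite` and the toy matrices are made up for illustration and are not part of validphys:

```python
import numpy as np

def is_positive_definite(cov):
    """Return True if the symmetric matrix `cov` admits a Cholesky factorisation."""
    try:
        np.linalg.cholesky(cov)
    except np.linalg.LinAlgError:
        return False
    return True

# Toy matrices (not NNPDF data): one positive definite, one indefinite.
good = np.array([[2.0, 0.5], [0.5, 1.0]])
bad = np.array([[1.0, 2.0], [2.0, 1.0]])  # exact eigenvalues are 3 and -1

print(is_positive_definite(good), is_positive_definite(bad))
```

A matrix that passes this check can still be badly conditioned, so it does not replace looking at the size of the smallest eigenvalues.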

Zaharid commented 5 years ago

Yeah, if you use scipy.linalg instead of numpy.linalg it does give all positive numbers. See:

https://vp.nnpdf.science/nDDQcI9ESW-Cwl8PKYc0MA==

Zaharid commented 5 years ago

That said, those numbers are garbage in either case and we should investigate what they are. This is surely affecting the fit negatively.

Zaharid commented 5 years ago

I have also found numpy.linalg.eigh to be unreliable in the past.

wilsonmr commented 5 years ago

Ok I'll bear that in mind, but as you say it's weird that there are such small eigenvalues in that part of the covmat, 5.31165595e-12 according to your notebook

Zaharid commented 5 years ago

So we are getting different results (as defined by np.isclose) for many of the roughly 110 smallest eigenvalues. That is a bit worrisome indeed:

import numpy as np

import scipy.linalg as sla

svals, svects = sla.eigh(d['CMS'])

import numpy.linalg as nla
nvals, nvects = nla.eigh(d['CMS'])

np.where(~np.isclose(svals, nvals))

(array([  0,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
         41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
         54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
         67,  68,  69,  71,  72,  73,  74,  75,  77,  79,  80,  81,  82,
         83,  84,  85,  86,  88,  89,  90,  91,  92,  96,  99, 101, 102,
        103, 107, 109, 110]),)

It might be that part of the problem is that we don't do any normalization of the units so the entries of the covmat end up looking very weird.
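
One way to probe the units hypothesis is to rescale the covariance matrix to a correlation matrix, which has unit diagonal, and compare condition numbers. This is only a diagnostic sketch: the helper name and the toy numbers are assumptions, and the eigenvalues of the correlation matrix are not those of the original covmat.

```python
import numpy as np

def correlation_from_covariance(cov):
    """Rescale a covariance matrix to unit diagonal: corr_ij = cov_ij / (s_i * s_j)."""
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

# Toy covariance (made up): two points whose uncertainties differ by six
# orders of magnitude, with a modest correlation of 0.3 between them.
sigma = np.array([1e-3, 1e3])
corr_true = np.array([[1.0, 0.3], [0.3, 1.0]])
cov = corr_true * np.outer(sigma, sigma)

corr = correlation_from_covariance(cov)
print(np.linalg.cond(cov))   # very large, driven purely by the choice of units
print(np.linalg.cond(corr))  # order one
```

If the small eigenvalues survive this rescaling, they come from the correlation structure rather than from the units.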

wilsonmr commented 5 years ago

Weird! I wonder if that is even confined to CMS... I only picked it out because nla.eigh gave negative, but perhaps the positive eigenvalues are also problematic

wilsonmr commented 5 years ago

Nope just counting the length of the array both ATLAS and CMS offend:

NMC: 0
SLAC: 0
BCDMS: 0
CHORUS: 0
NTVDMN: 0
HERACOMB: 0
HERAF2CHARM: 0
CDF: 0
D0: 0
ATLAS: 25
CMS: 67
LHCb: 0

Not sure why we disagree on CMS, I think you get 69 I get 67, but perhaps the tolerance is set slightly different?

Zaharid commented 5 years ago

So the smallest eigenvalues of CMS seem to be concentrated around CMSDY2D11. I did a hack to get experiments_index and then something like:

from validphys import results

from validphys.api import API

from reportengine import collect

results.eindex_test  = collect('experiments_index', ("fit_context_groupby_experiment",))

ind = API.eindex_test(**inp)[0]

import pandas as pd

frame = pd.Series(svects[0], index=ind[ind.get_loc('CMS')])
#Not sure why I need dropna here...
frame[(frame > 0)].dropna()
experiment  dataset    id
CMS         CMSDY2D11  58    0.274571
                       60    0.894570
                       63    0.197922
                       64    0.172288
                       76    0.032483
                       78    0.047970
                       82    0.037216

Zaharid commented 5 years ago

The differences can come down to processor architecture.

RosalynLP commented 5 years ago

> So the smallest eigenvalues of CMS seem to be concentrated around CMSDY2D11.

This was one of the data-sets removed for the theory covariance fits though, and I thought @wilsonmr originally noticed this issue with the theory covmat collection of datasets, right?

wilsonmr commented 5 years ago

urm it's still present here:

https://vp.nnpdf.science/BhTwcSdlRO63HQjgGKtO4g==/#table

RosalynLP commented 5 years ago

Yes you're right, I made a mistake when I sent the list of dropped datasets to you, and in fact we didn't drop it, well done for noticing!

wilsonmr commented 5 years ago

well I'm using the from_: fit key and the fit runcard was written by @enocera not me and so it has an exponentially higher chance of being correct! :wink:

wilsonmr commented 5 years ago

@Zaharid I don't understand frame[(frame > 0)].dropna(). What if the eigenvector has large-magnitude negative values? That is allowed, right?

Zaharid commented 5 years ago

Yeah. Quick test, that's all.

wilsonmr commented 5 years ago

oh ok

wilsonmr commented 5 years ago

Probably at some point I should make this into a validphys report so it's more readable, but here I do the same test except: I only look at eigenvectors which fail the closeness test between numpy and scipy, and I do `frame = pd.Series(abs(vecs[i]), index=ind[ind.get_loc(exp.name)]); print(frame.nlargest(n=10))`

we can see quite easily which datasets have the largest contributions

test results

```
experiment dataset id ATLAS ATLASTOPDIFF8TEVTRAPNORM 3 0.729361 4 0.559894 9 0.245972 6 0.230264 5 0.128894 2 0.085318 7 0.084124 ATLASZPT8TEVYDIST 47 0.059196 ATLASTOPDIFF8TEVTRAPNORM 1 0.051277 ATLASZPT8TEVYDIST 45 0.036370 dtype: float64
experiment dataset id ATLAS ATLASTOPDIFF8TEVTRAPNORM 4 0.680544 6 0.538791 3 0.268504 9 0.255776 5 0.196430 7 0.163837 ATLASZPT8TEVYDIST 47 0.112769 ATLASTOPDIFF8TEVTRAPNORM 1 0.101822 0 0.090720 ATLASTTBARTOT 1 0.053731 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 45 0.625204 44 0.584783 47 0.283722 ATLASTOPDIFF8TEVTRAPNORM 9 0.258195 ATLASZPT8TEVYDIST 46 0.157372 ATLASTOPDIFF8TEVTRAPNORM 3 0.147736 ATLASTTBARTOT 1 0.132220 ATLASZPT8TEVYDIST 41 0.128232 ATLASTOPDIFF8TEVTRAPNORM 6 0.119615 4 0.082330 dtype: float64
experiment dataset id ATLAS ATLASTOPDIFF8TEVTRAPNORM 7 0.942518 9 0.263952 2 0.114237 3 0.113829 4 0.092645 1 0.054781 ATLASZPT8TEVYDIST 41 0.035461 ATLASTTBARTOT 2 0.035167 ATLASZPT8TEVYDIST 42 0.029014 38 0.014963 dtype: float64
experiment dataset id ATLAS ATLASTTBARTOT 2 0.659035 ATLASTOPDIFF8TEVTRAPNORM 1 0.428368 3 0.312726 0 0.263765 9 0.261533 2 0.191671 4 0.161435 ATLASZPT8TEVYDIST 47 0.155820 ATLASTTBARTOT 0 0.145992 ATLASTOPDIFF8TEVTRAPNORM 6 0.119340 dtype: float64
experiment dataset id ATLAS ATLASTTBARTOT 1 0.580683 ATLASZPT8TEVYDIST 47 0.508985 ATLASTTBARTOT 0 0.338398 2 0.292550 ATLASTOPDIFF8TEVTRAPNORM 9 0.251125 3 0.210996 ATLASZPT8TEVYDIST 46 0.193405 ATLASTOPDIFF8TEVTRAPNORM 1 0.149503 5 0.123230 ATLASZPT8TEVYDIST 45 0.077980 dtype: float64
experiment dataset id ATLAS ATLASTTBARTOT 1 0.601667 2 0.591356 ATLASTOPDIFF8TEVTRAPNORM 3 0.246604 9 0.234471 ATLASZPT8TEVYDIST 46 0.210769 ATLASTOPDIFF8TEVTRAPNORM 1 0.168631 ATLASTTBARTOT 0 0.157981 ATLASTOPDIFF8TEVTRAPNORM 2 0.152068 6 0.103809 4 0.098118 dtype: float64
experiment dataset id ATLAS ATLASTTBARTOT 0 0.771088 1 0.316196 ATLASTOPDIFF8TEVTRAPNORM 2 0.288195 6 0.248058 ATLASZPT8TEVYDIST 46 0.206742 ATLASTOPDIFF8TEVTRAPNORM 9 0.178833 ATLASZPT8TEVYDIST 45 0.144793 44 0.127024 ATLASTTBARTOT 2 0.106952 ATLASTOPDIFF8TEVTRAPNORM 7 0.097075 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 43 0.671759 44 0.352487 41 0.351525 45 0.309835 42 0.229632 ATLASTOPDIFF8TEVTRAPNORM 6 0.197957 2 0.182670 9 0.175827 ATLASZPT8TEVYDIST 46 0.103514 ATLASTOPDIFF8TEVTRAPNORM 7 0.101824 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 43 0.720239 44 0.440413 45 0.305498 ATLASTOPDIFF8TEVTRAPNORM 6 0.213053 9 0.181908 2 0.176189 ATLASZPT8TEVYDIST 41 0.148032 42 0.142164 ATLASTOPDIFF8TEVTRAPNORM 7 0.104759 ATLASZPT8TEVYDIST 46 0.099620 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 46 0.595793 ATLASTTBARTOT 0 0.443099 ATLASZPT8TEVYDIST 47 0.378402 45 0.224023 44 0.217291 ATLASTOPDIFF8TEVTRAPNORM 2 0.209690 9 0.177133 6 0.171249 5 0.152190 0 0.148049 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 38 0.867534 41 0.254110 37 0.178233 39 0.174103 ATLASTOPDIFF8TEVTRAPNORM 9 0.168856 ATLASZPT8TEVYDIST 42 0.155548 40 0.148522 ATLASTOPDIFF8TEVTRAPNORM 6 0.112366 5 0.105464 2 0.064213 dtype: float64
experiment dataset id ATLAS ATLASTOPDIFF8TEVTRAPNORM 2 0.671610 5 0.532857 6 0.272255 0 0.268275 1 0.230830 9 0.159631 ATLASZPT8TEVYDIST 41 0.110974 ATLASTTBARTOT 2 0.076206 ATLASTOPDIFF8TEVTRAPNORM 3 0.063200 ATLASTTBARTOT 1 0.058646 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 39 0.810847 40 0.328577 41 0.326724 42 0.196068 ATLASTOPDIFF8TEVTRAPNORM 9 0.155740 6 0.135116 ATLASZPT8TEVYDIST 38 0.105515 ATLASTOPDIFF8TEVTRAPNORM 5 0.100139 ATLASZPT8TEVYDIST 43 0.062699 46 0.062262 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 40 0.864167 41 0.319651 42 0.239603 ATLASTOPDIFF8TEVTRAPNORM 9 0.152560 6 0.145007 5 0.104004 ATLASZPT8TEVYDIST 38 0.073675 45 0.071398 ATLASTOPDIFF8TEVTRAPNORM 2 0.067325 ATLASZPT8TEVYDIST 39 0.059604 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 37 0.821406 39 0.338167 41 0.264937 42 0.177319 40 0.167301 ATLASTOPDIFF8TEVTRAPNORM 6 0.148440 9 0.145086 ATLASZPT8TEVYDIST 38 0.116319 ATLASTOPDIFF8TEVTRAPNORM 5 0.085683 2 0.054844 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 37 0.534952 38 0.458258 39 0.438652 41 0.360762 40 0.296734 42 0.166737 ATLASTOPDIFF8TEVTRAPNORM 9 0.135516 6 0.119984 5 0.103472 ATLASZPT8TEVYDIST 43 0.065014 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 28 0.818786 33 0.432795 29 0.297498 35 0.191022 36 0.084876 25 0.074294 ATLASTOPDIFF8TEVTRAPNORM 9 0.049256 ATLASZPT8TEVYDIST 42 0.024238 ATLASTTBARTOT 1 0.021205 ATLASZPT8TEVYDIST 47 0.018737 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 26 0.732869 33 0.412450 28 0.386900 29 0.280207 35 0.197644 25 0.114442 36 0.088045 ATLASTOPDIFF8TEVTRAPNORM 9 0.049360 ATLASZPT8TEVYDIST 42 0.024648 ATLASTTBARTOT 1 0.022794 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 25 0.711612 26 0.407459 33 0.403468 28 0.281427 29 0.205875 35 0.178952 36 0.082970 ATLASTOPDIFF8TEVTRAPNORM 9 0.047405 ATLASZPT8TEVYDIST 42 0.023951 ATLASTTBARTOT 1 0.021695 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 25 0.680796 26 0.529098 33 0.318057 28 0.305061 35 0.215476 36 0.103172 ATLASTOPDIFF8TEVTRAPNORM 9 0.044539 ATLASZPT8TEVYDIST 42 0.036083 ATLASTTBARTOT 1 0.022832 ATLASZPT8TEVYDIST 47 0.015720 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 29 0.874623 35 0.291823 33 0.284577 36 0.164803 26 0.129519 25 0.100384 28 0.085726 42 0.055806 ATLASTOPDIFF8TEVTRAPNORM 9 0.042690 ATLASZPT8TEVYDIST 41 0.032643 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 35 0.752332 33 0.538313 36 0.299623 29 0.154844 42 0.132987 41 0.078111 ATLASTOPDIFF8TEVTRAPNORM 9 0.040591 ATLASTTBARTOT 1 0.039737 ATLASZPT8TEVYDIST 25 0.036748 47 0.017491 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 42 0.644744 36 0.519378 41 0.386075 35 0.380993 47 0.070945 ATLASTTBARTOT 1 0.054683 ATLASZPT8TEVYDIST 40 0.045053 ATLASTOPDIFF8TEVTRAPNORM 6 0.044923 ATLASZPT8TEVYDIST 46 0.042533 37 0.040579 dtype: float64
experiment dataset id ATLAS ATLASZPT8TEVYDIST 36 0.758673 42 0.507255 41 0.336992 35 0.199823 47 0.057435 ATLASTOPDIFF8TEVTRAPNORM 6 0.051624 ATLASZPT8TEVYDIST 46 0.037025 44 0.033436 33 0.031427 40 0.028877 dtype: float64
experiment dataset id CMS CMSDY2D11 60 0.894570 58 0.274571 68 0.223066 63 0.197922 64 0.172288 78 0.047970 82 0.037216 76 0.032483 75 0.031622 CMSWEASY840PB 0 0.000000 dtype: float64
experiment dataset id CMS CMSJETS11 119 0.684794 118 0.295148 113 0.269687 117 0.255198 126 0.224697 CMSTTBARTOT 1 0.211730 CMSJETS11 114 0.207994 110 0.195084 128 0.192021 CMSZDIFF12 26 0.143498 dtype: float64
experiment dataset id CMS CMSJETS11 111 0.615271 110 0.481975 128 0.240569 CMSTTBARTOT 1 0.214990 CMSJETS11 112 0.199560 113 0.194054 122 0.192798 126 0.164907 CMSZDIFF12 26 0.148337 CMSJETS11 115 0.143043 dtype: float64
experiment dataset id CMS CMSJETS11 116 0.561971 112 0.388954 111 0.369449 117 0.316636 114 0.275778 128 0.232041 CMSTTBARTOT 1 0.202395 CMSJETS11 109 0.147717 127 0.144629 CMSZDIFF12 26 0.140320 dtype: float64
experiment dataset id CMS CMSJETS11 119 0.580527 121 0.348193 117 0.327829 113 0.291656 112 0.230297 128 0.211182 CMSTTBARTOT 1 0.188731 CMSJETS11 116 0.184696 114 0.175685 CMSZDIFF12 26 0.140842 dtype: float64
experiment dataset id CMS CMSJETS11 109 0.481118 108 0.348693 115 0.317046 112 0.307552 117 0.290315 114 0.207036 106 0.173529 CMSTTBARTOT 1 0.172898 CMSJETS11 128 0.164557 116 0.163442 dtype: float64
experiment dataset id CMS CMSJETS11 114 0.687945 116 0.460079 112 0.233563 131 0.201163 109 0.147733 CMSTTBARTOT 1 0.141909 CMSJETS11 125 0.135099 119 0.132449 128 0.123096 126 0.119547 dtype: float64
experiment dataset id CMS CMSJETS11 110 0.483794 111 0.440118 108 0.312420 131 0.309667 126 0.271517 115 0.193023 113 0.189323 121 0.175273 107 0.173958 114 0.168551 dtype: float64
experiment dataset id CMS CMSJETS11 115 0.535797 116 0.366486 112 0.352558 131 0.329331 113 0.285095 109 0.249899 126 0.223310 110 0.204619 117 0.147213 132 0.116504 dtype: float64
experiment dataset id CMS CMSJETS11 110 0.465941 112 0.381188 114 0.354697 111 0.339084 131 0.326988 116 0.255772 119 0.199950 113 0.199677 107 0.137325 132 0.125255 dtype: float64
experiment dataset id CMS CMSJETS11 109 0.651531 115 0.359724 131 0.301100 112 0.247534 110 0.223227 107 0.201082 114 0.194801 117 0.189484 113 0.121831 126 0.121496 dtype: float64
experiment dataset id CMS CMSJETS11 108 0.615954 107 0.354899 109 0.336797 106 0.329982 131 0.262044 110 0.252091 126 0.151509 115 0.122417 100 0.116045 132 0.101482 dtype: float64
experiment dataset id CMS CMSJETS11 107 0.649015 106 0.422304 108 0.365031 126 0.243935 131 0.235157 104 0.231704 98 0.103360 102 0.094068 132 0.093849 128 0.093034 dtype: float64
experiment dataset id CMS CMSJETS11 103 0.683236 104 0.410130 126 0.328400 131 0.170576 100 0.169460 98 0.168650 101 0.168400 112 0.114298 115 0.109461 128 0.106451 dtype: float64
experiment dataset id CMS CMSJETS11 103 0.484984 102 0.466594 126 0.361016 100 0.306564 107 0.240052 101 0.184616 104 0.160483 115 0.140650 112 0.132636 99 0.131994 dtype: float64
experiment dataset id CMS CMSJETS11 102 0.449170 107 0.443898 126 0.351525 100 0.336251 104 0.266161 98 0.250148 101 0.218595 106 0.186617 128 0.139691 121 0.125275 dtype: float64
experiment dataset id CMS CMSJETS11 93 0.703172 99 0.346080 125 0.296953 92 0.259708 94 0.245993 117 0.174274 101 0.132343 115 0.129301 100 0.115055 97 0.110459 dtype: float64
experiment dataset id CMS CMSJETS11 93 0.524346 99 0.501872 95 0.347810 125 0.280220 100 0.199373 92 0.194865 117 0.186832 98 0.153996 115 0.151733 105 0.139221 dtype: float64
experiment dataset id CMS CMSJETS11 105 0.742814 125 0.310604 100 0.255589 117 0.244085 99 0.228621 115 0.175565 106 0.154271 126 0.148602 116 0.135422 124 0.110745 dtype: float64
experiment dataset id CMS CMSJETS11 105 0.633842 99 0.398009 100 0.394669 125 0.275454 101 0.260149 117 0.203134 115 0.147869 116 0.125774 126 0.119876 106 0.083920 dtype: float64
experiment dataset id CMS CMSJETS11 117 0.539063 125 0.517660 124 0.371478 115 0.310730 116 0.260312 118 0.221135 126 0.171666 123 0.131975 114 0.106635 121 0.087006 dtype: float64
experiment dataset id CMS CMSJETS11 95 0.642818 97 0.495521 100 0.291842 96 0.233579 92 0.220725 125 0.161470 115 0.140694 90 0.110354 126 0.107384 106 0.093733 dtype: float64
experiment dataset id CMS CMSJETS11 97 0.576728 100 0.460299 98 0.354917 99 0.341595 94 0.217234 106 0.138285 108 0.124213 125 0.121852 102 0.118014 91 0.114435 dtype: float64
experiment dataset id CMS CMSJETS11 106 0.620143 104 0.546385 100 0.242751 108 0.242194 109 0.182888 115 0.175762 103 0.167532 99 0.129333 112 0.124451 102 0.116998 dtype: float64
experiment dataset id CMS CMSJETS11 90 0.860241 97 0.207212 99 0.179976 100 0.177680 95 0.160068 92 0.141811 102 0.122507 108 0.117984 106 0.113113 94 0.098940 dtype: float64
experiment dataset id CMS CMSJETS11 91 0.516935 94 0.483850 92 0.387652 93 0.308068 90 0.290079 106 0.195702 108 0.155440 99 0.141083 102 0.127574 100 0.127303 dtype: float64
experiment dataset id CMS CMSJETS11 94 0.555166 95 0.415776 97 0.367277 90 0.277764 93 0.230455 98 0.183790 89 0.162932 99 0.153530 102 0.152644 92 0.146676 dtype: float64
experiment dataset id CMS CMSJETS11 102 0.568467 104 0.498121 103 0.466558 108 0.244852 106 0.236249 100 0.157527 99 0.125982 126 0.109459 118 0.083401 111 0.061414 dtype: float64
experiment dataset id CMS CMSJETS11 92 0.624378 94 0.516572 95 0.316151 93 0.191494 99 0.183226 91 0.169794 90 0.143558 89 0.139160 102 0.136628 126 0.127543 dtype: float64
experiment dataset id CMS CMSJETS11 89 0.945522 126 0.122727 95 0.113988 86 0.112871 88 0.097496 87 0.088064 91 0.080298 99 0.067157 98 0.061217 97 0.056683 dtype: float64
experiment dataset id CMS CMSJETS11 87 0.703706 88 0.596004 86 0.292743 85 0.114186 126 0.093364 95 0.091961 99 0.074750 98 0.064702 108 0.052343 92 0.042197 dtype: float64
experiment dataset id CMS CMSJETS11 101 0.861930 99 0.296806 102 0.235313 98 0.190796 104 0.187595 100 0.105959 108 0.049724 93 0.049672 126 0.045552 85 0.042132 dtype: float64
experiment dataset id CMS CMSJETS11 86 0.689815 88 0.630320 87 0.282664 89 0.097650 98 0.079625 91 0.058504 104 0.057780 107 0.045684 90 0.044033 131 0.043458 dtype: float64
experiment dataset id CMS CMSJETS11 85 0.971801 86 0.128568 88 0.093784 82 0.076263 102 0.055178 108 0.053955 104 0.049990 131 0.043884 80 0.041857 89 0.039266 dtype: float64
experiment dataset id CMS CMSJETS11 91 0.795176 92 0.433008 90 0.183384 98 0.169136 94 0.161152 93 0.146175 102 0.127973 88 0.118514 104 0.106222 89 0.087007 dtype: float64
experiment dataset id CMS CMSJETS11 96 0.730436 98 0.525954 95 0.296263 102 0.150932 92 0.120665 97 0.095891 88 0.081745 104 0.080990 107 0.078600 90 0.078550 dtype: float64
experiment dataset id CMS CMSJETS11 96 0.614437 98 0.572195 97 0.442722 92 0.132950 94 0.115931 91 0.106367 104 0.095446 95 0.093549 107 0.092698 88 0.074344 dtype: float64
experiment dataset id CMS CMSJETS11 82 0.950954 80 0.257360 81 0.124004 83 0.063108 85 0.051430 88 0.035520 126 0.027949 104 0.023463 102 0.021610 79 0.018033 dtype: float64
experiment dataset id CMS CMSZDIFF12 26 0.437179 23 0.355098 CMSTOPDIFF8TEVTTRAPNORM 0 0.351623 CMSZDIFF12 14 0.347388 CMSTOPDIFF8TEVTTRAPNORM 4 0.308711 6 0.292405 8 0.262261 CMSZDIFF12 15 0.192447 17 0.179234 CMSTOPDIFF8TEVTTRAPNORM 5 0.173230 dtype: float64
experiment dataset id CMS CMSZDIFF12 14 0.529791 15 0.355548 CMSTOPDIFF8TEVTTRAPNORM 0 0.339717 6 0.293359 4 0.286667 8 0.261370 CMSZDIFF12 17 0.240950 10 0.213810 16 0.180069 CMSTTBARTOT 1 0.175105 dtype: float64
experiment dataset id CMS CMSZDIFF12 22 0.605559 CMSTTBARTOT 1 0.348985 CMSTOPDIFF8TEVTTRAPNORM 0 0.317281 6 0.301393 CMSZDIFF12 17 0.276574 CMSTOPDIFF8TEVTTRAPNORM 8 0.263771 CMSZDIFF12 23 0.219103 CMSTOPDIFF8TEVTTRAPNORM 4 0.214468 CMSZDIFF12 18 0.192254 CMSTOPDIFF8TEVTTRAPNORM 5 0.115554 dtype: float64
experiment dataset id CMS CMSZDIFF12 26 0.555399 23 0.436466 CMSTOPDIFF8TEVTTRAPNORM 6 0.300060 CMSZDIFF12 15 0.278502 CMSTOPDIFF8TEVTTRAPNORM 8 0.265105 CMSZDIFF12 22 0.251316 CMSTTBARTOT 1 0.245425 CMSZDIFF12 14 0.209796 16 0.199282 17 0.170654 dtype: float64
experiment dataset id CMS CMSZDIFF12 15 0.502389 CMSTOPDIFF8TEVTTRAPNORM 0 0.480072 CMSZDIFF12 17 0.413471 CMSTOPDIFF8TEVTTRAPNORM 6 0.271904 8 0.269075 CMSZDIFF12 18 0.217422 22 0.194309 16 0.180680 CMSTOPDIFF8TEVTTRAPNORM 2 0.178709 CMSTTBARTOT 1 0.119885 dtype: float64
experiment dataset id CMS CMSZDIFF12 15 0.547231 14 0.407995 CMSTOPDIFF8TEVTTRAPNORM 4 0.327714 8 0.270136 CMSZDIFF12 10 0.244142 23 0.240841 CMSTOPDIFF8TEVTTRAPNORM 0 0.209225 6 0.196883 CMSZDIFF12 26 0.172745 CMSTOPDIFF8TEVTTRAPNORM 2 0.158824 dtype: float64
experiment dataset id CMS CMSZDIFF12 17 0.427460 CMSTOPDIFF8TEVTTRAPNORM 4 0.412608 CMSZDIFF12 14 0.334090 10 0.290624 16 0.278386 CMSTOPDIFF8TEVTTRAPNORM 8 0.270073 CMSZDIFF12 23 0.244983 CMSTTBARTOT 1 0.239512 CMSTOPDIFF8TEVTTRAPNORM 5 0.216522 CMSZDIFF12 15 0.162464 dtype: float64
experiment dataset id CMS CMSZDIFF12 22 0.538167 17 0.399490 CMSTOPDIFF8TEVTTRAPNORM 0 0.384947 4 0.311500 5 0.271353 8 0.265109 CMSZDIFF12 10 0.225297 16 0.188953 18 0.166803 5 0.096068 dtype: float64
experiment dataset id CMS CMSZDIFF12 18 0.417348 23 0.373494 22 0.355702 26 0.321935 CMSTTBARTOT 1 0.290788 CMSTOPDIFF8TEVTTRAPNORM 8 0.256607 5 0.239285 CMSZDIFF12 10 0.227088 CMSTOPDIFF8TEVTTRAPNORM 2 0.214231 CMSZDIFF12 5 0.173533 dtype: float64
experiment dataset id CMS CMSZDIFF12 18 0.693178 17 0.293994 CMSTOPDIFF8TEVTTRAPNORM 2 0.283645 8 0.247326 CMSZDIFF12 12 0.243703 CMSTOPDIFF8TEVTTRAPNORM 6 0.236848 CMSZDIFF12 23 0.195026 10 0.180075 14 0.160953 CMSTOPDIFF8TEVTTRAPNORM 5 0.157173 dtype: float64
experiment dataset id CMS CMSZDIFF12 16 0.639854 10 0.401524 CMSTOPDIFF8TEVTTRAPNORM 6 0.260082 CMSZDIFF12 14 0.246796 15 0.239886 CMSTOPDIFF8TEVTTRAPNORM 8 0.237526 2 0.220996 CMSZDIFF12 12 0.182211 CMSTOPDIFF8TEVTTRAPNORM 4 0.149826 5 0.118921 dtype: float64
experiment dataset id CMS CMSZDIFF12 16 0.482468 23 0.296335 12 0.275501 CMSTOPDIFF8TEVTTRAPNORM 0 0.267438 6 0.265399 CMSZDIFF12 10 0.258125 5 0.248619 17 0.240528 14 0.232880 CMSTOPDIFF8TEVTTRAPNORM 8 0.226126 dtype: float64
experiment dataset id CMS CMSZDIFF12 5 0.632875 12 0.443086 CMSTOPDIFF8TEVTTRAPNORM 2 0.349340 CMSJETS11 132 0.286507 CMSTOPDIFF8TEVTTRAPNORM 6 0.250496 8 0.182052 CMSZDIFF12 16 0.146722 1 0.136411 CMSTOPDIFF8TEVTTRAPNORM 4 0.121129 5 0.113774 dtype: float64
experiment dataset id CMS CMSJETS11 132 0.774595 CMSTOPDIFF8TEVTTRAPNORM 2 0.367270 CMSJETS11 131 0.234541 CMSTOPDIFF8TEVTTRAPNORM 6 0.218246 5 0.210962 8 0.158280 CMSZDIFF12 23 0.125652 CMSTOPDIFF8TEVTTRAPNORM 0 0.123822 CMSJETS11 130 0.120472 CMSZDIFF12 12 0.095445 dtype: float64
experiment dataset id CMS CMSZDIFF12 5 0.447283 1 0.394500 CMSJETS11 132 0.382421 CMSTOPDIFF8TEVTTRAPNORM 2 0.318930 5 0.316135 CMSZDIFF12 10 0.236433 CMSTOPDIFF8TEVTTRAPNORM 6 0.172129 CMSZDIFF12 14 0.161333 16 0.157715 23 0.145002 dtype: float64
experiment dataset id CMS CMSZDIFF12 1 0.768704 CMSTOPDIFF8TEVTTRAPNORM 5 0.361236 CMSJETS11 131 0.308623 CMSTOPDIFF8TEVTTRAPNORM 4 0.229999 CMSJETS11 130 0.220361 132 0.136475 CMSTOPDIFF8TEVTTRAPNORM 6 0.114422 8 0.110225 CMSZDIFF12 12 0.090354 23 0.068534 dtype: float64
experiment dataset id CMS CMSJETS11 130 0.842278 CMSTOPDIFF8TEVTTRAPNORM 5 0.357166 4 0.210651 CMSJETS11 129 0.202779 131 0.182773 CMSTOPDIFF8TEVTTRAPNORM 2 0.138939 8 0.089688 6 0.070925 CMSZDIFF12 1 0.054843 10 0.038118 dtype: float64
experiment dataset id CMS CMSJETS11 129 0.781462 127 0.381998 CMSTOPDIFF8TEVTTRAPNORM 5 0.329620 2 0.250143 4 0.169071 CMSZDIFF12 1 0.091505 CMSJETS11 131 0.069106 CMSTOPDIFF8TEVTTRAPNORM 8 0.068775 CMSZDIFF12 17 0.058924 CMSTOPDIFF8TEVTTRAPNORM 0 0.050878 dtype: float64
experiment dataset id CMS CMSJETS11 129 0.528579 128 0.373021 127 0.351301 130 0.346663 CMSTOPDIFF8TEVTTRAPNORM 2 0.324293 5 0.298213 CMSZDIFF12 1 0.241912 CMSTOPDIFF8TEVTTRAPNORM 4 0.158944 CMSZDIFF12 5 0.093819 CMSTOPDIFF8TEVTTRAPNORM 0 0.081572 dtype: float64
experiment dataset id CMS CMSJETS11 81 0.785720 79 0.333817 86 0.272322 87 0.266200 74 0.165125 83 0.164100 88 0.161656 76 0.117783 82 0.098749 78 0.079445 dtype: float64
experiment dataset id CMS CMSJETS11 83 0.564074 80 0.518500 81 0.319363 86 0.252971 87 0.252014 79 0.244207 74 0.179872 76 0.153282 88 0.151775 82 0.135803 dtype: float64
experiment dataset id CMS CMSJETS11 83 0.731696 81 0.362587 86 0.278393 87 0.272143 80 0.250084 79 0.194767 88 0.170383 74 0.141138 76 0.132361 85 0.073079 dtype: float64
experiment dataset id CMS CMSJETS11 79 0.759115 83 0.320850 86 0.242213 87 0.236703 78 0.229339 75 0.226356 81 0.203479 88 0.152578 80 0.133171 73 0.076604 dtype: float64
experiment dataset id CMS CMSJETS11 75 0.568140 74 0.490505 76 0.469290 79 0.291540 87 0.182074 86 0.176561 73 0.162501 78 0.125447 88 0.111659 81 0.065124 dtype: float64
experiment dataset id CMS CMSJETS11 69 0.653047 70 0.540514 71 0.381274 72 0.216240 73 0.153546 80 0.143565 74 0.139116 67 0.079736 81 0.055348 68 0.052814 dtype: float64
experiment dataset id CMS CMSJETS11 68 0.595575 67 0.499752 69 0.425760 70 0.291748 71 0.161628 74 0.146370 73 0.142719 66 0.139970 80 0.134944 72 0.069073 dtype: float64
experiment dataset id CMS CMSJETS11 67 0.693076 68 0.600026 73 0.179162 74 0.149444 70 0.141062 80 0.138901 66 0.138772 71 0.111714 69 0.081969 64 0.078438 dtype: float64
experiment dataset id CMS CMSJETS11 61 0.685192 62 0.611203 67 0.202857 69 0.151393 71 0.135770 65 0.127787 64 0.115957 73 0.105968 68 0.100394 74 0.085009 dtype: float64
experiment dataset id CMS CMSJETS11 61 0.718928 62 0.539290 67 0.208726 64 0.177741 69 0.148240 65 0.144395 71 0.124352 63 0.114497 68 0.111906 66 0.099507 dtype: float64
```
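
For reference, the test described above could be packaged roughly as follows. This is a sketch under assumptions, not the code that produced the output: the function name and its arguments are hypothetical, and `index` stands for whatever labels the rows of the covmat (for example the experiment slice of `experiments_index`). Note that `eigh` returns eigenvectors as the columns of its second output, so the sketch indexes `svecs[:, i]`, whereas the one-liner quoted above indexes rows.

```python
import numpy as np
import pandas as pd
import scipy.linalg as sla

def unstable_eigvec_components(cov, index, n=10, rtol=1e-5, atol=1e-8):
    """For eigenvalues where numpy and scipy disagree, return the `n` largest
    absolute eigenvector components, labelled by `index`."""
    nvals, _ = np.linalg.eigh(cov)
    svals, svecs = sla.eigh(cov)
    mismatched = np.where(~np.isclose(svals, nvals, rtol=rtol, atol=atol))[0]
    out = {}
    for i in mismatched:
        # eigh returns eigenvectors as columns: svecs[:, i] pairs with svals[i]
        comps = pd.Series(np.abs(svecs[:, i]), index=index)
        out[int(i)] = comps.nlargest(n)
    return out

# Toy check on a well-conditioned matrix: no mismatches are expected.
toy_cov = np.diag([1.0, 2.0, 3.0])
toy_index = pd.Index(["a", "b", "c"])
print(unstable_eigvec_components(toy_cov, toy_index))
```

The `rtol` and `atol` defaults are those of `np.isclose`; since the mismatch counts in this thread depend on the tolerance, exposing them makes the comparison between machines explicit.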

Zaharid commented 5 years ago

Right. Seems there was some point to my talk at Buffalo after all. I can think of two kinds of reasons for this problem. The first is the one discussed in the talk, where we have a combination of small stats and assumptions of high correlations. The second, possibly, is that covmats with wildly different units are more unstable than they need to be.

wilsonmr commented 5 years ago

before I forget, I think this is possibly only an issue at NLO, at NNLO we essentially regulate the covmat with the sys 10 business, but I need to actually check this..

Zaharid commented 5 years ago

Don't think that would affect the cms dy data, say.


wilsonmr commented 5 years ago

> And the only thing which is correlated across datasets is the luminosity uncertainty (i.e. all CMS 7 TeV data share the same lumi, and all CMS 8 TeV data share the same lumi).

@enocera could you point me towards a reference which discusses exactly how the intra-experiment systematics are determined? Is any part of it a recipe at our end, or is it entirely dictated by experimentalists?

wilsonmr commented 5 years ago

> Don't think that would affect the cms dy data, say.

Yeah I guess you're right, it was just a thought

enocera commented 5 years ago

> @enocera could you point me towards a reference which discusses exactly how the intra-experiment systematics are determined? Is any part of it a recipe at our end, or is it entirely dictated by experimentalists?

@wilsonmr Do you mean a reference in general or a reference specific to CMSDY2D11?

wilsonmr commented 5 years ago

If a general one exists that would be good. You seemed to identify the lumi quickly, and I was wondering if this was discussed in general at some point or if you had just seen the implementations of most of these datasets?

With the datasets I recently added there was just the covariance matrix for that dataset, so I have no intuition for the systematics correlated across datasets, other than that I know they exist and have always just accepted them as absolute truth.

Zaharid commented 5 years ago

@wilsonmr I'd say the general method is ill-defined guesswork. It is explained in some detail here:

https://www.slac.stanford.edu/econf/C030908/papers/TUAT004.pdf

Zaharid commented 5 years ago

(note that we do even more ill defined things to define the theory covmat).

wilsonmr commented 5 years ago

yeah I was looking in buildmaster and I found roughly what I was looking for:

// Luminosity Uncertainty
// CMS Luminosity Uncertainty, 2012 data set: 2.6%
// arXiv:1504.03511v2
fSys[i][fNSys-1].mult = 2.6;
fSys[i][fNSys-1].add  = fData[i]*fSys[i][fNSys-1].mult/100;
fSys[i][fNSys-1].type = MULT;
fSys[i][fNSys-1].name = "CMSLUMI12";
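
To make the effect of such a shared label concrete, here is a rough sketch of the covariance contribution from a single multiplicative uncertainty that is fully correlated across all points carrying the same name. The helper `lumi_covariance` and the central values are made up, and the sketch ignores the t0 prescription mentioned earlier in the thread; only the `data * mult / 100` conversion is taken from the snippet above.

```python
import numpy as np

def lumi_covariance(data, mult_percent):
    """Covariance from one fully correlated multiplicative uncertainty."""
    add = data * mult_percent / 100.0  # same conversion as in the buildmaster snippet
    return np.outer(add, add)

# Made-up central values for two datasets sharing the same luminosity label,
# with very different magnitudes because of their units.
data_a = np.array([1.0e3, 2.0e3])
data_b = np.array([1.0e-2, 2.0e-2])
data = np.concatenate([data_a, data_b])

cov_lumi = lumi_covariance(data, 2.6)
print(cov_lumi[0, 0])                    # about 676: variance of the first point
print(cov_lumi[0, 2])                    # non-zero cross-dataset covariance
print(np.linalg.matrix_rank(cov_lumi))   # an outer product has rank 1
```

The off-diagonal blocks between the two datasets are the only place where cross-dataset correlations enter in this picture.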

I'll also take a look at the reference, thanks Zahari

wilsonmr commented 5 years ago

Surely a multiplicative luminosity uncertainty is a recipe for disaster (even if it is what the paper quotes; I followed their reference and the original source says 2.5%, but whatever)? If you have two datasets with dimensionful points whose central values are orders of magnitude apart, won't this cause issues?

Zaharid commented 5 years ago

Not sure I see why. Having a completely correlated uncertainty is fine. The problem is when all relevant uncertainties are completely correlated.
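
A toy illustration of this point, with made-up numbers: a covmat built only from one fully correlated uncertainty is singular, while adding uncorrelated statistical errors bounds the smallest eigenvalue from below by the smallest statistical variance.

```python
import numpy as np

# Made-up numbers: three data points sharing one 2.6% multiplicative uncertainty.
data = np.array([10.0, 20.0, 40.0])
shift = 0.026 * data
cov_corr = np.outer(shift, shift)        # fully correlated part only

stat = np.array([0.5, 0.8, 1.0])         # uncorrelated statistical errors
cov_full = cov_corr + np.diag(stat**2)

print(np.linalg.matrix_rank(cov_corr))     # 1, so the 3x3 matrix is singular
print(np.linalg.eigvalsh(cov_full).min())  # not below stat.min()**2, up to rounding
```

The bound follows from adding a positive semi-definite matrix to a diagonal one; trouble starts when the uncorrelated part is tiny compared with the correlated one.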

There is, however, the general issue with the units, which I don't completely understand yet (particularly how different magnitudes in the covmat affect stability).

enocera commented 5 years ago

The recommendation to correlate the luminosity uncertainties across bins of different data sets in the same experiment (that follow from the same experimental run) was formulated by Nathan when we were working towards NNPDF3.1. At that time we were not yet working in terms of github PRs and issues, so I cannot point out an issue to you where this was specifically discussed. It seems perfectly reasonable to me, though. Nathan explicitly enforced this "prescription" in the code. I have to admit that I never checked if labels are actually correct for the NNPDF3.1 experiments (but knowing Nathan I bet they are), as I was not in charge of coordinating the data implementation at that time. Since then, I just made sure that the appropriate luminosity label is used for new experiments.

The collider luminosity and/or other normalisation uncertainties are the only cases of (almost always MULT) uncertainties correlated across data sets: apart from the LHC lumi, you can find examples pretty much everywhere in DIS experiments (HERACOMB, SLAC, BCDMS, CHORUS, ...). Again: note that correlations are only across bins of different data sets in the same experiment (e.g. SLACP and SLACD have intra-experiment correlations, but e.g. SLACP and NMC don't).

There is not a general rule to correlate uncertainties across data sets. What one usually does is to carefully read the experimental paper and try to retrieve the relevant information from there on an exp by exp basis. This leads to a lot of ambiguity, though, since the information contained in exp papers is not always clear/complete, especially if it is provided in terms of a breakdown of systematic uncertainties (which is most common) instead of in terms of a covariance matrix (which is less common). Our interpretation usually relies on a combination of "good sense" and "common lore" (what @Zaharid would call "ill-defined guesswork"). In principle, the code allows us to be flexible in such an interpretation, in that we can define different sys files corresponding to different interpretations.

A final small remark: the CMS experiment that you implemented is provided with a covariance matrix. We didn't care about identifying the lumi uncertainty contribution to the cov matrix (and assigning a custom label to it) because it's the only CMS data set at 13 TeV which will be included in NNPDF4.0. But we need to keep this in mind for the future, when other data at 13 TeV from CMS will be included.

enocera commented 5 years ago

@wilsonmr I was forgetting: if the original source said 2.5% and we implemented 2.6%, this is a bug and must be corrected. Period. Of course this won't change the final picture at all, but still...

wilsonmr commented 5 years ago

I will open a relevant issue, I might just be unable to read