NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

Datasets should know their default configuration #226

Closed Zaharid closed 8 months ago

Zaharid commented 6 years ago

When one specifies a dataset in the runcard without options (e.g. cfactors, training fractions, systematics), it should do the right thing as opposed to the bare thing.

To that end we should have in the metadata what the default configuration is (which in turn may depend on the theory settings and whatnot, much like the cuts).
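As a rough illustration of the proposal, the sketch below fills in whatever a runcard entry leaves out from per-dataset defaults that depend on the theory. Everything here is an assumption made for the example: the `METADATA_DEFAULTS` layout, the `resolve_dataset_input` helper, the dataset name and the option values are hypothetical and are not validphys API.

```python
from typing import Any

# Hypothetical per-dataset metadata: defaults keyed by perturbative order.
# The dataset name and all values are illustrative only.
METADATA_DEFAULTS: dict[str, dict[str, dict[str, Any]]] = {
    "EXAMPLE_DY": {
        "NLO": {"cfac": [], "frac": 0.75, "sys": 0},
        "NNLO": {"cfac": ["QCD"], "frac": 0.75, "sys": 0},
    },
}


def resolve_dataset_input(entry: dict[str, Any], theory_order: str) -> dict[str, Any]:
    """Fill options missing from a runcard entry with the metadata defaults."""
    defaults = METADATA_DEFAULTS.get(entry["dataset"], {}).get(theory_order, {})
    # Options written explicitly in the runcard take precedence over defaults.
    return {**defaults, **entry}


# A bare entry picks up the defaults for the requested theory.
bare = resolve_dataset_input({"dataset": "EXAMPLE_DY"}, "NNLO")

# An explicit option overrides only that option; the rest still comes from defaults.
explicit = resolve_dataset_input(
    {"dataset": "EXAMPLE_DY", "cfac": ["QCD", "EWK"]}, "NNLO"
)

print(bare)
print(explicit)
```

With this kind of lookup, anything that does appear in the runcard is by construction a deliberate deviation from the default.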

A subtlety is that we are making the runcard less reproducible (e.g. what if the default cfactors change?), so there should be a mechanism for a fit to know which defaults it used (perhaps by copying the metadata into the fit, and adding a mechanism to read the metadata back from the result). An added subtlety on the subtlety is that there is a distinction between bugfixes (oops, we forgot the normalization cfactor) and genuine additions (we now have the electroweak corrections), and it is for discussion whether those should be treated the same or not. The simplest answer would be yes, in which case there are two modes: current defaults and fit defaults.
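One way to picture the two modes is to store a copy of the defaults with the fit and choose at analysis time which copy to use. This is only a sketch under assumptions: `snapshot_defaults`, `select_defaults`, the mode names and the dataset entry are all hypothetical, and JSON is used purely as a convenient stand-in for however the metadata would really be stored.

```python
import json
from typing import Optional


def snapshot_defaults(defaults: dict) -> str:
    """Serialize the defaults in force at fit time, to be stored with the fit."""
    return json.dumps(defaults, sort_keys=True)


def select_defaults(current: dict, fit_snapshot: Optional[str], mode: str) -> dict:
    """Pick defaults by mode: 'current' (today's metadata) or 'fit' (stored copy)."""
    if mode == "current":
        return current
    if mode == "fit":
        if fit_snapshot is None:
            raise ValueError("this fit carries no stored defaults")
        return json.loads(fit_snapshot)
    raise ValueError(f"unknown mode: {mode!r}")


# Defaults when the fit was run (hypothetical dataset name and values).
defaults_at_fit_time = {"EXAMPLE_DY": {"cfac": ["QCD"]}}
stored = snapshot_defaults(defaults_at_fit_time)

# Later on, the defaults gain electroweak corrections.
defaults_today = {"EXAMPLE_DY": {"cfac": ["QCD", "EWK"]}}

as_fitted = select_defaults(defaults_today, stored, mode="fit")
as_of_now = select_defaults(defaults_today, stored, mode="current")

print(as_fitted)
print(as_of_now)
```

Treating bugfixes and genuine additions the same way, as suggested above, means a single snapshot is enough; distinguishing them would need extra bookkeeping that this sketch does not attempt.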

Now we only have the fit defaults, which are problematic because one has to copy the settings into all the runcards (I doubt anybody has ever manually written a runcard after the first one) and it is all too easy (and too difficult to notice) to screw up a cfactor or similar.

As an added bonus the validphys runcards (which are mostly handwritten) would get nicely simplified for most common cases.

Somewhat related to #224 and #35.

nhartland commented 6 years ago

I'm not sure about this. I'd be worried that having explicit default states other than just "no c-factors" etc. would just add extra variables to bear in mind and continuously check (is this already the case in the metadata, or do I need to also put it here?), whereas in the current setup, although verbose, you know where everything is defined.

I like the 'single source of truth' when it comes to fit runcards.

Zaharid commented 6 years ago

The current setup is good in that it is easy to see what the configuration is from the runcard. It is bad in that it is difficult to get right. One has to copy a previous runcard and hope that it is applicable to the new fit. In fact, I don't believe it is documented anywhere else what configuration each dataset should have.

And even if you can see the configuration, it doesn't help very much in finding problems (which are typically discovered after staring at runcards for a couple of weeks).

Finally, how frequently do we change the cfactors when we don't have the explicit goal of changing the cfactors? For things like NRM that is exactly never, so they are not really configurable. Then having them in the runcard is just an opportunity to forget them.

nhartland commented 6 years ago

> And even if you can see the configuration, it doesn't help very much in finding problems (which are typically discovered after staring at runcards for a couple of weeks).

But wouldn't having the configuration be separated in two different places just make this harder?

> Finally, how frequently do we change the cfactors when we don't have the explicit goal of changing the cfactors?

Not often (which is why previous runcards are OK documentation), but I think it's helpful to have an immediate impression as to what datasets are including what. For example, in the scale variation fits it's generally obvious which sets are being included with QCD C-factors at NLO because they have cfac: [QCD] next to them.

Zaharid commented 6 years ago

On Tue, Jul 3, 2018 at 2:56 PM, Nathan Hartland notifications@github.com wrote:

> > And even if you can see the configuration, it doesn't help very much in finding problems (which are typically discovered after staring at runcards for a couple of weeks).

> But wouldn't having the configuration be separated in two different places just make this harder?

The assumption is that you don't want to look at the one place with the defaults that often, and that the things actually written in the runcard are exceptional and meaningful, as opposed to defaults that we copy and paste (and forget from time to time).

> > Finally, how frequently do we change the cfactors when we don't have the explicit goal of changing the cfactors?

> Not often (which is why previous runcards are OK documentation), but I think it's helpful to have an immediate impression as to what datasets are including what. For example, in the scale variation fits it's generally obvious which sets are being included with QCD C-factors at NLO because they have cfac: [QCD] next to them.

This particular task would be more difficult. But then again, how often do you want to know the cfactors, as opposed to knowing that the dataset is right?

Then another advantage is that if the datasets know more about themselves, analyzing them is inherently easier. E.g. we would be able to write:

dataspecs:

dataset_input: NAME

actions_:

without having to find an NLO fit and an NNLO fit that both contain that dataset, hope it is correct, and look up the appropriate cfactors for each case.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/NNPDF/nnpdf/issues/226#issuecomment-402166328, or mute the thread https://github.com/notifications/unsubscribe-auth/AFabUtM6cSxlijKNpuEdIWnlHrotwvtBks5uC3gDgaJpZM4U_qrK .

Zaharid commented 6 years ago

> Not often (which is why previous runcards are OK documentation), but I think it's helpful to have an immediate impression as to what datasets are including what. For example, in the scale variation fits it's generally obvious which sets are being included with QCD C-factors at NLO because they have cfac: [QCD] next to them.

Not to mention that writing a validphys action that prints that info is trivial, and in fact it already exists and is implemented in the report.
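To give a feel for how little such a listing needs, here is a minimal standalone sketch that tabulates which datasets carry which cfactors. It is not the existing validphys action mentioned above: the `cfactor_table` function and the dataset names are hypothetical, and it works on plain dictionaries rather than validphys objects.

```python
def cfactor_table(dataset_inputs: list[dict]) -> str:
    """Return a plain-text table of dataset name against its cfactors."""
    header = "dataset"
    width = max([len(header)] + [len(d["dataset"]) for d in dataset_inputs])
    lines = [f"{header.ljust(width)}  cfac"]
    for d in dataset_inputs:
        # Datasets without cfactors are shown with a dash.
        cfac = ", ".join(d.get("cfac", [])) or "-"
        lines.append(f"{d['dataset'].ljust(width)}  {cfac}")
    return "\n".join(lines)


# Hypothetical, fully resolved dataset inputs.
inputs = [
    {"dataset": "EXAMPLE_DIS"},
    {"dataset": "EXAMPLE_DY", "cfac": ["QCD"]},
]

table = cfactor_table(inputs)
print(table)
```

Because the table is built from the resolved inputs, it would show the same information whether the cfactors came from the runcard or from defaults.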


Zaharid commented 6 years ago

Sorry, I see how this argument was confusing:

> And even if you can see the configuration, it doesn't help very much in finding problems (which are typically discovered after staring at runcards for a couple of weeks).

I meant staring at reports.

nhartland commented 6 years ago

Yeah, I can see the argument that it makes running validphys analyses a lot easier (which is a pretty good argument). But I don't think it makes the fits any more robust; it just exchanges one possible failure mode (copying things wrong) for another (assuming that the defaults are something else).

Zaharid commented 6 years ago

I guess the argument is that relying on the defaults is likely to be a lower-risk assumption. But it's up for discussion.

scarlehoff commented 8 months ago

Now the default is the non-variant version. Anything else is a variant.
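Read loosely, that resolution can be sketched as a base configuration plus named overrides, where asking for nothing gives the base. The metadata layout, the `apply_variant` helper and the variant name below are assumptions made for illustration and may not match how the commondata metadata is actually organised.

```python
from typing import Any, Optional


def apply_variant(metadata: dict[str, Any], variant: Optional[str] = None) -> dict[str, Any]:
    """Return the default configuration, or the default with a variant's overrides."""
    base = {k: v for k, v in metadata.items() if k != "variants"}
    if variant is None:
        # No variant requested: the default is the non-variant version.
        return base
    try:
        overrides = metadata["variants"][variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None
    return {**base, **overrides}


# Hypothetical metadata for one dataset.
metadata = {
    "cfac": ["QCD"],
    "sys": 0,
    "variants": {"with_ewk": {"cfac": ["QCD", "EWK"]}},
}

default_config = apply_variant(metadata)
variant_config = apply_variant(metadata, "with_ewk")

print(default_config)
print(variant_config)
```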