NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

Define metadata fields #897

Closed Zaharid closed 6 months ago

Zaharid commented 6 years ago

There is going to be a metadata specification file replacing both the plotting files and the commondata info structure. This issue is to discuss which fields it is going to have in addition to those already in the plotting specification or the info structure.

I propose to add a bibtex bibliography file containing the references for all datasets. We use bibtex to track references in all papers and this would thus simplify the main use case for having references at all.

With that, my initial proposal goes along the lines of:

Implementer: string or list of strings. The name(s) of the people who implemented the data and should be asked about it.

Data reference: string or list of strings. Bibtex keys referencing the experimental measurement.

Data source: string or list of strings. URL or similar where the data has been obtained from.

~Theory reference: string or list of strings. Bibtex keys referencing the codes or calculations for the theoretical prediction.~

Extended description: text. The equivalent of what is currently at the top of the filters cc files. A description of the implementation and peculiarities of the data.
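For illustration only, the fields proposed above could be represented roughly as follows. This is a minimal sketch, not an agreed format: the class name, the snake_case field names and the example values are all hypothetical, and the "string or list of strings" fields are simply normalised to lists.

```python
from dataclasses import dataclass
from typing import List, Union


def _as_list(value: Union[str, List[str]]) -> List[str]:
    """Normalise a 'string or list of strings' field to a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


@dataclass
class DatasetMetadata:
    # Hypothetical field names mirroring the proposal in this issue.
    implementer: List[str]
    data_reference: List[str]  # bibtex keys
    data_source: List[str]     # URLs or similar
    extended_description: str = ""

    @classmethod
    def from_mapping(cls, raw: dict) -> "DatasetMetadata":
        """Build a record from a plain mapping, e.g. a parsed config file."""
        return cls(
            implementer=_as_list(raw["implementer"]),
            data_reference=_as_list(raw["data_reference"]),
            data_source=_as_list(raw["data_source"]),
            extended_description=raw.get("extended_description", ""),
        )


# Placeholder values, not a real dataset.
example = DatasetMetadata.from_mapping({
    "implementer": "A. Person",
    "data_reference": ["Example:2017abc"],
    "data_source": "https://example.org/record/0000",
})
```

Normalising to lists up front means downstream code (for instance something that collects all bibtex keys) never has to check whether a field was given as a single string.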

nhartland commented 6 years ago

Generally looks ok to me, although I'd say the theory reference sort of naturally lives elsewhere (in the READMEs for APPLgrids). There is something to be said for consolidating these things, but in my mind they're reasonably distinct, in the sense that I don't want to have to update the buildmaster specifications when I change the theory used to compute the FK tables.

Zaharid commented 6 years ago

How often do we update the source of the theory calculation vs buildmaster? Do we have evidence that the experimental data is 'more constant' than the theory in enough magnitude to justify splitting things up? How would I go about using e.g. validphys to generate a table with the data and theory references like the one that appears in the paper?

Zaharid commented 6 years ago

OTOH there is enough complexity in the theory implementation to merit a specific metadata file for it, which could also have the COMPOUND specification.

One problem with that is that I am not volunteering to write those.

nhartland commented 6 years ago

It definitely does get updated more often than the data, but that isn't saying very much seeing as we very, very rarely update the data.

I agree also that a theory metadata file is the natural place for the COMPOUND info. I'd rather not burden the buildmaster metadata with the peculiarities of our theory settings etc.

Zaharid commented 6 years ago

OK. So we have a theory metadata as well?

That would have as a minimum the references, COMPOUND, and some description of the usage of the external code.

One could think of also moving some apfelcomb fields but let's leave that for later.

nhartland commented 6 years ago

We already do have one to some extent, but it could certainly be organised better.

Zaharid commented 6 years ago

Additionally we could have default cfactors for nlo and nnlo, which would remove a frequent source of bugs in fits.

scarrazza commented 6 years ago

That would have as a minimum the references, COMPOUND, and some description of the usage of the external code.

and eventually about cfactors. I like the idea of a theory metadata.

nhartland commented 6 years ago

Yeah it's a good idea, but I think it's at a tangent to the current discussion.

scarrazza commented 6 years ago

I agree, for this issue we can probably state that theory information will be implemented somewhere else.

Zaharid commented 6 years ago

Well, the discussion now is whether there is enough motivation for two sets of metadata files and what they should look like.

IMO we can withdraw the theory citation from the proposal for the experiment metadata and open an issue (in apfelcomb?) for the corresponding theory metadata.

nhartland commented 6 years ago

IMO we can withdraw the theory citation from the proposal for the experiment metadata and open an issue (in apfelcomb?) for the corresponding theory metadata.

Either there or in applgrids

scarrazza commented 6 years ago

Yes, please do that.

Zaharid commented 6 years ago

This is missing the source of the data which is quite important, and not necessarily trivial to find from the paper.

Updated it.

Zaharid commented 3 years ago

@siranipour @enocera @RosalynLP @cschwan I am wondering if we could add the required information for https://github.com/NNPDF/papers/issues/25 so that this kind of thing could be done more or less automatically in the future.

Zaharid commented 3 years ago

In practice that would mean getting some sort of csv with the relevant fields (and deciding what those are). Then adding that to the plotting files would be simple enough.
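As a rough sketch of what reading such a csv could look like: the column names (`dataset`, `arxiv`, `hepdata`) and all values below are made up for illustration, since the relevant fields have not been decided here.

```python
import csv
import io

# Hypothetical CSV content; column names and values are placeholders.
CSV_TEXT = """dataset,arxiv,hepdata
DATASET_A,0000.00000,
DATASET_B,1111.11111,10.0000/placeholder
"""


def load_reference_fields(text):
    """Map each dataset name to its non-empty extra fields."""
    reader = csv.DictReader(io.StringIO(text))
    result = {}
    for row in reader:
        name = row.pop("dataset")
        # Drop empty cells so that missing information is simply absent.
        result[name] = {key: value for key, value in row.items() if value}
    return result


fields = load_reference_fields(CSV_TEXT)
```

The resulting per-dataset dictionaries could then be merged into whatever the plotting files end up looking like.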

cschwan commented 3 years ago

@Zaharid : We already have that in https://github.com/NNPDF/runcards which stores the Madgraph5_aMC@NLO runcards; if you take a look at the processes in the nnpdf31_proc subdirectory, you'll find a file called metadata.txt for each process. That's a simple file containing key=value entries, which one can easily parse. The metadata is also stored in the generated PineAPPL grids, which you can extract using

pineappl info --get key grid.pineappl.lz4

and you can list all keys using

pineappl info --keys grid.pineappl.lz4

Finally you can use

pineappl info --show grid grid.pineappl.lz4

In the metadata.txt files you can put any key=value combination. You can also automatically generate plots with it:

pineappl --silence-lhapdf plot grid.pineappl.lz4 NNPDF31_nlo_as_0118_luxqed NNPDF31_nlo_as_0118 CT18NNLO > plot_script.py && python3 plot_script.py

The generated PineAPPL grids you'll find here (there's only an old one right now): https://github.com/NNPDF/pineapplgrids
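To illustrate how easily the key=value files described above can be parsed, here is a minimal sketch. It assumes one entry per line, no comment syntax and no quoting, which may not match the real metadata.txt files exactly; the function name and the sample keys are hypothetical.

```python
def parse_metadata(text):
    """Parse key=value lines into a dict; blank lines are skipped."""
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        metadata[key.strip()] = value.strip()
    return metadata


# Placeholder content, not taken from an actual metadata.txt file.
sample = "arxiv=0000.00000\ndescription=some process, a=b\n"
parsed = parse_metadata(sample)
```

Splitting on the first `=` only is a deliberate choice here, so that free-text values are not truncated.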

Zaharid commented 3 years ago

@cschwan Thanks. I guess if we wanted to generate tables automatically we would need to make one more indirection arxiv->bibtex, which however must be already done in the paper table. That said, we could extract things like the hepdata reference from there. Also the actual entries of the table, i.e. whether the thing is included in various other releases, may be useful...

cschwan commented 3 years ago

@Zaharid I've also got an entry for the arXiv identifier and hepdata DOI (if available, some LHCb experiments don't have it), and from inspire you can easily get the reference. If it isn't easy to get the inspire page from the arXiv ID, we can add the inspire entry as well.

If one could write a program that downloads and converts the data from hepdata into the NNPDF format, that'd be very nice (for instance we could verify that our data implementations are up-to-date; sometimes they silently change it, if you remember the CDFZRAP story). One problem is that the data formats on hepdata seem to change a lot, but on the other hand hepdata itself seems to be able to plot it, so it must be feasible.