HEPData / hepdata

Repository for main HEPData web application
https://hepdata.net
GNU General Public License v2.0

records: extend `importer` module to allow bulk import from Rivet #811

Open GraemeWatt opened 3 months ago

GraemeWatt commented 3 months ago

The importer module (CLI) was written to import records from hepdata.net to a developer's local instance. It uses a list of INSPIRE IDs given at https://www.hepdata.net/search/ids?inspire_ids=true and it downloads files using a URL pattern `url = "{0}/download/submission/ins{1}/original".format(base_url, inspire_id)` where `base_url = 'https://hepdata.net'`.

The importer module should be extended to get the list of INSPIRE IDs and the download files from an alternate location, for example, a simple web directory with the INSPIRE IDs contained in the name of the files. It should also be possible to create records with any user assigned as the Coordinator (rather than just admin_user_id = 1). The ability to import only a subset of the complete list of INSPIRE IDs would be useful.
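As a rough illustration of the requested extension (not the actual importer code), the download URL could be built from a configurable template, with an optional subset filter. The function name `build_download_urls`, the `only` parameter, and `DIRECTORY_TEMPLATE` are assumptions made for this sketch; only the hepdata.net pattern comes from the text above.

```python
from typing import Iterable, List, Optional

# Pattern used by the current importer (quoted in the issue text).
HEPDATA_TEMPLATE = "{base_url}/download/submission/ins{inspire_id}/original"
# Hypothetical pattern for a plain web directory of tarballs (an assumption).
DIRECTORY_TEMPLATE = "{base_url}/ins{inspire_id}.tar.gz"


def build_download_urls(
    inspire_ids: Iterable[str],
    base_url: str = "https://hepdata.net",
    template: str = HEPDATA_TEMPLATE,
    only: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return one download URL per INSPIRE ID, optionally restricted to a subset."""
    # Normalise the subset to strings so that 123 and "123" compare equal.
    wanted = None if only is None else {str(i) for i in only}
    urls = []
    for inspire_id in inspire_ids:
        inspire_id = str(inspire_id)
        if wanted is not None and inspire_id not in wanted:
            continue  # not part of the requested subset
        urls.append(
            template.format(base_url=base_url.rstrip("/"), inspire_id=inspire_id)
        )
    return urls
```

Keeping the template as a parameter means the existing hepdata.net behaviour stays the default, while an alternate location only needs a different `base_url` and `template`.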

These changes should be carefully tested locally and on the QA system before importing to the production instance. Such an extension would be a quicker way of importing the 780 records obtained from Rivet than using the normal submission web interface.

See also discussion with @20DM in HEPData/hepdata_lib#229.

A list of the Rivet analyses can be seen at https://gitlab.com/hepcedar/rivet/-/issues/485 .

20DM commented 3 months ago

Thanks for this Graeme. I put the list of inspire IDs here and you'll find the 780 tarballs in the same directory. They all have a name of the form ins123456.tar.gz.
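Given file names of the form `ins123456.tar.gz`, the INSPIRE IDs could be recovered from a directory listing along these lines. This is a minimal sketch: the helper name `inspire_ids_from_filenames` is hypothetical, and how the listing itself is obtained is left open.

```python
import re
from typing import Iterable, List

# Matches names such as "ins123456.tar.gz" and captures the numeric INSPIRE ID.
TARBALL_NAME = re.compile(r"^ins(\d+)\.tar\.gz$")


def inspire_ids_from_filenames(filenames: Iterable[str]) -> List[str]:
    """Extract INSPIRE IDs from tarball names, skipping anything that does not match."""
    ids = []
    for name in filenames:
        match = TARBALL_NAME.match(name)
        if match:
            ids.append(match.group(1))
    return ids
```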

GraemeWatt commented 3 months ago

@20DM : thanks, that's great! I'll look into modifying the importer module soon.

I picked a random submission (ins2705058.tar.gz) and uploaded it to my Sandbox. A few (optional) comments for your consideration:

  1. You give http://rivet.hepforge.org/analyses#BESIII_2023_I2705058 as an additional resource. This is not strictly necessary (see submission docs) since the link will automatically be added after the record is finalised from the nightly harvesting of the analyses.json file. Moreover, the automatic link added will be http://rivet.hepforge.org/analyses/BESIII_2023_I2705058 with a / instead of a #. So if you want to keep the Rivet analysis in the submission.yaml file, better to use a link with a / instead of a #, or just remove it completely.
  2. The comment has a weird markup that is not rendered by HEPData. It looks like you are taking this from the journal abstract given by the INSPIRE record (JSON). The INSPIRE JSON also provides the arXiv abstract (second item of `abstracts`) that uses LaTeX markup and can be rendered by HEPData. HEPData uses the arXiv abstract from INSPIRE if possible (code). Since HEPData already stores the paper abstract (although it is only displayed if there is no comment), I don't think you need to duplicate it in the comment. So I would just use the additional information "NUMERICAL VALUES HAVE BEEN DIGITISED FROM THE PAPER." as the comment or omit the comment completely if there is nothing to add. (Another possibility is to use the Description from the Rivet `.info` file as the comment, but in this case it contains `Beam energy must be specified as analysis option "ENERGY" when rivet-merging samples.`, which is not relevant to the HEPData record.)
  3. It looks like Tables 1 and 2 share a common independent variable axis, so it would make sense to combine them into one table with two dependent variables, then the "Custom Rivet identifier" would not need to be given since the YODA export would give the correct identifiers automatically. Of course, I realise that some compromises need to be made in the interest of automation, and so the best overall encoding for 780 submissions is going to be different than if each submission was prepared separately.

20DM commented 3 months ago

Thanks for the feedback, Graeme!

Re 1: Ah good point, yes it should be /. I found the version with the # in an existing HepData yaml somewhere but do not recall which one it was now. I'll remove it then altogether, seeing as you run the nightly anyway.
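For submissions that do keep the Rivet link, the `#` form could be rewritten to the `/` form with a one-line helper. This is only a sketch; the function name `normalise_rivet_url` is made up for illustration.

```python
def normalise_rivet_url(url: str) -> str:
    """Rewrite '.../analyses#NAME' to '.../analyses/NAME'; other URLs are unchanged."""
    return url.replace("/analyses#", "/analyses/", 1)
```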

Re 2: Ouff, yeah that doesn't look great. Apologies for that! I didn't realise there were occasionally two abstracts - it looks like most of the Inspire IDs I've got on my list only have one in fact. I've tweaked the logic now to take the arXiv one if it's available and fall back to the Inspire one otherwise. I was already falling back to the description from the Rivet info file in the few cases (~5) where no abstract is available from Inspire. I think we definitely want to add some kind of caveat sentence to highlight that the values are digitised from the paper (or come from Rivet or whatever - happy to tweak the wording!) in order to make it clear that they weren't provided directly by the experiment. However, I wouldn't want that single sentence to suppress the abstract, which I find useful to have personally, so perhaps the duplication of the abstract in the comment is acceptable? In any case, I've replaced the tarballs with new versions using the arXiv abstract where available.
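The fallback order described above (arXiv abstract, then any other INSPIRE abstract, then the Rivet description, plus a caveat sentence) might look roughly like the following. This is a sketch, not the script actually used: it assumes the INSPIRE record metadata carries an `abstracts` list whose entries have `source` and `value` keys, and the names `choose_comment` and `CAVEAT` are hypothetical.

```python
from typing import Any, Dict, Optional

# Caveat wording taken from the suggestion earlier in this thread.
CAVEAT = "NUMERICAL VALUES HAVE BEEN DIGITISED FROM THE PAPER."


def choose_comment(
    metadata: Dict[str, Any], rivet_description: Optional[str] = None
) -> str:
    """Pick the abstract text for the comment and append the caveat sentence."""
    # Assumed layout: metadata["abstracts"] = [{"source": ..., "value": ...}, ...]
    abstracts = metadata.get("abstracts") or []
    arxiv = [
        a["value"]
        for a in abstracts
        if (a.get("source") or "").lower() == "arxiv" and a.get("value")
    ]
    any_source = [a["value"] for a in abstracts if a.get("value")]
    if arxiv:
        text = arxiv[0]  # preferred: LaTeX markup that HEPData can render
    elif any_source:
        text = any_source[0]  # fall back to whatever abstract INSPIRE provides
    else:
        text = rivet_description or ""  # last resort: Rivet .info description
    return (text.strip() + " " + CAVEAT).strip()
```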

Re 3: So I actually started doing this at first, but then quickly realised that it would require rewriting several hundred of the routines: many of them currently "abuse" the x- and y-axis integers in the identifier to group distributions, but not necessarily in the intended way. For instance, there are cases where one would need to turn existing y groups into separate d instances because the independent axis was actually different between them, and then the x groups would need to be turned into y groups, since HepData doesn't really have non-unit x identifiers. The if-else branching in my script got out of hand very quickly and since I didn't really fancy re-writing and re-validating hundreds of routines, I figured it'd be easier to leave the routines as they are, and to just relabel them on-the-fly using the custom Rivet identifier instead, which comes in very handy here. Hope that's OK?

20DM commented 2 months ago

Hi Graeme, just to ping this - is there anything I can help with?

GraemeWatt commented 2 months ago

Thanks for making the changes to the tarballs. I haven't started looking at this yet, since I didn't see that it was particularly urgent, but I'll try to look into it within the next couple of months.