HEPData / hepdata_lib

Library for getting your data into HEPData
https://hepdata-lib.readthedocs.io
MIT License

Add ability to read in YODA files #229

Open GraemeWatt opened 1 year ago

GraemeWatt commented 1 year ago

For cases where an analyser already has data in the YODA format for use with Rivet, it would be useful if hepdata_lib could read YODA files for conversion to the HEPData YAML format. It would be preferable if YODA were an optional rather than a mandatory dependency. The question of converting YODA to HEPData YAML has been a long-standing issue (HEPData/hepdata-converter#10), but it would be better handled by hepdata_lib than by the hepdata-converter.
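A rough sketch of what such a reader might look like, keeping YODA optional by importing it lazily inside the function: the hepdata_lib calls (Table, Variable, Uncertainty) are the standard public API, but the yoda.read() lookup and the bin accessors (xMin, xMax, height, heightErr) differ between YODA versions and are shown here purely as placeholders, not as a proposed implementation.

```python
from hepdata_lib import Table, Variable, Uncertainty

def read_yoda_histogram(path, name):
    """Convert one YODA histogram into a hepdata_lib Table (sketch only)."""
    try:
        import yoda  # optional dependency, only imported when actually needed
    except ImportError as err:
        raise RuntimeError(
            "Reading YODA files requires the optional 'yoda' package") from err

    hist = yoda.read(path)[name]  # yoda.read returns a dict keyed by object path
    # Placeholder accessors: the exact bin API differs between YODA versions.
    edges = [(b.xMin(), b.xMax()) for b in hist.bins()]
    heights = [b.height() for b in hist.bins()]
    stat_errs = [b.heightErr() for b in hist.bins()]

    x = Variable("x", is_independent=True, is_binned=True)
    x.values = edges
    y = Variable("y", is_independent=False, is_binned=False)
    y.values = heights
    stat = Uncertainty("stat", is_symmetric=True)
    stat.values = stat_errs
    y.add_uncertainty(stat)

    table = Table(name)
    table.add_variable(x)
    table.add_variable(y)
    return table
```

Keeping the import inside the function means that `import hepdata_lib` itself never requires YODA to be installed.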

Cc: @20DM

20DM commented 1 week ago

Hi Graeme!

I'm just in the process of preparing submissions for the reference data files in Rivet that don't have a HEPData entry yet. I'm currently struggling to use hepdata_lib for cases with inhomogeneous error breakdowns across bins. For instance, I have a distribution with three bins where the first two bins have two error components 'A' and 'B' (but not 'C') and the third bin has error component 'C' (but not 'A' and 'B').

I know this is supported in principle, e.g. by just omitting the respective components from the dictionary. However, when using the library, hepdata_lib/helpers.py raises a ValueError:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Is there a trick?
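For reference, the message quoted above is the generic NumPy failure for ragged input; a minimal illustration (not the actual helpers.py code) reproduces it with recent NumPy versions:

```python
import numpy as np

# Two bins carry components ('A', 'B'), the third only 'C', so the per-bin
# entries have different lengths and NumPy cannot build a regular array.
ragged = [[0.1, 0.2], [0.1, 0.3], [0.05]]
np.array(ragged)  # ValueError: setting an array element with a sequence ...
```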

20DM commented 1 week ago

PS - just to be clear: of course I can "make it pass" by just setting the uncertainty to zero, but then all bins will have three uncertainty components, some of them zero, which is not the same as the bin not having the component in its breakdown to begin with. I think the problem is that the check for "non-zero uncertainties" only checks if there's at least one non-zero component and then adds all of them, regardless of their value. Can we make this more flexible?
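Concretely, the zero-padding workaround looks like this with the standard hepdata_lib objects (a sketch; the numbers are placeholders, and the zero entries are exactly what I'd like to avoid):

```python
from hepdata_lib import Variable, Uncertainty

y = Variable("sigma", is_independent=False, is_binned=False, units="pb")
y.values = [10.0, 12.0, 3.0]  # placeholder central values for the three bins

# Pad every component to the full bin count; the zeros stand in for
# "component not present in this bin", which is not quite the same thing.
components = {"A": [0.5, 0.6, 0.0],
              "B": [0.2, 0.3, 0.0],
              "C": [0.0, 0.0, 0.4]}
for name, values in components.items():
    unc = Uncertainty(name, is_symmetric=True)
    unc.values = values
    y.add_uncertainty(unc)
```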

20DM commented 1 week ago

On a different note: we have a few cases with a discrete (string) axis where a subset of the edges is technically a floating-point range. The library then throws an error like this:

error - independent_variable 'value' must not be a string range (use 'low' and 'high' to represent a range): '1.225-1.300' in 'independent_variables[0].values[6].value' (expected: {'type': 'number or string (not a range)'})

Of course I agree that a discrete axis where all bins are of the form float - float should just be a continuous axis, and it's great that the validator enforces this. However, there are also a number of examples on HEPData where this kind of bin is mixed with genuine discrete bins, and we might want to allow this kind of axis in general, no?

One simple example I'm looking at right now has two bins = [ "7 - 8", "13" ] corresponding to LHC centre-of-mass energies. One could get around the error by splitting this table into separate tables with a continuous [7.0, 8.0] bin or a discrete [ "13" ] bin, respectively, but then the two measurement points would not end up in the same plot without additional post-processing, which seems a shame. 🤔
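For reference, the failing case boils down to something like this (a sketch; the table name, variable names and numbers are placeholders):

```python
from hepdata_lib import Table, Variable

sqrt_s = Variable("$\\sqrt{s}$", is_independent=True, is_binned=False, units="TeV")
sqrt_s.values = ["7 - 8", "13"]  # the "7 - 8" string is what the validator rejects

ratio = Variable("ratio", is_independent=False, is_binned=False)
ratio.values = [1.1, 1.3]  # placeholder values

table = Table("Cross-section ratio")
table.add_variable(sqrt_s)
table.add_variable(ratio)
```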

20DM commented 6 days ago

On second thought, I suspect this requirement comes from the cases where a differential distribution has a single bin corresponding to the average prepended or appended, which probably shouldn't be allowed. Maybe best to leave the validator as is, and I'll work around these cases (there are only 5 of them, so it should be manageable).

GraemeWatt commented 6 days ago

This error comes from the hepdata-validator package rather than hepdata_lib. A common encoding mistake was for uploaders to specify a bin as a single value with the bin limits separated by a hyphen rather than giving separate low and high values (HEPData/hepdata-validator#33), so we implemented a check to catch it. I think hepdata_lib does not support mixed bins such as {low: 7, high: 8} and value: 13, although this is allowed in the HEPData YAML format. You could use {low: 13, high: 13} (unless a zero-width bin causes problems?) or use a separator other than a hyphen for the discrete bin "7 - 8", like "7 to 8" or "7 & 8".
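In hepdata_lib terms, those workarounds would look roughly like this (a sketch; whether the zero-width bin renders sensibly is worth checking on a test record):

```python
from hepdata_lib import Variable

# Option 1: keep the axis discrete and avoid the hyphen in the string bin.
sqrt_s = Variable("$\\sqrt{s}$", is_independent=True, is_binned=False, units="TeV")
sqrt_s.values = ["7 to 8", "13"]

# Option 2: make the axis binned; the (13.0, 13.0) entry becomes
# {low: 13, high: 13} in the HEPData YAML.
sqrt_s_binned = Variable("$\\sqrt{s}$", is_independent=True, is_binned=True, units="TeV")
sqrt_s_binned.values = [(7.0, 8.0), (13.0, 13.0)]
```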

20DM commented 6 days ago

Well, there were only 5 cases where I encountered this issue, so I've just replaced the dash with a "to" or "&", depending on the context. It's sufficiently rare that this is probably good enough for now.

Good news, though: I've now managed to create submission tarballs that make the validator happy for all of the Rivet reference files that don't have a HEPData entry yet. There are 780 tarballs in total. What's the best way to submit them? I hope I don't have to upload them through the browser one by one? 😉

20DM commented 6 days ago

PS - I have a guest account for the IPPP cluster if it would be helpful for me to upload them there somewhere?

GraemeWatt commented 6 days ago

Great work! You should log into hepdata.net and click "Request Coordinator Privileges" on your Dashboard, then enter "Rivet" as the Experiment/Group. You can then click the "Submit" button to initiate a submission with an INSPIRE ID and specify an Uploader and Reviewer (maybe just yourself in both roles, unless you want a check from someone else). This will create an empty record that allows you to upload, then the record can be reviewed (there's a shortcut "Approve All Tables") and finalised from your Dashboard.

In terms of automation, we haven't yet encountered a need for bulk uploads like this, so unfortunately there's no easy way to finalise 780 records. The upload stage could be done from the command line (or from Python) using the hepdata-cli tool (see Example 9), but it requires an invitation cookie specific to each record. The record creation, reviewing and finalisation can only be done from the web interface. It might be possible to (semi-)automate these steps using something like Selenium, but I think each record should undergo a basic visual check by a human before it is finalised. I suggest you perform the create/upload/review/finalise workflow manually for a few records until you see what is involved, then decide whether it is worthwhile to write scripts to (semi-)automate the procedure.
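For the upload step, something like the following could work from Python, assuming hepdata-cli's Client exposes an upload method taking the tarball path, a contact email, the record id and the per-record invitation cookie; the argument names here are assumptions, so please check them against the hepdata-cli documentation before relying on this:

```python
# Sketch only: batch-uploading tarballs via hepdata-cli's Python interface.
# The Client.upload argument names are assumed from the command-line examples.
from hepdata_cli.api import Client

records = [
    # (record id, invitation cookie, path to tarball) -- placeholder entry,
    # to be filled in from the submission metadata for each record
    (123456, "00000000-0000-0000-0000-000000000000", "tarballs/ins123456.tar.gz"),
]

client = Client(verbose=True)
for recid, cookie, tarball in records:
    client.upload(tarball, email="uploader@example.org",
                  recid=recid, invitation_cookie=cookie, sandbox=False)
```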

GraemeWatt commented 6 days ago

I've approved your Coordinator request. I realised that we already have a module for bulk imports, written to import records from hepdata.net into a developer's local instance. Previously, we had a similar module for bulk migration of records from the old HepData site to the new hepdata.net site. The importer module bypasses the web interface of the normal submission system, so it would be a more efficient way of importing a large number of tarballs. If you could copy the tarballs to a web-accessible location and provide a list of INSPIRE IDs in a format similar to https://www.hepdata.net/search/ids?inspire_ids=true, I'll look into making the necessary changes to the importer module. I've opened a new issue, HEPData/hepdata#811, so please continue the discussion there as it no longer relates to hepdata_lib.

20DM commented 6 days ago

Great - thank you!! 🙏