[Bug]: Fields are reported as missing even though they are present in the nexus file

Contact Details

No response

What happened?

This can be tested in #203 in the tests/data/dataconverter/readers/mpes directory. When the file is converted it emits the following warnings:

dataconverter --reader mpes --nxdl NXmpes --input-file xarray_saved_small_calibration.h5 --input-file config_file.json
[info]: Path /ENTRY[entry]/PROCESS[process]/energy_calibration/coefficients not found. Skipping the entry.
The data entry corresponding to /ENTRY[entry]/data/@signal is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/data is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/energy is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/INSTRUMENT[instrument]/beam_TYPE/incident_energy is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/energy/@type is required and hasn't been supplied by the reader.

even though all of these fields are present in the config_file and are written to the file.

Relevant log output

dataconverter --reader mpes --nxdl NXmpes --input-file xarray_saved_small_calibration.h5 --input-file config_file.json
[info]: Path /ENTRY[entry]/PROCESS[process]/energy_calibration/coefficients not found. Skipping the entry.
The data entry corresponding to /ENTRY[entry]/data/@signal is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/data is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/energy is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/INSTRUMENT[instrument]/beam_TYPE/incident_energy is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/data/energy/@type is required and hasn't been supplied by the reader.

@lukaspie

I had a quick look at the filled template from the reader and the generated-template one for reference. It seems that you have setup the data group to have a fixed name in the NXDL file. So the generated-template has /ENTRY[entry]/data, but from your reader you fill in /ENTRY[entry]/DATA[data]. Obviously these mean the same thing. This field shouldn't be renamed according to the NXDL. That's why the generated-template has it without the square brackets. I can add a check to see if these are equals. But it can also add confusion in situations where one would try to change the hdf name in the square brackets. Also, it adds complications when there could be two groups of type data at the same level. Then in the current verification there is no way to see if either one of them are fully filled and not missing some fields, etc.

I will recommend trying to base off the generated-template as the source of truth to fill it from the reader.

I couldn't find the incident_energy in the filled template. Maybe I missed it. Let me know if it is filled by another key in the dict.

So the generated-template has /ENTRY[entry]/data, but from your reader you fill in /ENTRY[entry]/DATA[data].

Yes, we renamed that. Thank you. This helps to avoid most of the data related outputs.

I can add a check to see if these are equals. But it can also add confusion in situations where one would try to change the hdf name in the square brackets. Also, it adds complications when there could be two groups of type data at the same level. Then in the current verification there is no way to see if either one of them are fully filled and not missing some fields, etc.

I think it would be helpful to support both. Having two groups of the same class should be no problem, because we cannot distinguish two groups of the same type in nexus anyways. The only way to do this properly is by adding a lowercase part to the group name, e.g., beam_BEAM , otherwise in nexus we wouldn't know which class to take. I think for these cases we should rather have a warning, that this is bad design (but this is rather an issue for nyaml than pynxtools as this is more appdef design related).

I couldn't find the incident_energy in the filled template. Maybe I missed it. Let me know if it is filled by another key in the dict.

The incident_energy is filled for two groups, named beam_probe and beam_pump which should both match beam_TYPE. Here is the field for probe beam https://github.com/FAIRmat-NFDI/pynxtools/blob/4d469fa3ea7aa1b419a0704a8bbe86a00d67e90e/tests/data/dataconverter/readers/mpes/config_file.json#L170

and for pump beam https://github.com/FAIRmat-NFDI/pynxtools/blob/4d469fa3ea7aa1b419a0704a8bbe86a00d67e90e/tests/data/dataconverter/readers/mpes/config_file.json#L196

The current output from pynxtools is (with changed DATA[data] to data):

dataconverter --reader mpes --nxdl NXmpes --input-file xarray_saved_small_calibration.h5 --input-file config_file.json
[info]: Path /ENTRY[entry]/PROCESS[process]/energy_calibration/coefficients not found. Skipping the entry.
The data entry corresponding to /ENTRY[entry]/data/energy is required and hasn't been supplied by the reader.
The data entry corresponding to /ENTRY[entry]/INSTRUMENT[instrument]/beam_TYPE/incident_energy is required and hasn't been supplied by the reader.

The problem with the energy field is the same, because we set is as AXISNAME[energy]. This cannot really be changed nicely without a bunch of special cases because we fill this automatically from the source file, see here https://github.com/FAIRmat-NFDI/pynxtools/blob/4d469fa3ea7aa1b419a0704a8bbe86a00d67e90e/tests/data/dataconverter/readers/mpes/config_file.json#L406

The first one is some print statement from the mpes reader itself. I'll dig into this what is the problem there.

Generally, I think the problem in pynxtools boils down to the support of lower-upper-case-combinations (also for #210) for beam_TYPE or source_SOURCE and if we allow something like AXISNAME[energy] if a field energy exists in the appdef.

Because of this issue with the AppDefs, I have kept one simple rule to keep the dataconverter a bit sane. I put whatever the "concept" path is outside the square brackets and the hdf path is consisted of what's in the square brackets.

The problem I have is how do I verify if we have both required /entry/data[mydata]/req_field and /entry/data[myotherdata]/req_field. If I distill the two to their concept names, I have no way to verify children as you can then have one group with half the required fields and another with the other half. But they will both pass the verifier.

We can disable checking for these odd cases if you like. Implementing something to take this into account will bring up issues in other places.

Because of this issue with the AppDefs, I have kept one simple rule to keep the dataconverter a bit sane. I put whatever the "concept" path is outside the square brackets and the hdf path is consisted of what's in the square brackets.

The problem I have is how do I verify if we have both required /entry/data[mydata]/req_field and /entry/data[myotherdata]/req_field. If I distill the two to their concept names, I have no way to verify children as you can then have one group with half the required fields and another with the other half. But they will both pass the verifier.

I don't fully understand this. They should not distill into a single concept name but create two instances of the specialised base class data, one named mydata and the other myotherdata. Both need to be verified against the specialised base class to pass. The additional thing to do is to check for matches on the name, i.e., is there are a fixed field of name mydata we need to check against or is there a group with the partial free name myotherNAME. The composition of the hdf path name actually chooses which base class to use (so unfortunately we cannot fully separate them). My arguing is that we need this check anyways to match something like myotherNAME and checking for mydata is included in that as an edge case.

We can disable checking for these odd cases if you like. Implementing something to take this into account will bring up issues in other places.

True, I think this might also be too big of a change before the workshop, but we should add a proper implementation for this eventually. I think for the mpes examples I might also be able to formulate it such in the config file that we don't get any warning (not sure of the incident_energy, though).

FAIRmat-NFDI / pynxtools