Use multiformat reader - Githubissues

Hi @domna, I have I started to implement the multi format reader (first focussing on the phi parser). There are multiple things that are currently working nicely already.

[x] file parsing
[x] filling template from @attrs -> had to implement a special logic so that I can keep working with my existing notation
[x] filling template from @data -> works for the average and raw_data keys, which write the entry/data/data and .../detector/raw_data/raw fields. Also works for any axisnames that are stored as coordinates of my data xarray.I write the individual scans also as DATA fields to NXdata using the get_data_dims function.

Then, I have some questions that are probably already possible for the existing solution in https://github.com/FAIRmat-NFDI/pynxtools/pull/250. Would be thankful for your input on these.

[x] filling template from @eln -> previously, I was simply writing @eln in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @eln, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome. EDIT: works now as config keys are passed to get_* methods
[x] Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@eln, "no title"], the converter should check (in order) the @attrs and @eln tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@eln, "no title"] into the field. EDIT: works now through https://github.com/FAIRmat-NFDI/pynxtools/pull/390

Finally, there are some things that I have no idea how to implement with the current solution:

[x] multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented here. EDIT: works as config dict is exposed to post processing
[x] In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction. EDIT: works, but only for XPS reader

And then, finally finally, there could be some additional functionalities added. In the XpsReader, I check that any unit in the final template is a valid pint unit, but for this I need to iterate the whole array. We could implement this earlier and make it a generic functionality in the multi format reader as well.

[x] implemented only for the XPS reader

Sorry for the long text, but these are all the issues I encountered.

filling template from @eln -> previously, I was simply writing @eln in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @eln, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@eln, "no title"], the converter should check (in order) the @attrs and @eln tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@eln, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented https://github.com/FAIRmat-NFDI/pynxtools/pull/388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented https://github.com/FAIRmat-NFDI/pynxtools/pull/388

Btw. there is also the *{name1, name2} notation, which we use here. This creates multiple fields with the respective names enlisted. If there is a * in any of the sub-field values it replaces those, too. This could be helpful if you have a fixed set of detector names.

filling template from @ELN -> previously, I was simply writing @ELN in the config file because for a given config key like /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance, there is a corresponding /ENTRY[entry]/INSTRUMENT[instrument]/beamTYPE[beam_xray]/distance key in the ELN dict after parsing (using the convert_dict). Thus, I did not need to give any path in the config. However, the get_eln_data requires a path str and if I simply write @ELN, it doesn't work. I would probably need to write the whole path (incl. upper- and lowercase notation), which is quite cumbersome.

I see. We could also pass the config dict key down to the get_... functions. I first did it but removed it later because I had the feeling that it's not really used. But we can surely re-add it.

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for ["@attrs:title", "@ELN, "no title"], the converter should check (in order) the @attrs and @ELN tokens and write the respective value if it exists there. If not, it defaults to "no title". However, not it writes this list ["@attrs:title", "@ELN, "no title"] into the field.

I implemented the notation differently. You need to write @attrs,@eln,"no title:title. It might be a bit weird now that I think about it, but at least there is a mechanism for it which we can tweak to our liking.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@eln,no title", but that did not work.

multiple detectors, analyser, etc. -> for this to work, I would need to replace these in the keys in the parse_json_config method (like for the entries: key = key.replace("/ENTRY[entry]/", f"/ENTRY[{entry_name}]/")). However, this is currently not possible. A workaround would be if we replaced them in the config file before we fill from the template (i.e. in the post_process function). I have implemented something here, but then we need to change the read function in the MultiFormatReader like I implemented FAIRmat-NFDI/pynxtools#388

Could something like this work: https://github.com/FAIRmat-NFDI/pynxtools-mpes/blob/25983bae61f04ad379d77fd5b335fe4759ba4766/tests/data/config_file.json#L113

I like the approach of exposing the config dict for manipulation and being able to do some post-processing on it. So let's go for your approach.

The problem for me is that depending on the measurement settings, more or less detectors may be used, which I can infer from the actual data. So I need to dynamically replace them and cannot use the *{name1, name2} notation because I don't know name1, name2 beforehand. Thus, my idea was to actually make copies of the existing fields in the config, with the detector names replaced. For example, if I have in my config file

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name":"@attrs:detector/name"
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[detector]/name_link":"@link:/entry/instrument/detector/name",

the process_multiple_entities method would dynamically populate the config dict with

"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name":"@attrs:my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_first_detector]/name_link":"@link:/entry/instrument/my_first_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name":"@attrs:my_second_detector/name",
"/ENTRY[entry]/INSTRUMENT[instrument]/DETECTOR[my_second_detector]/name_link":"@link:/entry/instrument/my_second_detector/name",

and remove the keys with the generic detector from the dict.

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

In my ELN file, I would like to give the possibility to have some metadata for all entries and then some only for specific entries. I can do this now using some special logic in the eln file parsing, but I am not sure that is even possible with the NOMAD ELNs. Just wanted to get your opinion on this if it is worth working in this direction.

One possibility I also thought about also notation wise is that we actually not write /ENTRY[entry]/... and replace it but use /ENTRY/... to denote that this is a generic concept which will be filled by the reader. This also allows that we can actually write something specific to an entry, e.g., called my_entry with /ENTRY[my_entry]/... (or give the possibility there to say only to entry number 1 or so).

But I'm not sure if this is the use case you are aiming at.

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

Yeah, I think that would be quite helpful, also for some of the other callbacks (like in the get_data_dims, I want to decide which dims to use based on whether I am writing the raw detector data or the final NXdata).

This can be easily added. Edit: It's added in the multiformatreader PR.

Can you spell this out again? There may be an error in what you wrote. I tried "/ENTRY[entry]/title":"@attrs:title,@eln,no title", but that did not work.

If I understand my code correctly it expects something like "/ENTRY[entry]/title":"@attrs,@eln,no title:title", were it would pass the path title to the get_attrs and get_eln functions (basically it only allows a single value for all keywords). But your notation probably makes more sense here (+ is more flexible).

For the XPS reader, this works (after merging my PR in pynxtools). We can think if we want a generic solution for this in the MultiFormatReader or if each reader shall implement this itself.

I think it's fine if this is a specialised version. The MultiFormatReader is rather an aid tool to quickly build a reader for different formats. I think what we do in mpes and xps is comparably involved already (and we mainly want the config dict standardisation there).

My plan was to do it like here, have some generic ELN metadata and then some metadata specific to entry Su1s.

Do you want this /ENTRY/... notation for all renameable field in the config file? I like it, but then the reader must also give the analyser, collectioncolumn, etc. names.

This should work with /ENTRY/ and /ENTRY[Su1s]/ then. I would restrict it to ENTRY (for now at least), because we only keep track of the entry names in the reader and not the other names. Edit: I did the relevant change. So you probably need to change your config file to get all entries again.

I also update the prefix-value parsing now and it should work as you tried it, e.g., @attrs:test,@eln,no title should pass test to get_attrs, '' to get_eln and otherwise writes no title. Whitespaces are added to the paths so they should be avoided, e.g., @attrs: test ,@eln would call get_attrs(key=...,value=" test ").

Pull Request Test Coverage Report for Build 10354752021

Details

1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 9976726494:	0.0%
Covered Lines:	48
Relevant Lines:	48

FAIRmat-NFDI / pynxtools-xps

Use multiformat reader #74

Pull Request Test Coverage Report for Build 10354752021

Details

💛 - Coveralls