FAIRmat-NFDI / pynxtools

https://fairmat-nfdi.github.io/pynxtools/
Apache License 2.0
13 stars 8 forks source link

Improve search for fields when they are not directly in the appdef #386

Open domna opened 4 months ago

domna commented 4 months ago

When we try to found out if a key is documented in any of the parent appdefs or base classes we check all children names and then try to namefit the best match. However, this part does not into account probable NX classes provided with the dict. get_all_direct_children_names has an nx_class parameter now which can be used.

A second problem is that concept may be shielded. We had this for NXmpes/NXsample/temperature(NXenvironment) and NXsample/temperature(NX_FLOAT), which in principle should both be visible but get_all_direct_children_names is just a set with names so it only shows the one closest to the appdef.

This is the code part which needs to be updated: https://github.com/FAIRmat-NFDI/pynxtools/blob/e0d3184f1892d13042768531214f6e03b0710212/src/pynxtools/dataconverter/validation.py#L512-L513

Probably it's best to search through the inheritance chain with lxml directly, instead of using get_all_direct_children_names in combination with search_add_child_for, which just adds the node based on name. With lxml the node should be searched, namefitted and the NexusNode shall be created with add_node_from, which takes an xml element directly (hence no ambiguity). It is a similar technique to what I employed for the sibling namefitting: https://github.com/FAIRmat-NFDI/pynxtools/blob/e0d3184f1892d13042768531214f6e03b0710212/src/pynxtools/dataconverter/nexus_tree.py#L634

cc @RubelMozumder @mkuehbach (as new maintainers of the validation part)

rettigl commented 4 months ago

I generally don't understand how such an algorithm should find out the intended match. See e.g. DATA and AXISNAME in NXdata. As soon as there are more than one unnamed Field/Group, the match cannot be unique, no? As this can lead to wrong behavior then, I would suggest to not do any automatic concept fitting, but always require naming the concept for unnamed fields. Otherwise, how can we catch e.g. a typo of a named field? I noticed this as after renaming "bias" to "bias_env" in NXsample of NXmpes, there was no complaint for the old "bias" field, but it was namefitted as an NXbeam, which does not make any sense. I propose this has to raise a warning instead. Also, this wrongly namefitted group did not have any NXclass attribute.

domna commented 4 months ago

I generally don't understand how such an algorithm should find out the intended match.

I agree in most cases this is problematic and this should also be reflected in the changes to the parts above.

See e.g. DATA and AXISNAME in NXdata.

This ones are even more special as we do not even have an NXclass attribute available. However, there is a special treatment for NXdata which associates this concepts based on the @signal and @axes attributes. Generally, resolving fields associated to ambiguous uppercase groups is not really possible in nexus (if we don't have the special treatment as for NXdata). We could resolve this on the writing side by using an uppercase notation there, too. This would, however, create problems on the reading side as the reader would not have the same information available and hence the association of the concept will be arbitrary (i.e. based on the reader implementation - but there is no "right" way to figure it out).

Also, this wrongly namefitted group did not have any NXclass attribute.

For this we should remove that the algorithm tries to namefit if it doesn't find a fitting exact name and only allow association based on uppercase parts in the path. It eventually turned out that is a design flaw that the algorithm even tries (as there is no benefit, because we almost always have multiple upper case concepts we could fit to).