FAIRmat-NFDI / nexus_definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

nyaml2nxdl xref edge cases #115

Closed lukaspie closed 8 months ago

lukaspie commented 8 months ago

When using the new xref feature, I came across an edge case when adding the following docstring:

(NXinstrument):
    doc: 
     - |
      MPES spectrometer
     - | 
      xref:
        spec: ISO 18115-1:2023
        term: 12.58
        url: https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58

which is translated to

<group type="NXinstrument">
      <doc>
           MPES spectrometer

               This concept is related to term `12.58`_ of the ISO 18115-1:2023 standard.
           .. _12.58: NO URL
      </doc>
</group>

Here the problem is that the line linked below picks up the term: substring inside the value of the url field (the URL ends in term:12.58), so the URL is no longer read correctly and NO URL is written instead. https://github.com/FAIRmat-NFDI/nexus_definitions/blob/1016aa054582323d44cfd215685639ef9d4605b8/dev_tools/nyaml2nxdl/nyaml2nxdl_forward_tools.py#L302
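To illustrate the collision, here is a minimal sketch. It assumes the parser decides which keyword a line belongs to with a plain substring test; that is a guess at the failure mode, not a copy of the code at the linked line. The function names naive_key and anchored_key are hypothetical.

```python
# The url line from the docstring above.
URL_LINE = (
    "url: https://www.iso.org/obp/ui/en/"
    "#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58"
)

KEYS = ("spec", "term", "url")


def naive_key(line):
    """Assumed failure mode: the first keyword found anywhere in the line wins."""
    for key in KEYS:
        if f"{key}:" in line:
            return key
    return None


def anchored_key(line):
    """Only accept a keyword that starts the line (after indentation)."""
    stripped = line.lstrip()
    for key in KEYS:
        if stripped.startswith(f"{key}:"):
            return key
    return None


print(naive_key(URL_LINE))     # the url line is mistaken for a term line
print(anchored_key(URL_LINE))  # the url line is recognised as url
```

With the substring test, the url line is classified as term because the URL itself contains term:, which would explain why no URL is found afterwards.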

@domna suggested using a regex to check for spec, term, and url. @mkuehbach also pointed out that we should first check that xref is present before checking for the three sub-keywords.

Other edge cases:

domna commented 8 months ago

Thanks for creating the issue. A regex such as ^\s*term\: (.+) could be used to check the lines; for spec and url, replace term accordingly (for url we could even do a proper URL check). ~For [\t\f\v ]*$ something like [\s^\r\n]*$ might also work (not sure about the syntax though).~ I think just using \s* is fine, because we split into lines anyway, or even better, don't use $ at all (I edited the regex above and below).
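A minimal sketch of how the anchored regex could be applied, with the xref check done first as suggested earlier in the thread. The helper parse_xref and the XREF_PATTERNS table are hypothetical names for illustration; they are not part of nyaml2nxdl.

```python
import re
from typing import Dict, Optional

# One anchored pattern per sub-keyword, following the ^\s*term\: (.+) idea.
XREF_KEYS = ("spec", "term", "url")
XREF_PATTERNS = {key: re.compile(rf"^\s*{key}\: (.+)") for key in XREF_KEYS}
XREF_HEADER = re.compile(r"^\s*xref\:\s*$")


def parse_xref(doc_block: str) -> Optional[Dict[str, str]]:
    """Return the xref sub-keywords found in a doc block, or None if no xref."""
    lines = doc_block.splitlines()
    # First make sure an xref: line exists before looking for sub-keywords.
    if not any(XREF_HEADER.match(line) for line in lines):
        return None
    found: Dict[str, str] = {}
    for line in lines:
        for key, pattern in XREF_PATTERNS.items():
            match = pattern.match(line)
            if match:
                found[key] = match.group(1).strip()
                break  # a line can only belong to one keyword
    return found


DOC = """xref:
  spec: ISO 18115-1:2023
  term: 12.58
  url: https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58
"""

print(parse_xref(DOC))
```

Because each pattern is anchored at the start of the line, the term: inside the URL no longer matches the term pattern.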

Edit: This could be something for url (from here):

^\s*url\:\s+(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})
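As a quick check, the url pattern above can be compiled with Python's re module and tried against the line from the original docstring. This is only a sketch; the helper name extract_url is hypothetical, and the pattern is copied unchanged from the comment.

```python
import re

# Pattern copied verbatim from the comment above.
URL_PATTERN = re.compile(
    r"^\s*url\:\s+(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}"
    r"|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}"
    r"|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}"
    r"|www\.[a-zA-Z0-9]+\.[^\s]{2,})"
)


def extract_url(line):
    """Return the URL from a 'url: ...' line, or None if the line does not match."""
    match = URL_PATTERN.match(line)
    return match.group(1) if match else None


ISO_URL = "https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58"

print(extract_url(f"        url: {ISO_URL}"))  # the full URL, including term:12.58
print(extract_url("        term: 12.58"))      # None, not a url line
```

A line whose value is not a URL (for example url: NO URL) is rejected, so a malformed entry could be reported instead of being silently written out.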