FAIRmat-NFDI / nexus_definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/
Other
5 stars 8 forks source link

Resolving xref issue 115-nyaml2nxdl-xref-edge-cases #116

Closed RubelMozumder closed 8 months ago

RubelMozumder commented 8 months ago

Changes:

  1. term will be accept anything in-between a space and standard
  2. Any kind of URL will be accepted.

Fixes #115

lukaspie commented 8 months ago

Thanks for fixing the original issue that I raised this morning.

In the meantime, I also tested some more edges case and I am not sure that the behavior is the one we intended. For example, the following three xrefs would convert properly:

NXarchive(NXobject):
  (NXentry)entry:
    \@index:
    title:
    experiment_identifier(NX_CHAR):
      doc:
      - |
       Test doc.
      - |                                       # Example0
       xref:   
        spec: spec0
        term: term0
        url: url0"
      - |                                       # Example1
        "    xref:
          spec: spec1
        term: term1
            url: url1"
      - |                                       # Example2
        spec: spec2
        xref:
        term: term2
        url: url2"

Especially the last one seems weird since spec comes before xref. On the other hand, the one that I do expect to work does not convert with AttributeError: 'NoneType' object has no attribute 'group'.

- |                                       # Example3
  xref:
    spec: spec3
    term: term3
    url: url3

I am therefore a bit confused as to what the rules are for using the xref feature. I thought the plan was to have it as described in example 3, i.e., you first check if xref is there and then if the other three keywords are present and properly indented. Can I just write the xref anyway I like?

RubelMozumder commented 8 months ago

Thanks for fixing the original issue that I raised this morning.

In the meantime, I also tested some more edges case and I am not sure that the behavior is the one we intended. For example, the following three xrefs would convert properly:

NXarchive(NXobject):
  (NXentry)entry:
    \@index:
    title:
    experiment_identifier(NX_CHAR):
      doc:
      - |
       Test doc.
      - |                                       # Example0
       xref:   
        spec: spec0
        term: term0
        url: url0"
      - |                                       # Example1
        "    xref:
          spec: spec1
        term: term1
            url: url1"
      - |                                       # Example2
        spec: spec2
        xref:
        term: term2
        url: url2"

Especially the last one seems weird since spec comes before xref. On the other hand, the one that I do expect to work does not convert with AttributeError: 'NoneType' object has no attribute 'group'.

- |                                       # Example3
  xref:
    spec: spec3
    term: term3
    url: url3

I am therefore a bit confused as to what the rules are for using the xref feature. I thought the plan was to have it as described in example 3, i.e., you first check if xref is there and then if the other three keywords are present and properly indented. Can I just write the xref anyway I like?

Thank you for creating such a variety of examples. here is the docs

But the last one does not make any sense because xref should be in the upper hierarchy or administrative location. xref refers that it has some amount of reference information which are spec, term and URL probably more other info may be added later. But reshuffling in spec, term and URL makes sense.

lukaspie commented 8 months ago

But the last one does not make any sense because xref should be in the upper hierarchy or administrative location. xref refers that it has some amount of reference information which are spec, term and URL probably more other info may be added later. But reshuffling in spec, term and URL makes sense.

I agree that example 2 above does not make sense, but even this one converts to

    This concept is related to term `term2`_ of the spec2 standard.
.. _term2: url2

Wouldn't we want to catch that and prevent the user from writing the nyaml like this?

RubelMozumder commented 8 months ago

But the last one does not make any sense because xref should be in the upper hierarchy or administrative location. xref refers that it has some amount of reference information which are spec, term and URL probably more other info may be added later. But reshuffling in spec, term and URL makes sense.

I agree that example 2 above does not make sense, but even this one converts to

    This concept is related to term `term2`_ of the spec2 standard.
.. _term2: url2

Wouldn't we want to catch that and prevent the user from writing the nyaml like this?

your are absolutely right. In that case we should raise and error (this is your proposal, right). Thanks for bringing it.

lukaspie commented 8 months ago

But the last one does not make any sense because xref should be in the upper hierarchy or administrative location. xref refers that it has some amount of reference information which are spec, term and URL probably more other info may be added later. But reshuffling in spec, term and URL makes sense.

I agree that example 2 above does not make sense, but even this one converts to

    This concept is related to term `term2`_ of the spec2 standard.
.. _term2: url2

Wouldn't we want to catch that and prevent the user from writing the nyaml like this?

your are absolutely right. In that case we should raise and error (this is your proposal, right). Thanks for bringing it.

I am thinking that maybe it is a good idea to enforce a certain structure:

domna commented 8 months ago

@RubelMozumder @lukaspie I made a suggestion for a new xref parsing function. Can you check and test it? Probably we need to add error catching when yaml cannot be parsed.

Edit: I added an additional test where we can parametrise different inputs. Maybe we should have one for cases where we wan't it to fail, too.

lukaspie commented 8 months ago

Thanks @domna. Exactly, what I wanted to say is that the xref part should be a valid yaml! This function works as expected for all of the examples outlined above. The tests also look good.

My proposal for the suggested use in the documentation is to use 2-space indentation. Like this, you don't need to use " in the docstrings (which you need if you start the string at the level of - |.

doc:
  - |
    Test doc.
  - | 
    xref:   
      spec: spec0
      term: term0
      url: url0
domna commented 8 months ago

Thanks @domna. Exactly, what I wanted to say is that the xref part should be a valid yaml! This function works as expected for all of the examples outlined above. The tests also look good.

My proposal for the suggested use in the documentation is to use 2-space indentation. Like this, you don't need to use " in the docstrings (which you need if you start the string at the level of - |.

I agree, maybe we should even disallow the usage of " entirely, because the xref property is inherently multiline. When we add a " it actually means that this is contained in the yaml text as is. We can simply remove this by removing the strip('"') for the clean_txt. We just have to update the docs accordingly.

2-space indentation should work. As long as all the keys are lined up any indentation should work.

However, we still have the problem if there are duplicate keys. While yaml says keys should be unique pyyaml just takes the last value. It's a known bug/feature https://github.com/yaml/pyyaml/issues/165 (the community seems to not have agreed yet whether this is a bug or a feature). But I think we could add a simple check (split to lines and check if len(flatten(xref_dict)) is the same - that way we also avoid nested keys).

domna commented 8 months ago

However, we still have the problem if there are duplicate keys. While yaml says keys should be unique pyyaml just takes the last value. It's a known bug/feature yaml/pyyaml#165 (the community seems to not have agreed yet whether this is a bug or a feature). But I think we could add a simple check (split to lines and check if len(flatten(xref_dict)) is the same - that way we also avoid nested keys).

I added some checks for edge cases. Should be all fine now (except that removing of " lets all the other tests fail) Do we want to fail if the xref docstring starts with " or do we want to allow both (current status)?

Edit: I removed the support for it, because it also introduces artifact's for other docstrings. Starting and ending " are carried over even though they should be kept according to the yaml rules.

RubelMozumder commented 8 months ago

However, we still have the problem if there are duplicate keys. While yaml says keys should be unique pyyaml just takes the last value. It's a known bug/feature yaml/pyyaml#165 (the community seems to not have agreed yet whether this is a bug or a feature). But I think we could add a simple check (split to lines and check if len(flatten(xref_dict)) is the same - that way we also avoid nested keys).

I added some checks for edge cases. Should be all fine now (except that removing of " lets all the other tests fail) Do we want to fail if the xref docstring starts with " or do we want to allow both (current status)?

Edit: I removed the support for it, because it also introduces artifact's for other docstrings. Starting and ending " are carried over even though they should be kept according to the yaml rules.

I see there are a few tests for indivisual correct and wrong one, thanks @domna. I added the artifact " around the text so that any de-indentation does not affect yaml rules, as each line contains a :, and therefore yaml user should not bother of it.

RubelMozumder commented 8 months ago

@RubelMozumder @lukaspie I made a suggestion for a new xref parsing function. Can you check and test it? Probably we need to add error catching when yaml cannot be parsed.

Edit: I added an additional test where we can parametrise different inputs. Maybe we should have one for cases where we wan't it to fail, too.

Sorry, before getting there you finished it.

domna commented 8 months ago

I see there are a few tests for indivisual correct and wrong one, thanks @domna. I added the artifact " around the text so that any de-indentation does not affect yaml rules, as each line contains a :, and therefore yaml user should not bother of it.

Doesn't matter for yaml as long as it's inside doc: |. Then basically everything is allowed.