GenomicsStandardsConsortium / mixs

"Minimum Information about any (X) Sequence" (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

allow whitespace between delimiters in `Value syntax` patterns? #465

Open turbomam opened 2 years ago

turbomam commented 2 years ago

Several literal delimiters appear in the various Value syntaxes.

When validating, should we allow an arbitrary amount of whitespace around delimiters?

If not, then the Value syntaxes will be interpreted literally with respect to padding, exactly as the term submitters entered them.
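To make the two options concrete, here is a minimal Python sketch. It assumes a hypothetical Value syntax of `{text};{integer}` (not taken from any actual MIxS term) and compares a strict pattern with one that tolerates whitespace around the `;` delimiter. The names `STRICT`, `LENIENT`, and `is_valid` are illustrative only.

```python
import re

# Hypothetical Value syntax "{text};{integer}"; not an actual MIxS term.
# The text part may contain inner spaces but may not start or end with one.
TEXT = r"[^;\s](?:[^;]*[^;\s])?"

# Strict reading: the delimiter must appear exactly as written, with no padding.
STRICT = re.compile(TEXT + r";\d+")

# Lenient reading: any amount of whitespace is allowed around the delimiter.
LENIENT = re.compile(TEXT + r"\s*;\s*\d+")


def is_valid(value: str, allow_padding: bool) -> bool:
    """Check a value against the strict or the whitespace-tolerant pattern."""
    pattern = LENIENT if allow_padding else STRICT
    return pattern.fullmatch(value) is not None
```

Under the strict reading `corn;5` passes and `corn ; 5` fails; under the lenient reading both pass. Whichever way we decide, the pattern stored on each slot would need to say so explicitly.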

turbomam commented 2 years ago

Note that we are not treating | as a literal delimiter. It is the "or" operator, both within enumerations and between Value syntax components.

mslarae13 commented 4 months ago

I tried to find an issue for this, and this is the closest one I could find.

The inconsistent use of these "literal delimiters" is confusing. I propose

I see no use for /, -, x, and |. I know | is often used in place of ;, but they seem the same to me, and the choice of | vs. ; seems to have been at the discretion of the creator.

@turbomam curious on your thoughts.

turbomam commented 4 months ago

I think this analysis is a great step forward. For me, the next step is some valid and invalid data files that illustrate your positions.

I hope you, I, and everybody else are clear on the current implementation: the LinkML language doesn't have any concept of a "value syntax". The same goes for "expected value". I made a good-faith mapping of those columns in the MIxS 6.0 Google sheet to LinkML range and pattern constraints on the corresponding slots.

I think your comment above is addressing the fact that I totally punted on slots like agrochem_addition, which have pseudo-patterns for their flattened, pre-composed values.

turbomam commented 4 months ago

In the agrochem_addition example, I disagree that , is being used for elements of a list, and I think it may be hard to use the word "related" in a technical specification.

roundup, 5 milligram per liter, 2018-06-21

is a pre-composed sequence of things that would be captured in sub-slots in NMDC, like agrochem_addition.agent, agrochem_addition.dose and agrochem_addition.application_date.

I still like your ideas for bringing clarity to this, and hopefully we can show examples of successful and unsuccessful validation.
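As a rough illustration of what "pre-composed" means here, the sketch below splits the example string into three parts. The class `AgrochemAddition` and the function `parse_agrochem_addition` are hypothetical, the field names mirror the sub-slots mentioned above, and stripping whitespace around the commas is an assumption that this issue has not yet settled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AgrochemAddition:
    """Hypothetical container mirroring the sub-slots named above."""
    agent: str
    dose: str
    application_date: date


def parse_agrochem_addition(value: str) -> AgrochemAddition:
    # Assumption: whitespace around each comma is insignificant.
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}")
    agent, dose, when = parts
    # date.fromisoformat raises ValueError if the date is not YYYY-MM-DD.
    return AgrochemAddition(agent, dose, date.fromisoformat(when))
```

Calling `parse_agrochem_addition("roundup, 5 milligram per liter, 2018-06-21")` yields an agent, a dose string, and a date, while a value with a missing part raises `ValueError`. That is the kind of valid/invalid pair I would like to see spelled out.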

turbomam commented 4 months ago

As far as LinkML is concerned, | is the only acceptable character for delimiting multiple values in a multi-valued slot. In fact, in order for LinkML to parse the HACCP_term you provided out of a CSV or TSV, they would have to be rendered like this:

[tetrodotoxic poisoning[FOODON:03530249]|neurotoxic shellfish poisoning[FOODON:03530246]]

The outer square brackets are currently required. I'm not sure how the inner square brackets will be handled. I'll take responsibility for working through those examples, but hopefully @cmungall will have some thoughts to share.

But all of this would contradict your stated preference of ; for concatenating multiple values.
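For illustration only, here is a small Python sketch of the rendering described above: outer square brackets around the cell, with | between values. This is not LinkML's own loader. `split_multivalued` is a hypothetical helper, and leaving the inner brackets untouched is an assumption, since their handling is still an open question.

```python
from typing import List


def split_multivalued(cell: str) -> List[str]:
    """Split a bracketed, pipe-delimited cell into its individual values."""
    cell = cell.strip()
    if not (cell.startswith("[") and cell.endswith("]")):
        raise ValueError("expected the cell to be wrapped in outer square brackets")
    # Drop only the outermost brackets; inner brackets are left as-is (assumption).
    inner = cell[1:-1]
    return inner.split("|")
```

Applied to the HACCP_term rendering above, this returns two values, each still carrying its bracketed FOODON identifier.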

mslarae13 commented 2 months ago

> In the agrochem_addition example, I disagree that , is being used for elements of a list, and I think it may be hard to use the word "related" in a technical specification.
>
> roundup, 5 milligram per liter, 2018-06-21
>
> is a pre-composed sequence of things that would be captured in sub-slots in NMDC, like agrochem_addition.agent, agrochem_addition.dose and agrochem_addition.application_date.
>
> I still like your ideas for bringing clarity to this, and hopefully we can show examples of successful and unsuccessful validation.

@turbomam
Let's separate NMDC from GSC here. NMDC cares about the different pieces of agrochem_addition because we have a database. MIxS is a standard, so it's up to the institutes that implement this slot to determine how it's stored. GSC doesn't have .agent, .dose, or .application_date.

As such, I'm not sure what you're trying to get at with this for GSC. For NMDC, yes, absolutely. But is it anything GSC would do?

mslarae13 commented 2 months ago

> As far as LinkML is concerned, | is the only acceptable character for delimiting multiple values in a multi-valued slot. In fact, in order for LinkML to parse the HACCP_term you provided out of a CSV or TSV, they would have to be rendered like this:
>
> [tetrodotoxic poisoning[FOODON:03530249]|neurotoxic shellfish poisoning[FOODON:03530246]]
>
> The outer square brackets are currently required. I'm not sure how the inner square brackets will be handled. I'll take responsibility for working through those examples, but hopefully @cmungall will have some thoughts to share.
>
> But all of this would contradict your stated preference of ; for concatenating multiple values.

Ah! Ok, well then no ; and only use |

So...

turbomam commented 2 months ago

Doesn't that example for agrochem_addition above use a semicolon where we agreed to use a pipe?

turbomam commented 2 months ago

I agree, my mention of hypothetical NMDC sub-slots like agrochem_addition.agent, agrochem_addition.dose and agrochem_addition.application_date isn't directly actionable by MIxS. But getting into this discipline is the best hope MIxS has for terms like agrochem_addition becoming machine-actionable.

At this point in time they are inconsistent and unenforceable.

turbomam commented 2 months ago

I am really concerned by the apparent reality that none of the people we routinely interact with know how to (or are willing to) make a valid, minimal table of samples (MimsSoil perhaps) that complies with the standard. Once a couple of people contribute in that way, we can incrementally resolve issues like the syntax and legal punctuation in agrochem_addition etc.

mslarae13 commented 2 months ago

> Doesn't that example for agrochem_addition above use a semicolon where we agreed to use a pipe?

forgot to update that one. fixed