GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal
38 stars, 21 forks

Validate all regexes prior to release #94

Open cmungall opened 3 years ago

cmungall commented 3 years ago

It's not totally clear what regex syntax is implemented - POSIX? Perl? There are some custom aspects, e.g. {unit}.

With a specification of the regex syntax, we can implement a checker.

There are some things that are likely typos in mixs5 that could be fixed:

erroneous inclusion of empty strings (double |):

wall_texture: [crows feet|crows-foot stomp||double skip|hawk and trowel|knockdown|popcorn|orange peel|rosebud stomp|Santa-Fe texture|skip trowel|smooth|stomp knockdown|swirl]

additional quotation marks:

door_type_wood: "[bettened and ledged|battened|ledged and braced|battened|ledged and framed|battened|ledged, braced and frame|framed and paneled|glashed or sash|flush|louvered|wire gauged]"

Also, I think there is a typo here: it should be "battened", not "bettened".
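As a rough illustration of the kind of checker meant here, the sketch below flags the two problems listed above in a bracketed choice list. It assumes the pseudo-regex for a choice field is a single `[a|b|c]` list; the function name `check_enum_pattern` is hypothetical, and the duplicate-option check is an extra assumption not discussed in this thread.

```python
def check_enum_pattern(raw: str) -> list[str]:
    """Return a list of problems found in a MIxS-style `[a|b|c]` choice list.

    Hypothetical helper: the real pseudo-regex syntax is not formally specified.
    """
    problems = []
    text = raw.strip()

    # Stray quotation marks wrapped around the whole pattern.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        problems.append("surrounding quotation marks")
        text = text[1:-1].strip()

    # Assumption: a choice list is delimited by square brackets.
    if not (text.startswith("[") and text.endswith("]")):
        problems.append("not a bracketed choice list")
        return problems

    options = [opt.strip() for opt in text[1:-1].split("|")]

    # A doubled '|' (or a leading/trailing '|') yields an empty option.
    if any(opt == "" for opt in options):
        problems.append("empty option (doubled '|')")

    # Extra check (assumption): the same option listed more than once.
    seen, duplicates = set(), []
    for opt in options:
        if opt and opt in seen and opt not in duplicates:
            duplicates.append(opt)
        seen.add(opt)
    if duplicates:
        problems.append("duplicate options: " + ", ".join(duplicates))

    return problems
```

Run on the two values quoted above, this flags the doubled `|` in wall_texture and the surrounding quotation marks in door_type_wood, and it also reports "battened" as a repeated option in the latter. It does not catch spelling errors such as "bettened".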

cmungall commented 3 years ago

I note that when MIxS is rendered as XML for ENA, the pseudo-regexes are expanded into true regexes. Does anyone know the process here? Is this manual or is there a script to convert?

I'm investigating whether ENA repairs the regexes above, but these fields don't seem to appear in the built environment template:

https://www.ebi.ac.uk/ena/browser/view/ERC000031

josieburgin commented 3 years ago

Hi @cmungall, this process is manual on the ENA side. Regexes are added manually for fields that are numeric or must fit a particular format (e.g. collection date) and lists of 'text choices' are added manually for fields where the user should select from a list of options. If we see typos during implementation we would feed this back to GSC.

As you say, it seems that these terms have not been added to the ENA implementation of the MIxS built environment package. I will raise this with the team and get back to you as to why these are missing.

ramonawalls commented 3 years ago

Validation is outside the scope of the standard and is carried out by the repository. That said, with our move to linkml format for MIxS, we will be able to offer some validation tools (that was an original goal of mixs as rdf). This is not an issue for the MIxS6 release, so I am removing it from this project.

We need to merge the mixs as rdf work with this main mixs repo. This issue can be picked up afterwards.

cmungall commented 2 years ago

We can look to centralizing this when we make the linkml. It may be useful to keep both a pseudo regex field and a true regex field.
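To make the "both fields" idea concrete, here is a minimal sketch that stores the original pseudo regex next to a derived true regex. It covers only bracketed choice lists such as `[a|b|c]`. The class and function names (`FieldPattern`, `enum_to_regex`) are hypothetical, and the conversion rule (escape each option, drop empty ones, anchor the whole pattern) is an assumption, not an agreed MIxS rule.

```python
import re
from dataclasses import dataclass


def enum_to_regex(pseudo: str) -> str:
    """Turn a `[a|b|c]` choice list into an anchored regex (assumed rule)."""
    inner = pseudo.strip()[1:-1]  # strip the enclosing square brackets
    options = [opt.strip() for opt in inner.split("|")]
    options = [opt for opt in options if opt]  # drop empty alternatives
    # Escape each option so characters like '-' or ',' are matched literally.
    return "^(" + "|".join(re.escape(opt) for opt in options) + ")$"


@dataclass
class FieldPattern:
    """Keeps the source pseudo regex alongside the derived true regex."""
    pseudo: str
    true_regex: str

    @classmethod
    def from_pseudo(cls, pseudo: str) -> "FieldPattern":
        return cls(pseudo=pseudo, true_regex=enum_to_regex(pseudo))

    def matches(self, value: str) -> bool:
        return re.search(self.true_regex, value) is not None
```

Keeping the pseudo regex untouched means the derived field can be regenerated whenever the conversion rules change.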

We should centralize the rules for making true regexes from the pseudo regexes, e.g.