Closed egonw closed 1 year ago
The hyphen divides the blocks, with three in total ... it looks like it matches to me?
**^MSBNK**-[A-Z0-9]{1,32}-**[A-Z0-9]{1,64}$**
and
**MSBNK**-Fac_Eng_Univ_Tokyo-**JP001576**
...with bold indicating first and third blocks respectively? The char count seems OK in the second block ...
Hi Egon, I cannot really follow here. We have exactly two hyphens in every ACCESSION. The hyphens split it into three blocks: MSBNK, the contributor, and an ID given by the contributor. No hyphen is allowed in the third block, and I'm pretty sure we don't have any in our data.
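To make the three-block structure concrete, here is a minimal Python sketch that splits an ACCESSION on its hyphens. The helper name `split_accession` is hypothetical and not part of MassBank; it only illustrates the "exactly two hyphens" rule described above.

```python
def split_accession(accession: str) -> tuple[str, str, str]:
    """Split an ACCESSION into (prefix, contributor, record ID).

    Hypothetical helper for illustration only. It assumes exactly two
    hyphens, i.e. none of the three blocks contains a hyphen itself.
    """
    parts = accession.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two hyphens in {accession!r}")
    prefix, contributor, record_id = parts
    return prefix, contributor, record_id


if __name__ == "__main__":
    print(split_accession("MSBNK-Fac_Eng_Univ_Tokyo-JP001576"))
```

For the example from this thread, the contributor block is `Fac_Eng_Univ_Tokyo`, which contains underscores and lowercase letters but no hyphen.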
Right. Sorry. I did indeed draw the wrong conclusion. The problem is that the regexp fails in Wikidata: https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P6689#%22Format%22_violations
Violations count: 48908
[ethanol (Q153)](https://www.wikidata.org/wiki/Q153): [MSBNK-Fac_Eng_Univ_Tokyo-JP006778](https://massbank.eu/MassBank/RecordDisplay?id=MSBNK-Fac_Eng_Univ_Tokyo-JP006778)
[carbon dioxide (Q1997)](https://www.wikidata.org/wiki/Q1997): [MSBNK-Fac_Eng_Univ_Tokyo-JP001576](https://massbank.eu/MassBank/RecordDisplay?id=MSBNK-Fac_Eng_Univ_Tokyo-JP001576)
[benzene (Q2270)](https://www.wikidata.org/wiki/Q2270): [MSBNK-Fac_Eng_Univ_Tokyo-JP002103](https://massbank.eu/MassBank/RecordDisplay?id=MSBNK-Fac_Eng_Univ_Tokyo-JP002103)
[benzene (Q2270)](https://www.wikidata.org/wiki/Q2270): [MSBNK-Fac_Eng_Univ_Tokyo-JP002347](https://massbank.eu/MassBank/RecordDisplay?id=MSBNK-Fac_Eng_Univ_Tokyo-JP002347)
Right. It's the upper/lower case mismatch then. Agreed?
But we have a mistake in the regex in our documentation in a different place: the second block must also allow lowercase letters. This matches:
echo MSBNK-Fac_Eng_Univ_Tokyo-JP001576 | grep -E '^MSBNK-[A-Za-z0-9_]{1,32}-[A-Z0-9_]{1,64}$'
So the regex should be "^MSBNK-[A-Za-z0-9_]{1,32}-[A-Z0-9_]{1,64}$"
and not "^MSBNK-[A-Z0-9_]{1,32}-[A-Z0-9_]{1,64}$".
I will fix that. Thanks for reporting.
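As a quick cross-check of the upper/lower case diagnosis, the following Python sketch runs both patterns quoted in this thread against the accessions from the Wikidata violation report. The names `OLD_PATTERN`, `NEW_PATTERN`, and `check` are made up for this illustration, and Python's `re` module stands in for whatever engine Wikidata or MassBank actually use.

```python
import re

# Uppercase-only contributor block (the regex that produced the violations).
OLD_PATTERN = re.compile(r"^MSBNK-[A-Z0-9_]{1,32}-[A-Z0-9_]{1,64}$")
# Contributor block that also accepts lowercase letters (the proposed fix).
NEW_PATTERN = re.compile(r"^MSBNK-[A-Za-z0-9_]{1,32}-[A-Z0-9_]{1,64}$")

# Accessions taken from the violation list quoted in this thread.
ACCESSIONS = [
    "MSBNK-Fac_Eng_Univ_Tokyo-JP006778",
    "MSBNK-Fac_Eng_Univ_Tokyo-JP001576",
    "MSBNK-Fac_Eng_Univ_Tokyo-JP002103",
    "MSBNK-Fac_Eng_Univ_Tokyo-JP002347",
]


def check(accession: str) -> dict[str, bool]:
    """Report whether each pattern accepts the whole accession string."""
    return {
        "old": OLD_PATTERN.fullmatch(accession) is not None,
        "new": NEW_PATTERN.fullmatch(accession) is not None,
    }


if __name__ == "__main__":
    for accession in ACCESSIONS:
        print(accession, check(accession))
```

All four accessions are rejected by the old pattern and accepted by the new one, since the lowercase letters in `Fac_Eng_Univ_Tokyo` are the only characters outside the old character class.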
Thanks for reporting. It's fixed in dev and will go online soon.
The current regular expression is
^MSBNK-[A-Z0-9_]{1,32}-[A-Z0-9_]{1,64}$
(source) but this does not match
MSBNK-Fac_Eng_Univ_Tokyo-JP001576
which has a hyphen in the second block. I propose updating the regexp to
^MSBNK-[A-Z0-9_]{1,32}-[A-Z0-9_-]{1,64}$
adding the hyphen as allowed in the second block.