biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Update Protein Ontology regex specification #959

Closed: nataled closed this issue 11 months ago

nataled commented 11 months ago

Prefix

pr

Explanation

The given regex in Bioregistry is incomplete. In general there are two types of local identifiers in PR:

The full regex is thus: ^(\d+|[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}(-\d+)?)$

Contributor ORCID

0000-0001-5809-9523

cthoyt commented 11 months ago

Awesome, thanks @nataled.

nataled commented 11 months ago

The regex given above is from the OBO PURL system, but I can see that it is not as precise as the one I use internally:

^(?:\d{9}|[OPQ][0-9][A-Z0-9]{3}[0-9](?:-\d+)?|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}(?:-\d+)?)$

This version specifies the allowed number of digits (nine) and also specifies that the parentheses are not intended for string capture. Perhaps this version will work better?
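For anyone who wants to try the stricter pattern locally, here is a minimal Python sketch. It assumes the intended regex follows the UniProt accession format with an optional isoform suffix, so the bracketed character classes are written out in full. The helper name `is_valid_pr_id` and the sample identifiers are illustrative only; they are not taken from Bioregistry or from the PR release files.

```python
import re

# Assumed full form of the stricter PR pattern: nine digits, or a
# UniProt-style accession with an optional "-<number>" isoform suffix.
PR_PATTERN = re.compile(
    r"^(?:\d{9}"
    r"|[OPQ][0-9][A-Z0-9]{3}[0-9](?:-\d+)?"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}(?:-\d+)?)$"
)


def is_valid_pr_id(local_id: str) -> bool:
    """Return True if local_id matches the assumed PR local identifier pattern."""
    return PR_PATTERN.match(local_id) is not None


# Illustrative identifiers (hypothetical test inputs, not a curated list).
samples = ["000000024", "P12345", "P12345-2", "A0A024R161", "12345", "p12345"]
results = {s: is_valid_pr_id(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample}: {'valid' if ok else 'invalid'}")
```

With this pattern, the nine-digit form and the three accession-style strings are accepted, while the five-digit string and the lowercase accession are rejected. The looser `\d+` from the OBO PURL version would accept the five-digit string as well.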

cthoyt commented 11 months ago

Yes, this one appears to work, but what is going on with the colons?

nataled commented 11 months ago

The ?: at the beginning of each parenthetical expression says "don't save the contents of the parentheses match". Those parenthetical expressions are there only because they represent optional parts. Taught to me and recommended by James Overton.
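To make the difference concrete, here is a small Python sketch comparing a capturing and a non-capturing version of one alternative from the pattern. The sub-pattern is shortened to the [OPQ] accession branch for readability, and the sample identifier is hypothetical.

```python
import re

# Same sub-pattern twice: once with plain (capturing) parentheses,
# once with (?:...) non-capturing parentheses.
capturing = re.compile(r"^([OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?)$")
non_capturing = re.compile(r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9](?:-\d+)?)$")

sample = "P12345-2"  # hypothetical identifier with an isoform suffix

m_cap = capturing.match(sample)
m_non = non_capturing.match(sample)

# Both patterns accept exactly the same strings...
print(m_cap.group(0), m_non.group(0))

# ...but only the capturing version stores the parenthesised parts.
print(m_cap.groups())   # saved sub-matches
print(m_non.groups())   # nothing saved
print(capturing.groups, non_capturing.groups)  # number of capture groups
```

The two patterns match the same set of strings; the only difference is that the non-capturing form does not record sub-matches, which keeps group numbering clean for any tool that consumes the pattern.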

nataled commented 11 months ago

I noticed that the first regex appears to have links in it. When I did a copy/paste they showed up.