biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License
114 stars 49 forks source link

Update PR regex #960

Closed cthoyt closed 11 months ago

cthoyt commented 11 months ago

Closes #959

@nataled can you provide a few more example local unique identifiers? I guess that the regex you supplied also includes UniProt isoforms

cthoyt commented 11 months ago

@nataled it also appears that P0DP23 does not match your regex, is this because it only applies to a subset of UniProt?

nataled commented 11 months ago

In addition to the one you already have (a nine-digit string), there are the following four examples: Q15796 Q15796-1 A0A0G2QC33 A0A0G2QC33-2

I do not see how P0DP23 fails the regex:

^(\d+ covers the pure digit part
[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)? covers the six-place UniProtKB accessions
[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}(-\d+)?)$ covers the ten-place UniProtKB accessions

Thus we need to look only at the six-place part:

[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?
[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](-\d+)?   (expanding the {3} for ease of aligning)
  P    0    D       P          2    3
cthoyt commented 11 months ago

It appears the other ones also fail against this regex

nataled commented 11 months ago

What's crazy is that I pulled the regex from the OBO Foundry PURL system. In that context they do not fail.

nataled commented 11 months ago

I just tried it with a small script and they all pass. How are you assessing the pass/fail?

cthoyt commented 11 months ago

The test run with Python's builtin unittest module's https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertRegex. This apparently uses re.search. The error message is as follows:

    self.assertRegex(example, regex, msg=f"[{prefix}] invalid LUID: {example}")
E   AssertionError: Regex didn't match: '^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$' not found in 'P0DP23' : [pr] invalid LUID: P0DP23

I confirmed that if you run the following, nothing is returned

>>> import re
>>> x = re.compile(r'^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$')
>>> x.search('P0DP23')

Similarly, x.match('P0DP23') also returns none

nataled commented 11 months ago

I see the problem. Somehow my pasted regex is losing lots of characters, so the one you're testing is an inaccurate regex. I'm pasting the regex into a text file (attached) to avoid this issue. Aligning the two, I see that github didn't much care for the (?: notation, plus did a few other odd removals.

PR_regex.txt

codecov[bot] commented 11 months ago

Codecov Report

All modified lines are covered by tests :white_check_mark:

Files Coverage Δ
src/bioregistry/version.py 68.42% <100.00%> (ø)

... and 12 files with indirect coverage changes

:loudspeaker: Thoughts on this report? Let us know!.

cthoyt commented 11 months ago

@nataled looks like the last message did the trick! Thanks!