Closed cthoyt closed 11 months ago
@nataled it also appears that P0DP23
does not match your regex, is this because it only applies to a subset of UniProt?
In addition to the one you already have (a nine-digit string), there are the following four examples: Q15796 Q15796-1 A0A0G2QC33 A0A0G2QC33-2
I do not see how P0DP23 fails the regex:
^(\d+ covers the pure digit part
[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)? covers the six-place UniProtKB accessions
[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}(-\d+)?)$ covers the ten-place UniProtKB accessions
Thus we need to look only at the six-place part:
[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?
[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](-\d+)? (expanding the {3} for ease of aligning)
P 0 D P 2 3
It appears the other ones also fail against this regex
What's crazy is that I pulled the regex from the OBO Foundry PURL system. In that context they do not fail.
I just tried it with a small script and they all pass. How are you assessing the pass/fail?
The test run with Python's builtin unittest
module's https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertRegex. This apparently uses re.search
. The error message is as follows:
self.assertRegex(example, regex, msg=f"[{prefix}] invalid LUID: {example}")
E AssertionError: Regex didn't match: '^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$' not found in 'P0DP23' : [pr] invalid LUID: P0DP23
I confirmed that if you run the following, nothing is returned
>>> import re
>>> x = re.compile(r'^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$')
>>> x.search('P0DP23')
Similarly, x.match('P0DP23')
also returns none
I see the problem. Somehow my pasted regex is losing lots of characters, so the one you're testing is an inaccurate regex. I'm pasting the regex into a text file (attached) to avoid this issue. Aligning the two, I see that github didn't much care for the (?: notation, plus did a few other odd removals.
All modified lines are covered by tests :white_check_mark:
Files | Coverage Δ | |
---|---|---|
src/bioregistry/version.py | 68.42% <100.00%> (ø) |
... and 12 files with indirect coverage changes
:loudspeaker: Thoughts on this report? Let us know!.
@nataled looks like the last message did the trick! Thanks!
Closes #959
@nataled can you provide a few more example local unique identifiers? I guess that the regex you supplied also includes UniProt isoforms