biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License
110 stars, 48 forks

Update `glygen` #1015

Closed jeet-vora closed 7 months ago

jeet-vora commented 7 months ago

Includes GlyGen edits

cthoyt commented 7 months ago

Hi @jeet-vora, thanks for the PR. I've reorganized a bit to denote that you are the "contact". The "contributor" field points to me because I am the one who added this record to the Bioregistry, not because I am the owner of the record. This is useful to keep track of, as it makes it possible to get in touch directly and ask people why they did things the way they did.

Further, I switched the email you wrote for your direct email. It is Bioregistry policy to use a single point of contact, and explicitly not group emails.

With respect to the other part you asked about "Missing LUI pattern" - the question is, can you come up with a regular expression that we can use to validate a given GlyGen local unique identifier like G24361QY? Is it always, for example, a letter, then some numbers, then two more letters?

jeet-vora commented 7 months ago

> Hi @jeet-vora, thanks for the PR. I've reorganized a bit to denote that you are the "contact".

Thanks

> The "contributor" field points to me because I am the one who added this record to the Bioregistry, not because I am the owner of the record. This is useful to keep track of, as it makes it possible to get in touch directly and ask people why they did things the way they did.

Sounds good

> Further, I switched the email you wrote for your direct email. It is Bioregistry policy to use a single point of contact, and explicitly not group emails.

Not an issue. Generally, we prefer to use the GlyGen email.

> With respect to the other part you asked about "Missing LUI pattern" - the question is, can you come up with a regular expression that we can use to validate a given GlyGen local unique identifier like G24361QY? Is it always, for example, a letter, then some numbers, then two more letters?

We have two primary identifiers in GlyGen: one for proteins, which is the UniProtKB accession (e.g., P14210), and one for glycans, which is the GlyTouCan identifier (e.g., G17689DH).

The patterns are:

Protein/UniProtKB ac: `^([A-N,R-Z][0-9]([A-Z][A-Z, 0-9][A-Z, 0-9][0-9]){1,2})|([O,P,Q][0-9][A-Z, 0-9][A-Z, 0-9][A-Z, 0-9][0-9])(\.\d+)?$`

Glycan/GlyTouCan: `^G[0-9]{5}[A-Z]{2}$`

Can the above be used?
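
For readers who want to try the proposed patterns, here is a minimal Python sketch using only the standard `re` module. The GlyTouCan pattern is taken verbatim from the comment above. The UniProtKB pattern is a tidied variant and should be treated as an assumption, not the text from the thread: in the pattern as written, commas and spaces inside the brackets are literal characters, and the top-level `|` means `^` applies only to the first branch and `$` only to the second. The helper names `is_glytoucan` and `is_uniprot` are hypothetical.

```python
import re

# GlyTouCan accession pattern, exactly as proposed in the thread.
GLYTOUCAN = re.compile(r"^G[0-9]{5}[A-Z]{2}$")

# Tidied UniProtKB accession pattern (an assumption, not the thread's text):
# commas and spaces are dropped from the character classes, and both
# branches sit inside one group so that ^ and $ apply to each of them.
UNIPROT = re.compile(
    r"^(?:[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    r"|[OPQ][0-9][A-Z0-9]{3}[0-9])(?:\.\d+)?$"
)


def is_glytoucan(identifier: str) -> bool:
    """Return True if the whole string looks like a GlyTouCan accession."""
    return GLYTOUCAN.fullmatch(identifier) is not None


def is_uniprot(identifier: str) -> bool:
    """Return True if the whole string looks like a UniProtKB accession."""
    return UNIPROT.fullmatch(identifier) is not None


if __name__ == "__main__":
    # Examples taken from the thread.
    for candidate in ["G17689DH", "G24361QY", "P14210"]:
        print(candidate, is_glytoucan(candidate), is_uniprot(candidate))
```

Under these patterns the two identifier spaces do not overlap for the examples in the thread: the glycan accessions match only the GlyTouCan pattern, and `P14210` matches only the UniProtKB one.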

> In general, the Bioregistry discourages the titles for prefixes from being written this way. What does GlyGen stand for?

Our resource name is GlyGen, not "Computational and Informatics Resources for Glycoscience". Searching the registry for "GlyGen" does not return the GlyGen record. We are fine with removing "Computational and Informatics Resources for Glycoscience"; alternatively, the name could be "GlyGen Computational and Informatics Resources for Glycoscience", but that would be too long.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (1366cfc) 40.82% compared to head (298c555) 40.82%. Report is 21 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #1015   +/-   ##
=======================================
  Coverage   40.82%   40.82%
=======================================
  Files         138      138
  Lines        7928     7928
  Branches     1847     1847
=======================================
  Hits         3237     3237
  Misses       4489     4489
  Partials      202      202
```

cthoyt commented 7 months ago

Thanks @jeet-vora, I incorporated your suggestions. Note that UniProt is a fully distinct semantic space and therefore does not get included in the glygen prefix, even if the corresponding GlyGen website can resolve UniProt IDs.
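
To illustrate what keeping the semantic spaces distinct means in practice, here is a small sketch of CURIE validation against a per-prefix pattern. The `PATTERNS` dictionary and the `validate_curie` helper are hypothetical and stand in for a real registry lookup. The `glygen` pattern is assumed to be the GlyTouCan pattern suggested earlier in this thread, which may differ from what was finally recorded.

```python
import re

# Hypothetical in-memory stand-in for a registry; only the glygen prefix
# is listed, with the GlyTouCan pattern suggested in this thread (assumed).
PATTERNS = {
    "glygen": re.compile(r"^G[0-9]{5}[A-Z]{2}$"),
}


def validate_curie(curie: str) -> bool:
    """Return True if the prefix is known and the local identifier matches."""
    prefix, sep, identifier = curie.partition(":")
    pattern = PATTERNS.get(prefix.lower())
    if not sep or pattern is None:
        # No colon, or a prefix this sketch does not know about.
        return False
    return pattern.fullmatch(identifier) is not None


if __name__ == "__main__":
    print(validate_curie("glygen:G17689DH"))  # glycan accession
    print(validate_curie("glygen:P14210"))    # UniProt accession under the wrong prefix
```

In this sketch a UniProt accession such as `P14210` fails validation under `glygen`, even though the GlyGen website can resolve it. It would be expressed under its own prefix instead.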

jeet-vora commented 6 months ago

@cthoyt Thanks a lot for helping us update the info.