biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Handling letter-case in IDs in a consistent, easy-to-maintain way #735

Open colleenXu opened 1 year ago

colleenXu commented 1 year ago

Background

During the JQ work, we encountered an issue with an external API (CTD) and the letter-case of the bioentity IDs in its responses (ref: Jackson's comment and my reply):

Issue

We realized that this was a larger issue (ref: my braindump here, a "current situation" post here):

colleenXu commented 1 year ago

The first step is figuring out the identifier-case formats for each of the namespaces we use. I'm likely the lead on this.

Reviewing the namespaces we use to find those with lower-case letters in their IDs (aka the exceptions) will take a bit of time, but it shouldn't be too difficult. (We don't need a full inventory of every namespace we use, which of those have letters in their IDs, which of those are all-caps...)

But there's an issue that I think needs untangling: I'm not clear on Translator's standards for identifier-case format. It's not clear from a quick look at the biolink-model repo...

gaurav commented 1 year ago

FWIW, NodeNorm doesn't expect identifiers to be purely numerical. Here is the current distribution of CURIE prefixes with non-numerical identifiers in NodeNorm:

| Count | Prefix | Example |
| ---: | --- | --- |
| 4 | KEGG.REACTION | KEGG.REACTION:R06368 |
| 6 | KEGG.DISEASE | KEGG.DISEASE:H00484 |
| 22 | TCDB | TCDB:3.A.1.105.1 |
| 175 | PANTHER.PATHWAY | PANTHER.PATHWAY:P06210 |
| 222 | EC | EC:2.1.1.n11 |
| 271 | OMIM | OMIM:PS278300 |
| 1244 | ComplexPortal | ComplexPortal:CPX-1047 |
| 2438 | ICD10 | ICD10:H44.44 |
| 7153 | SGD | SGD:S000028545 |
| 13892 | dictyBase | dictyBase:DDB_G0289229 |
| 14544 | DRUGBANK | DRUGBANK:DB16588 |
| 19277 | KEGG.COMPOUND | KEGG.COMPOUND:C18469 |
| 26128 | PANTHER.FAMILY | PANTHER.FAMILY:PTHR24082:SF38 |
| 30248 | SMPDB | SMPDB:SMP0052819 |
| 30293 | FB | FB:FBgn0266114 |
| 38059 | ZFIN | ZFIN:ZDB-LINCRNAG-131127-634 |
| 48781 | WormBase | WormBase:WBGene00172747 |
| 50770 | NCIT | NCIT:C123904 |
| 109255 | REACT | REACT:R-HSA-5617833.2 |
| 119457 | UNII | UNII:GC91Z6YS52 |
| 171624 | PR | PR:Q8TF17-5 |
| 217920 | HMDB | HMDB:HMDB0014933 |
| 341463 | MESH | MESH:C000609086 |
| 2399666 | CHEMBL.COMPOUND | CHEMBL.COMPOUND:CHEMBL4247838 |
| 3195515 | UMLS | UMLS:C2334787 |
| 32043961 | ENSEMBL | ENSEMBL:ENSCURG00000012216 |
| 110286527 | INCHIKEY | INCHIKEY:WCVKZUDHHMZXCB-UHFFFAOYSA-N |
| 248913563 | UniProtKB | UniProtKB:A0A8H5S4H9 |

If the list of all those identifiers will be useful to you (it's 2.9G compressed), please let me know and I can send it over!

colleenXu commented 1 year ago

@gaurav

This table of namespaces + example IDs is great. We don't need a list of all IDs for a namespace; the examples here are fine.

colleenXu commented 1 year ago

Note to self: there are examples where the ID itself mixes upper- and lower-case letters: FB (FlyBase) and WormBase. See Gaurav's list above.
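
For what it's worth, here's a rough sketch of how the "find the exceptions" review could be scripted against example CURIEs like the ones in Gaurav's list. This is only an illustration, not anything BTE actually does: the function name `classify_id_case` and the `EXAMPLES` list are made up for this sketch, and it assumes the prefix is everything before the first colon.

```python
def classify_id_case(curie: str) -> str:
    """Classify the letters in a CURIE's local ID (hypothetical helper).

    Returns "none" (no letters), "upper", "lower", or "mixed".
    """
    # Split on the first colon only: some local IDs contain colons
    # themselves (e.g. PANTHER.FAMILY:PTHR24082:SF38).
    _prefix, _, local_id = curie.partition(":")
    letters = [ch for ch in local_id if ch.isalpha()]
    if not letters:
        return "none"
    if all(ch.isupper() for ch in letters):
        return "upper"
    if all(ch.islower() for ch in letters):
        return "lower"
    return "mixed"


# A few example CURIEs copied from the list in this thread.
EXAMPLES = [
    "EC:2.1.1.n11",
    "FB:FBgn0266114",
    "WormBase:WBGene00172747",
    "PANTHER.FAMILY:PTHR24082:SF38",
    "ICD10:H44.44",
    "MESH:C000609086",
]

# "Exceptions" here means: the example ID contains at least one lower-case letter.
exceptions = sorted(
    curie.partition(":")[0]
    for curie in EXAMPLES
    if classify_id_case(curie) in ("lower", "mixed")
)
print(exceptions)
```

With these six examples, the flagged prefixes are EC, FB, and WormBase. One caveat: a single example per namespace can flag an exception, but it can't prove a namespace is all-caps, so anything not flagged would still need a look at the namespace's documented ID format.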

colleenXu commented 1 year ago

I had a discussion with Sierra Moxon (Translator data-modeling team), on my concerns with Translator standards:

However, it's unclear how to resolve the issues/discrepancies Translator-wide:

* Who's using this namespace / is affected? (Not easy to tell when the namespace isn't used for nodes/bioentities.)
* Getting everyone to agree on what to do and when. Although if the UI has requirements...that helps a lot.
* Finding these issues in the first place can be tricky (especially Translator-wide).
* Standards may not be stable: the resource itself can change its ID format or prefix, or bioregistry can update.