Open colleenXu opened 1 year ago
The first step is figuring out the identifier-case formats for each of the namespaces we use. I'm likely the lead on this.
It'll take a bit of time, but not be too difficult, to review the namespaces we use and find those with lower-case letters in their IDs (aka the exceptions). (It's unnecessary to make a list of all namespaces we use, which ones of those have IDs w/ letters, which ones of those have all-caps...)
But there's an issue that I think needs untangling: I'm not clear on Translator's standards for identifier-case format. It's not clear from a quick look at the biolink-model repo...
PMC
or not?)
FWIW, NodeNorm doesn't expect identifiers to be purely numerical. Here is the current distribution of CURIE prefixes with non-numerical identifiers in NodeNorm:
Count | Prefix | Example |
---|---|---|
4 | KEGG.REACTION | KEGG.REACTION:R06368 |
6 | KEGG.DISEASE | KEGG.DISEASE:H00484 |
22 | TCDB | TCDB:3.A.1.105.1 |
175 | PANTHER.PATHWAY | PANTHER.PATHWAY:P06210 |
222 | EC | EC:2.1.1.n11 |
271 | OMIM | OMIM:PS278300 |
1244 | ComplexPortal | ComplexPortal:CPX-1047 |
2438 | ICD10 | ICD10:H44.44 |
7153 | SGD | SGD:S000028545 |
13892 | dictyBase | dictyBase:DDB_G0289229 |
14544 | DRUGBANK | DRUGBANK:DB16588 |
19277 | KEGG.COMPOUND | KEGG.COMPOUND:C18469 |
26128 | PANTHER.FAMILY | PANTHER.FAMILY:PTHR24082:SF38 |
30248 | SMPDB | SMPDB:SMP0052819 |
30293 | FB | FB:FBgn0266114 |
38059 | ZFIN | ZFIN:ZDB-LINCRNAG-131127-634 |
48781 | WormBase | WormBase:WBGene00172747 |
50770 | NCIT | NCIT:C123904 |
109255 | REACT | REACT:R-HSA-5617833.2 |
119457 | UNII | UNII:GC91Z6YS52 |
171624 | PR | PR:Q8TF17-5 |
217920 | HMDB | HMDB:HMDB0014933 |
341463 | MESH | MESH:C000609086 |
2399666 | CHEMBL.COMPOUND | CHEMBL.COMPOUND:CHEMBL4247838 |
3195515 | UMLS | UMLS:C2334787 |
32043961 | ENSEMBL | ENSEMBL:ENSCURG00000012216 |
110286527 | INCHIKEY | INCHIKEY:WCVKZUDHHMZXCB-UHFFFAOYSA-N |
248913563 | UniProtKB | UniProtKB:A0A8H5S4H9 |
If the list of all those identifiers will be useful to you (it's 2.9G compressed), please let me know and I can send it over!
@gaurav
This table of namespaces + example IDs is great. We don't need a list of all IDs for a namespace; the examples here are fine.
Note to self: there are examples with a mix of letter case in the ID itself: FB (FlyBase) and WormBase. See Gaurav's list above.
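A rough sketch of how the namespace review could be automated, using example CURIEs from the table above. The function name `classify_id_case` and its labels are hypothetical, not part of any Translator tool, and it assumes the prefix is everything before the first colon. It only looks at the letters in the local ID, so it flags the FB and WormBase examples as mixed-case and the EC example as containing a lower-case letter.

```python
def classify_id_case(curie: str) -> str:
    """Classify the letter case of a CURIE's local ID (hypothetical helper).

    Assumes the prefix is everything before the FIRST colon, so
    'PANTHER.FAMILY:PTHR24082:SF38' has the local ID 'PTHR24082:SF38'.
    """
    _prefix, _sep, local_id = curie.partition(":")
    letters = [ch for ch in local_id if ch.isalpha()]
    if not letters:
        return "no-letters"
    if all(ch.isupper() for ch in letters):
        return "upper"
    if all(ch.islower() for ch in letters):
        return "lower"
    return "mixed"


# Example CURIEs copied from the NodeNorm table above.
examples = [
    "FB:FBgn0266114",
    "WormBase:WBGene00172747",
    "EC:2.1.1.n11",
    "DRUGBANK:DB16588",
    "PANTHER.FAMILY:PTHR24082:SF38",
]

for curie in examples:
    print(f"{curie}\t{classify_id_case(curie)}")
```

One example per namespace can only hint at the format: a namespace whose single example is all-caps may still contain lower-case IDs elsewhere, so anything this reports as "upper" would still need a check against the resource's own documentation.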
I had a discussion with Sierra Moxon (Translator data-modeling team), on my concerns with Translator standards:
* who's using this namespace / is affected? (not easy to tell when the namespace isn't used for nodes/bioentities)
* getting everyone to agree on what to do / when. Although if the UI has requirements...that helps a lot
* finding these issues in the first place can be tricky (especially Translator-wide)
* standards may not be stable: resource itself can change its ID formats / prefix, or bioregistry can update
Background
During the JQ work, we encountered an issue with an external API (CTD) and the letter-case of the bioentity IDs in its responses (ref: Jackson's comment and my reply):
KEGG.PATHWAY IDs should have lower-case letters (ex: hsa05323), but our api-response-transform module code was transforming the response's ID strings to all-caps.
Issue
We realized that this was a larger issue (ref: my braindump here, a "current situation" post here):