Handling letter-case in IDs in a consistent, easy-to-maintain way

colleenXu commented 1 year ago

Background

During the JQ work, we encountered an issue with an external API (CTD) and the letter-case of the bioentity IDs in its responses (ref: Jackson's comment and my reply):

- KEGG.PATHWAY IDs should have lower-case letters (ex: hsa05323), but our api-response-transform module code was transforming the response's ID strings to all-caps.
- Other namespaces' IDs should have upper-case letters, but all bioentity IDs in CTD responses had lower-case letters. That's why the all-caps transformation was useful in those other cases...
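
To make the conflict concrete, here is a minimal Python sketch of a blanket all-caps transform. It is not taken from the api-response-transform module: `blanket_upper` is a hypothetical helper, and the lower-cased MESH value is only an illustrative stand-in for the kind of ID a CTD response might contain.

```python
def blanket_upper(curie: str) -> str:
    """Upper-case the local ID of a CURIE, leaving the prefix untouched (hypothetical helper)."""
    prefix, local_id = curie.split(":", 1)
    return f"{prefix}:{local_id.upper()}"


# Illustrative lower-cased value: the all-caps transform repairs it.
print(blanket_upper("MESH:c000609086"))        # MESH:C000609086

# KEGG.PATHWAY IDs are meant to stay lower-case: the same transform breaks them.
print(blanket_upper("KEGG.PATHWAY:hsa05323"))  # KEGG.PATHWAY:HSA05323
```

The same line of code is a fix for one namespace and a bug for another, which is why the case rule has to depend on the namespace.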

Issue

We realized that this was a larger issue (ref: my braindump here, a "current situation" post here):

- What should the identifier-case format be for each of the namespaces we use?
- How do we handle these identifier-case formats in a consistent way, in one place that's easy to update and maintain? Right now, the handling may be hidden in some api-response-transform modules.
  - We should be able to simplify: it seems like most ID namespaces with letters in their identifiers use all-caps, with a few exceptions that use all-lowercase. But we'd need to check first (see the first main point).
  - We'd want to confirm: what code is related to identifier-case format? In other words, what would we refactor, and where is this info used?
    - definitely the api-response-transform module, in basically every transformer
    - the rest of BTE seems to be working fine, so maybe other modules aren't affected?
    - maybe this info could be useful for building sub-queries to APIs (templating in SmartAPI yaml request info)?
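
One possible shape for the "one place that's easy to update" idea, as a rough Python sketch rather than a proposal for the actual BTE code: a default rule plus a short table of exceptions. The names `IdCase`, `DEFAULT_CASE`, `CASE_EXCEPTIONS`, and `normalize_curie` are all hypothetical, and the all-caps default is the unverified assumption from the point above. Only the KEGG.PATHWAY entry comes from this issue; a real table would come out of the namespace review.

```python
from enum import Enum


class IdCase(Enum):
    UPPER = "upper"        # force all-caps
    LOWER = "lower"        # force all-lowercase
    PRESERVE = "preserve"  # leave as-is (for any namespace found to mix cases)


# Assumption: namespaces not listed below use all-caps local IDs.
DEFAULT_CASE = IdCase.UPPER

# Exceptions only, so the table stays short and easy to review.
CASE_EXCEPTIONS = {
    "KEGG.PATHWAY": IdCase.LOWER,  # ex: hsa05323
}


def normalize_curie(curie: str) -> str:
    """Apply the namespace's case rule to the local ID; the prefix is never changed."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # no colon, so not a CURIE: leave untouched
    rule = CASE_EXCEPTIONS.get(prefix, DEFAULT_CASE)
    if rule is IdCase.UPPER:
        local_id = local_id.upper()
    elif rule is IdCase.LOWER:
        local_id = local_id.lower()
    return f"{prefix}{sep}{local_id}"


print(normalize_curie("KEGG.PATHWAY:HSA05323"))  # KEGG.PATHWAY:hsa05323
print(normalize_curie("MESH:c000609086"))        # MESH:C000609086
```

With something like this, each transformer would call one shared function instead of carrying its own case logic, and changing a namespace's rule would be a one-line edit to the table.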

colleenXu commented 1 year ago

The first step is figuring out the identifier-case formats for each of the namespaces we use. I'm likely the lead on this.

It'll take a bit of time, but it shouldn't be too difficult, to review the namespaces we use and find those with lower-case letters in their IDs (aka the exceptions). (It's unnecessary to list all the namespaces we use, which of those have IDs with letters, and which of those are all-caps...)

But there's an issue that I think needs untangling: I'm not clear on Translator's standards for identifier-case format. It's not clear from a quick look at the biolink-model repo...

- I get the sense that we're supposed to follow the bioregistry for ID-value format...
  - but it can be confusing to find the ID-value format by namespace-name/prefix, because the biolink-model often uses namespace-names/prefixes that differ from bioregistry's
  - an example of this confusion is https://github.com/biolink/biolink-model/issues/1366 (do the ID-values start with PMC or not?)
- We are also following the identifier-case format in the NodeNorm responses...but that doesn't help us when we're using ID namespaces that aren't covered by NodeNorm...
- We're probably also following some historic standards (from back when we used our own id-resolver setup / based on the core BioThings APIs / based on how the resource provides their IDs)
- I don't know how stable these standards are. That's a reason to make whatever we code easy to update / maintain...

gaurav commented 1 year ago

FWIW, NodeNorm doesn't expect identifiers to be purely numerical. Here is the current distribution of CURIE prefixes with non-numerical identifiers in NodeNorm:

Count	Prefix	Example
4	KEGG.REACTION	KEGG.REACTION:R06368
6	KEGG.DISEASE	KEGG.DISEASE:H00484
22	TCDB	TCDB:3.A.1.105.1
175	PANTHER.PATHWAY	PANTHER.PATHWAY:P06210
222	EC	EC:2.1.1.n11
271	OMIM	OMIM:PS278300
1244	ComplexPortal	ComplexPortal:CPX-1047
2438	ICD10	ICD10:H44.44
7153	SGD	SGD:S000028545
13892	dictyBase	dictyBase:DDB_G0289229
14544	DRUGBANK	DRUGBANK:DB16588
19277	KEGG.COMPOUND	KEGG.COMPOUND:C18469
26128	PANTHER.FAMILY	PANTHER.FAMILY:PTHR24082:SF38
30248	SMPDB	SMPDB:SMP0052819
30293	FB	FB:FBgn0266114
38059	ZFIN	ZFIN:ZDB-LINCRNAG-131127-634
48781	WormBase	WormBase:WBGene00172747
50770	NCIT	NCIT:C123904
109255	REACT	REACT:R-HSA-5617833.2
119457	UNII	UNII:GC91Z6YS52
171624	PR	PR:Q8TF17-5
217920	HMDB	HMDB:HMDB0014933
341463	MESH	MESH:C000609086
2399666	CHEMBL.COMPOUND	CHEMBL.COMPOUND:CHEMBL4247838
3195515	UMLS	UMLS:C2334787
32043961	ENSEMBL	ENSEMBL:ENSCURG00000012216
110286527	INCHIKEY	INCHIKEY:WCVKZUDHHMZXCB-UHFFFAOYSA-N
248913563	UniProtKB	UniProtKB:A0A8H5S4H9

If the list of all those identifiers will be useful to you (it's 2.9G compressed), please let me know and I can send it over!

colleenXu commented 1 year ago

@gaurav

This table of namespaces + example IDs is great. We don't need a list of all IDs for a namespace; the examples here are fine.

colleenXu commented 1 year ago

Note to self: there are namespaces where the ID itself mixes letter-cases, such as FB (FlyBase) and WormBase. See Gaurav's list above.
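
A quick way to spot these is to classify the letters in each example's local ID. The following is a small Python sketch with a hypothetical `letter_case` helper, run on a few of the examples from Gaurav's table. A single example ID doesn't prove anything about a whole namespace, so this only flags candidates to check.

```python
def letter_case(curie: str) -> str:
    """Classify the letters in a CURIE's local ID as 'upper', 'lower', 'mixed', or 'none'."""
    _prefix, _sep, local_id = curie.partition(":")  # split on the first colon only
    letters = [ch for ch in local_id if ch.isalpha()]
    if not letters:
        return "none"
    if all(ch.isupper() for ch in letters):
        return "upper"
    if all(ch.islower() for ch in letters):
        return "lower"
    return "mixed"


# Example CURIEs copied from the table above.
examples = [
    "FB:FBgn0266114",
    "WormBase:WBGene00172747",
    "EC:2.1.1.n11",
    "REACT:R-HSA-5617833.2",
    "PANTHER.FAMILY:PTHR24082:SF38",
]
for curie in examples:
    print(f"{letter_case(curie):5}  {curie}")
```

On these examples, FB and WormBase come out as mixed, and the EC example comes out as lower because of the "n" in n11. So the set of non-all-caps exceptions is probably a bit larger than KEGG.PATHWAY.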

colleenXu commented 1 year ago

I had a discussion with Sierra Moxon (Translator data-modeling team) about my concerns with Translator standards:

- Short answer: use bioregistry as the authoritative source for "format of IDs" (Pattern for Local Unique Identifiers, Example Local Unique Identifier).
  - I think it can still be useful to see how the "original namespace / resource" formats their IDs.
- The wider issue can be described as "what is the regex pattern for local unique identifiers?"
- If I notice issues with this method, I should bring them up with data-modeling.
  - It may be a one-off, but I brought one up already: "PMC/PMCID aren't identical in biolink-model but they are in bioregistry. which makes it unclear whether these IDs in translator should start with 'PMC' or not"
- Prefixes for namespaces are a separate issue. Generally, biolink-model is the authoritative resource for the prefix-format. If a prefix isn't there, ask their team (note that they may defer to bioregistry's format).
  - Ontologies tend to have all-caps prefixes; non-ontologies (structured vocabs, resource-specific namespaces) don't.
However, it's unclear how to resolve the issues/discrepancies (Translator-wide):

* who's using this namespace / is affected? (not easy to tell when the namespace isn't used for nodes/bioentities)
* getting everyone to agree on what to do / when. Although if the UI has requirements...that helps a lot
* finding these issues in the first place can be tricky (especially Translator-wide)
* standards may not be stable: the resource itself can change its ID formats / prefix, or bioregistry can update
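
If the question is framed as "what is the regex pattern for local unique identifiers", the case fix could be driven by the pattern itself rather than by a hand-written upper/lower rule. Here is a minimal Python sketch of that idea, with loud caveats: the two patterns below are HYPOTHETICAL placeholders written for illustration, not copied from bioregistry, and `PATTERNS` and `fix_case` are made-up names.

```python
import re
from typing import Optional

# HYPOTHETICAL patterns for illustration only; real ones would be taken from
# bioregistry's "Pattern for Local Unique Identifiers" for each namespace.
PATTERNS = {
    "KEGG.PATHWAY": re.compile(r"[a-z]+\d+"),
    "MESH": re.compile(r"[A-Z]\d+"),
}


def fix_case(prefix: str, local_id: str) -> Optional[str]:
    """Return the first of (as-is, upper, lower) that fully matches the namespace's
    pattern, or None when no pattern is known or no case variant matches."""
    pattern = PATTERNS.get(prefix)
    if pattern is None:
        return None
    for candidate in (local_id, local_id.upper(), local_id.lower()):
        if pattern.fullmatch(candidate):
            return candidate
    return None


print(fix_case("KEGG.PATHWAY", "HSA05323"))  # hsa05323
print(fix_case("MESH", "c000609086"))        # C000609086
print(fix_case("FB", "FBgn0266114"))         # None (no pattern in this sketch)
```

This only works for IDs that are entirely one case. Mixed-case IDs (like the FB example) can't be recovered once the case has been lost, so for those namespaces the only option is to leave the value alone. The `None` return makes "we don't know" explicit, so callers can decide whether to keep the original value or flag it.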

biothings / biothings_explorer

Handling letter-case in IDs in a consistent, easy-to-maintain way #735
