TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Add HMDB secondary accessions to cliques? #177

Open gaurav opened 1 year ago

gaurav commented 1 year ago

@EvanDietzMorris reported that a large number of HMDB identifiers are missing from cliques. I'll add the full list here for later testing.

Full list of reported missing HMDB identifiers

```
HMDB:HMDB05323
HMDB:HMDB09069
HMDB:HMDB04950
HMDB:HMDB11376
HMDB:HMDB11151
HMDB:HMDB11342
HMDB:HMDB07984
HMDB:HMDB08156
HMDB:HMDB11220
HMDB:HMDB08054
HMDB:HMDB34220
HMDB:HMDB05779
HMDB:HMDB00560
HMDB:HMDB59745
HMDB:HMDB09809
HMDB:HMDB07949
HMDB:HMDB07874
HMDB:HMDB07988
HMDB:HMDB00072
HMDB:HMDB09002
HMDB:HMDB11507
HMDB:HMDB07218
HMDB:HMDB13127
HMDB:HMDB11206
HMDB:HMDB07121
HMDB:HMDB07958
HMDB:HMDB11129
HMDB:HMDB07892
HMDB:HMDB10381
HMDB:HMDB09815
HMDB:HMDB09784
HMDB:HMDB61699
HMDB:HMDB11517
HMDB:HMDB05324
HMDB:HMDB61695
HMDB:HMDB11203
HMDB:HMDB13288
HMDB:HMDB07248
HMDB:HMDB10383
HMDB:HMDB07228
HMDB:HMDB11253
HMDB:HMDB11565
HMDB:HMDB08039
HMDB:HMDB10405
HMDB:HMDB10392
HMDB:HMDB09821
HMDB:HMDB07940
HMDB:HMDB11262
HMDB:HMDB07883
HMDB:HMDB11489
HMDB:HMDB00477
HMDB:HMDB11343
HMDB:HMDB05780
HMDB:HMDB12091
HMDB:HMDB08008
HMDB:HMDB01878
HMDB:HMDB10404
HMDB:HMDB12103
HMDB:HMDB11503
HMDB:HMDB10391
HMDB:HMDB10407
HMDB:HMDB07970
HMDB:HMDB09009
HMDB:HMDB10394
HMDB:HMDB09783
HMDB:HMDB11496
HMDB:HMDB08057
HMDB:HMDB07112
HMDB:HMDB07132
HMDB:HMDB07103
HMDB:HMDB05349
HMDB:HMDB10388
HMDB:HMDB61702
HMDB:HMDB08141
HMDB:HMDB00488
HMDB:HMDB03148
HMDB:HMDB13405
HMDB:HMDB08055
HMDB:HMDB08994
HMDB:HMDB07250
HMDB:HMDB06469
HMDB:HMDB11207
HMDB:HMDB05334
HMDB:HMDB11149
HMDB:HMDB11170
HMDB:HMDB12085
HMDB:HMDB06210
HMDB:HMDB11477
HMDB:HMDB08111
HMDB:HMDB11394
HMDB:HMDB11474
HMDB:HMDB08123
HMDB:HMDB08047
HMDB:HMDB00613
HMDB:HMDB07257
HMDB:HMDB11211
HMDB:HMDB11384
HMDB:HMDB10379
HMDB:HMDB12087
HMDB:HMDB13122
HMDB:HMDB11375
HMDB:HMDB12104
HMDB:HMDB12102
HMDB:HMDB07856
HMDB:HMDB12107
HMDB:HMDB11244
HMDB:HMDB08939
HMDB:HMDB11243
HMDB:HMDB09093
HMDB:HMDB11352
HMDB:HMDB08056
HMDB:HMDB08113
HMDB:HMDB06460
HMDB:HMDB61690
HMDB:HMDB08017
HMDB:HMDB07969
HMDB:HMDB10408
HMDB:HMDB09814
HMDB:HMDB06347
HMDB:HMDB61701
HMDB:HMDB07219
HMDB:HMDB11460
HMDB:HMDB12097
HMDB:HMDB12105
HMDB:HMDB13326
HMDB:HMDB11686
HMDB:HMDB41708
HMDB:HMDB61696
HMDB:HMDB09789
HMDB:HMDB08038
HMDB:HMDB11487
HMDB:HMDB08993
HMDB:HMDB08279
HMDB:HMDB11476
```

I checked four of those identifiers, and all of them are listed as "secondary accessions". I then checked the corresponding primary accessions against NodeNorm RENCI-exp, and all four are present. Here are the primary accessions, which at present are the only ones we ingest:

| Primary accession | Secondary accession |
| --- | --- |
| HMDB:HMDB0008937 | HMDB:HMDB05323 |
| HMDB:HMDB0008057 | HMDB:HMDB08057 |
| HMDB:HMDB0240261 | HMDB:HMDB61696 |
| HMDB:HMDB0061701 | HMDB:HMDB61701 |

Our HMDB extraction code currently only uses primary identifiers:

https://github.com/TranslatorSRI/Babel/blob/011e36c7e096938fb3981b651904bc7c11a28b92/src/datahandlers/hmdb.py#L14

Should we modify our code to add secondary accessions as well? Unless the hmdb_metabolites.xml file that we used reuses secondary accessions, this should be safe and not cause any cliquing problems. We could also do a one-off experiment to confirm that secondary accessions are in fact uniquely mapped to primary accessions.
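The one-off experiment mentioned above could look roughly like the following. This is a minimal sketch, not Babel's actual extraction code. It assumes that `hmdb_metabolites.xml` uses the `http://www.hmdb.ca` namespace, that each `<metabolite>` has a direct `<accession>` child for the primary, and that the secondaries sit under `<secondary_accessions>`. The function names and the toy `SAMPLE` document (built from two rows of the table above) are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed XML namespace of hmdb_metabolites.xml; adjust if the file differs.
HMDB_NS = "{http://www.hmdb.ca}"


def secondary_to_primaries(xml_source):
    """Map each secondary accession to the set of primaries that list it.

    xml_source may be a filename or a binary file-like object.
    """
    mapping = defaultdict(set)
    # Stream the file so the full HMDB dump never has to fit in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != f"{HMDB_NS}metabolite":
            continue
        # The direct <accession> child is taken to be the primary accession.
        primary = elem.findtext(f"{HMDB_NS}accession")
        path = f"{HMDB_NS}secondary_accessions/{HMDB_NS}accession"
        for sec in elem.findall(path):
            if primary and sec.text:
                mapping[sec.text.strip()].add(primary.strip())
        elem.clear()  # drop the processed metabolite's children
    return mapping


def reused_secondaries(mapping):
    """Return only the secondaries claimed by more than one primary."""
    return {sec: prims for sec, prims in mapping.items() if len(prims) > 1}


# Toy input for illustration only; not real HMDB file content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0008937</accession>
    <secondary_accessions>
      <accession>HMDB05323</accession>
    </secondary_accessions>
  </metabolite>
  <metabolite>
    <accession>HMDB0008057</accession>
    <secondary_accessions>
      <accession>HMDB08057</accession>
    </secondary_accessions>
  </metabolite>
</hmdb>"""
```

Calling `reused_secondaries(secondary_to_primaries("hmdb_metabolites.xml"))` on the real file and getting an empty dict back would suggest that adding secondary accessions cannot merge two cliques. A non-empty result would list exactly the identifiers that need a closer look.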

gglusman commented 1 year ago

For context, these come from our Multiomics Wellness KG. Years ago, HMDB changed its identifier format from HMDB followed by 5 digits to HMDB followed by 7 digits. Every existing identifier had 00 prepended to its number to create the new primary, and the old 5-digit form was kept as a secondary accession. So there should be no identifier reuse.

I modified my code to fix the old HMDB identifiers for the next version of the KG, but would still recommend including the old identifiers in babel! They are in very widespread use.
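For anyone who wants to apply the same fix on the client side, the format change described above can be sketched as below. This is an illustrative guess at such a fix, not the code used for the KG, and `pad_old_hmdb_id` is a hypothetical name. The padded identifier is not guaranteed to be the current primary: in the table earlier in this issue, `HMDB:HMDB05323` is listed under primary `HMDB:HMDB0008937`, not under its padded form.

```python
import re

# Old-style accession: "HMDB" plus exactly 5 digits, with an optional
# "HMDB:" CURIE prefix in front.
_OLD_HMDB = re.compile(r"^(HMDB:)?HMDB(\d{5})$")


def pad_old_hmdb_id(identifier):
    """Rewrite a 5-digit HMDB accession to the 7-digit form by adding '00'.

    Identifiers that do not match the old format are returned unchanged.
    """
    match = _OLD_HMDB.match(identifier)
    if match is None:
        return identifier
    prefix = match.group(1) or ""
    return f"{prefix}HMDB00{match.group(2)}"
```

For example, `pad_old_hmdb_id("HMDB:HMDB08057")` gives `"HMDB:HMDB0008057"`, which matches the second row of the table, while an identifier already in 7-digit form passes through untouched.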