NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Duplicated results #953

Open khanspers opened 1 month ago

khanspers commented 1 month ago

I'm seeing something similar to what was reported here with some results being reported twice, with different identifiers:

[Screenshot of the duplicated results, captured 2024-09-20 at 5:07:30 PM]

The query is "What genes' activity may be decreased by Imatinib Mesylate": https://ui.test.transltr.io/results?l=Imatinib%20Mesylate&i=CHEBI:31690&t=4&r=0&q=c1469eda-0209-4d12-98ad-952498685545

sstemann commented 1 month ago

@gaurav could you please take a look?

gaurav commented 1 month ago

Thanks for poking me about this one, Sarah!

C-KIT Gene

UMLS:C0920288 "C-KIT Gene" does look like it should be combined with NCBIGene:3815 "KIT". There are two ways of connecting these two concepts via UMLS: one through NCI mappings and one through LOINC mappings.

We don't currently ingest NCI or LOINC, so that's probably why these are missing. I've opened an issue to look into whether we should include NCIT mappings for genes (https://github.com/TranslatorSRI/Babel/issues/350). We already have a ticket to ingest LOINC mappings (https://github.com/TranslatorSRI/Babel/issues/295), but even if we were to do that, I'm not sure we would include gene mappings from there. I don't think there's a quick fix here apart from ingesting those mappings.

Note that there is one UMLS identifier already associated with that gene in NodeNorm Guppy: UMLS:C1416655 "KIT gene". So we could also think of this as an error on UMLS' part for not combining UMLS:C1416655 and UMLS:C0920288 into a single concept. I've sent a message to their helpdesk to see if they agree.

FWIW both UMLS IDs are coming back from SemMedDB, but I don't think there's any clever way of catching the duplication there.
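To make the C-KIT split concrete, here is a minimal sketch that groups result CURIEs by the preferred identifier they normalize to. It runs offline on a hand-written `SAMPLE` dict whose shape (`{curie: {"id": {"identifier": ..., "label": ...}}}`) is my assumption about what a NodeNorm-style lookup returns; it is not actual NodeNorm output, and `group_by_preferred` is a hypothetical helper, not part of NodeNorm or Babel.

```python
from collections import defaultdict


def group_by_preferred(normalized):
    """Group input CURIEs by the preferred identifier they normalize to.

    `normalized` maps each input CURIE to an entry with an
    ["id"]["identifier"] field (assumed shape), or to None when the
    CURIE is unknown. Unknown CURIEs are kept as their own group.
    """
    groups = defaultdict(list)
    for curie, entry in normalized.items():
        preferred = entry["id"]["identifier"] if entry else curie
        groups[preferred].append(curie)
    return dict(groups)


# Hand-written illustration of the situation described above:
# UMLS:C1416655 is already in the KIT clique, UMLS:C0920288 is not.
SAMPLE = {
    "UMLS:C1416655": {"id": {"identifier": "NCBIGene:3815", "label": "KIT"}},
    "NCBIGene:3815": {"id": {"identifier": "NCBIGene:3815", "label": "KIT"}},
    "UMLS:C0920288": {"id": {"identifier": "UMLS:C0920288", "label": "C-KIT Gene"}},
}

groups = group_by_preferred(SAMPLE)
for preferred, members in groups.items():
    print(preferred, "<-", members)
```

Two groups come out instead of one, which is why the same gene shows up twice in the results. Once UMLS:C0920288 joins the NCBIGene:3815 clique, the same grouping would yield a single entry.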

ABL1 gene/protein

Similar situation: the only thing UMLS:C1439337 "Tyrosine-Protein Kinase ABL1, human" is connected to is NCIT:C17390 "Tyrosine-Protein Kinase ABL1", which is connected to OMIM:189980 and SwissProt P00519, both of which resolve to NCBIGene:25 on NodeNorm Guppy. So including NCIT mappings would fix this as well.
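The chain above can be sketched as a small union-find over cross-references, to show why adding the NCIT mappings would pull the UMLS concept into the ABL1 gene clique. This is only an illustration under stated assumptions: `build_cliques` and `same_clique` are hypothetical helpers, the `UniProtKB:` prefix for P00519 is my choice of CURIE prefix, and this is not a description of how Babel actually builds its cliques.

```python
from collections import defaultdict


def build_cliques(curies, xrefs):
    """Merge CURIEs into cliques by following cross-reference pairs."""
    parent = {c: c for c in curies}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    cliques = defaultdict(set)
    for curie in list(parent):
        cliques[find(curie)].add(curie)
    return list(cliques.values())


def same_clique(cliques, a, b):
    """True if both CURIEs ended up in the same clique."""
    return any(a in c and b in c for c in cliques)


CURIES = ["UMLS:C1439337", "NCBIGene:25"]

# Links already resolved today, per the comment above.
EXISTING = [
    ("OMIM:189980", "NCBIGene:25"),
    ("UniProtKB:P00519", "NCBIGene:25"),
]

# Links that would come from ingesting NCIT mappings.
NCIT = [
    ("UMLS:C1439337", "NCIT:C17390"),
    ("NCIT:C17390", "OMIM:189980"),
    ("NCIT:C17390", "UniProtKB:P00519"),
]

before = build_cliques(CURIES, EXISTING)
after = build_cliques(CURIES, EXISTING + NCIT)
print(same_clique(before, "UMLS:C1439337", "NCBIGene:25"))
print(same_clique(after, "UMLS:C1439337", "NCBIGene:25"))
```

Without the NCIT links the UMLS concept stays in a clique of its own; with them, it lands in the same clique as NCBIGene:25.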

Overall resolution

Depends on adding NCIT mappings for genes to NodeNorm (https://github.com/TranslatorSRI/Babel/issues/350), which I'll try to get into the NodeNorm November release.