TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Water results in "WATER O 15" (PUBCHEM.COMPOUND:10129877) in NameRes because of a conflation issue #264

Closed: gaurav closed this issue 2 months ago

gaurav commented 2 months ago

This is because NameRes entries are all based on DrugConflated results, and the conflation for water is:

["PUBCHEM.COMPOUND:10129877", "CHEBI:15377", "CHEBI:33813", "RXCUI:150985", "RXCUI:204918", "RXCUI:340584", "RXCUI:379002", "RXCUI:1043588", "RXCUI:1045437", "RXCUI:1045439", "RXCUI:1053147", "RXCUI:1053148", "RXCUI:1053172", "RXCUI:1053173", "RXCUI:1053428", "RXCUI:1053429", "RXCUI:1151100", "RXCUI:1151101", "RXCUI:1161792", "RXCUI:1161794", "RXCUI:1161795", "RXCUI:1180556", "RXCUI:1235498", "RXCUI:1235499", "RXCUI:1235500", "RXCUI:1235501", "RXCUI:1235502", "RXCUI:1235503", "RXCUI:1235504", "RXCUI:1310241", "RXCUI:1314884", "RXCUI:1423320", "RXCUI:1423321", "RXCUI:1424601", "RXCUI:1424602", "RXCUI:1424603", "RXCUI:1424604", "RXCUI:1424605", "RXCUI:1425974", "RXCUI:1425975", "RXCUI:1425976", "RXCUI:1425977", "RXCUI:1425978", "RXCUI:1489375", "RXCUI:1489376", "RXCUI:1489377", "RXCUI:1489378", "RXCUI:1539535", "RXCUI:1549855", "RXCUI:2108561", "RXCUI:2282752", "RXCUI:2282753", "RXCUI:2360606", "RXCUI:2360607", "RXCUI:2360608", "RXCUI:2360609", "RXCUI:2360610", "RXCUI:2601721", "RXCUI:2601722", "UMLS:C0359299", "UMLS:C3857954", "UMLS:C1883551"]

So why is PUBCHEM.COMPOUND:10129877 ("WATER O 15") ranked above CHEBI:15377 ("water")? After we generate the initial conflation, the leading ID is RXCUI:1425974 ("Opticlear"), which is a biolink:Drug. The conflation therefore uses the biolink:Drug prefix order, in which PUBCHEM.COMPOUND is preferred over CHEBI:

INFO:src.createcompendia.drugchemical:Leading ID RXCUI:1425974 normalized to RXCUI:1425974 (type biolink:Drug) with prefixes: ['ncats.drug', 'RXCUI', 'NDC', 'UMLS', 'PUBCHEM.COMPOUND', 'CHEMBL.COMPOUND', 'UNII', 'CHEBI', 'MESH', 'CAS', 'GTOPDB', 'HMDB', 'KEGG', 'KEGG.COMPOUND', 'ChemBank', 'PUBCHEM.SUBSTANCE', 'INCHI', 'INCHIKEY', 'KEGG.GLYCAN', 'KEGG.ENVIRON', 'SIDER.DRUG', 'BIGG.METABOLITE', 'foodb.compound']
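To make the effect of that prefix order concrete, here is a minimal sketch that ranks the two water identifiers by their position in the biolink:Drug prefix list from the log line above. The `prefix_rank` helper is hypothetical and is not Babel's actual code; it only illustrates why PUBCHEM.COMPOUND wins over CHEBI once the type is biolink:Drug.

```python
# Prefix order copied from the log line above (type biolink:Drug).
DRUG_PREFIXES = [
    'ncats.drug', 'RXCUI', 'NDC', 'UMLS', 'PUBCHEM.COMPOUND',
    'CHEMBL.COMPOUND', 'UNII', 'CHEBI', 'MESH', 'CAS', 'GTOPDB', 'HMDB',
    'KEGG', 'KEGG.COMPOUND', 'ChemBank', 'PUBCHEM.SUBSTANCE', 'INCHI',
    'INCHIKEY', 'KEGG.GLYCAN', 'KEGG.ENVIRON', 'SIDER.DRUG',
    'BIGG.METABOLITE', 'foodb.compound',
]


def prefix_rank(curie, prefixes):
    """Hypothetical helper: position of a CURIE's prefix in the list.

    Prefixes that are not in the list sort last (an assumption).
    """
    prefix = curie.split(":", 1)[0]
    return prefixes.index(prefix) if prefix in prefixes else len(prefixes)


# Only the two chemical IDs discussed in this issue.
water_ids = ["CHEBI:15377", "PUBCHEM.COMPOUND:10129877"]
ranked = sorted(water_ids, key=lambda c: prefix_rank(c, DRUG_PREFIXES))
print(ranked)  # ['PUBCHEM.COMPOUND:10129877', 'CHEBI:15377']
```

This compares only the two chemical identifiers; how the RXCUI and UMLS entries are ordered in the real conflation is outside what this sketch models.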

So, options for fixing this:

  1. Conflate fewer things together, so things like Opticlear won't get conflated with water. But this will be much trickier to implement.
  2. We could try determining the type not from a single, arbitrarily chosen ID but by some sort of consensus calculation. Given all the RXCUIs, though, I suspect everything will end up as a biolink:Drug.
  3. I don't think we ever want RXCUIs to affect the type calculation. So we could filter them all out (along with all the biolink:ChemicalEntity CURIEs), then base the type on a consensus of the other IDs.
  4. ???
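Option 3 could be sketched roughly as follows: drop every RXCUI and every biolink:ChemicalEntity identifier, then take the most common type among what remains. The function name, the `(curie, type)` input shape, the fallback type, and the tie-breaking rule are all assumptions made for illustration. In the sample data, only the type of RXCUI:1425974 comes from this issue; the other types are assumed.

```python
from collections import Counter

EXCLUDED_PREFIXES = {"RXCUI"}
EXCLUDED_TYPES = {"biolink:ChemicalEntity"}


def consensus_type(typed_ids, fallback="biolink:ChemicalEntity"):
    """Hypothetical sketch: majority type after filtering.

    typed_ids is an iterable of (curie, biolink_type) pairs.
    Ties go to the type seen first; the fallback is an assumption.
    """
    votes = Counter(
        biolink_type
        for curie, biolink_type in typed_ids
        if curie.split(":", 1)[0] not in EXCLUDED_PREFIXES
        and biolink_type not in EXCLUDED_TYPES
    )
    if not votes:
        return fallback
    return votes.most_common(1)[0][0]


# Illustrative input; types other than RXCUI:1425974's are assumed.
typed_ids = [
    ("RXCUI:1425974", "biolink:Drug"),
    ("RXCUI:150985", "biolink:Drug"),
    ("CHEBI:15377", "biolink:SmallMolecule"),
    ("PUBCHEM.COMPOUND:10129877", "biolink:SmallMolecule"),
    ("UMLS:C0359299", "biolink:ChemicalEntity"),
]
print(consensus_type(typed_ids))  # biolink:SmallMolecule
```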

cbizon commented 2 months ago

I think 3 is fine. But I wonder if we can handle this at a per-clique level. We're merging a series of cliques, and each clique has a type. Can we have a preferred series of types and then just choose our favorite type from across the cliques? Drug at bottom, small molecule at top?

gaurav commented 2 months ago

Discussion result:

  1. Determine conflation type based on a preferred list of types: SmallMolecule is most preferred, ChemicalEntity is least preferred, Drug is somewhere near the bottom.
  2. Another potential approach would be to pick the clique leader using the number of identifiers in the clique, BUT this could push us towards lots of Drugs (if a single drug has a ton of formulations, say), so we should document this as a potential solution but try the conflation type approach first.
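The first approach could look something like the sketch below: each clique being merged contributes its type, and the conflation takes whichever type ranks highest in a preferred list. The list here contains only the three types named in the discussion; where other Biolink types would sit is not decided in this thread, and the function name and the handling of unlisted types are assumptions.

```python
# Most preferred first. Only the three types named in the discussion.
TYPE_PREFERENCE = [
    "biolink:SmallMolecule",
    "biolink:Drug",
    "biolink:ChemicalEntity",
]


def pick_conflation_type(clique_types, preference=TYPE_PREFERENCE):
    """Hypothetical sketch: choose the most preferred type among cliques.

    Types missing from the preference list rank last (an assumption).
    Raises ValueError if clique_types is empty.
    """
    def rank(biolink_type):
        if biolink_type in preference:
            return preference.index(biolink_type)
        return len(preference)

    return min(clique_types, key=rank)


# One SmallMolecule clique is enough to outrank any number of Drug cliques.
clique_types = ["biolink:Drug", "biolink:SmallMolecule", "biolink:Drug"]
print(pick_conflation_type(clique_types))  # biolink:SmallMolecule
```

Unlike a majority vote, this choice does not depend on how many RXCUI cliques are present, which is the concern raised about the identifier-count approach.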