TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

After restoring from backup, the order of preferred CURIEs after drug conflation is incorrect #220

Closed by gaurav 3 months ago

gaurav commented 9 months ago

Compare https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=MESH%3AD014867&conflate=true&drug_chemical_conflate=true&description=false with https://nodenormalization-dev.apps.renci.org/1.4/get_normalized_nodes?curie=MESH%3AD014867&conflate=true&drug_chemical_conflate=true&description=false -- to further confuse matters, the actual line from the conflation file is:

["PUBCHEM.COMPOUND:10129877", "PUBCHEM.COMPOUND:105142", "PUBCHEM.COMPOUND:962", "RXCUI:1043588", "RXCUI:1045437", "RXCUI:1045439", "RXCUI:1053147", "RXCUI:1053148", "RXCUI:1053172", "RXCUI:1053173", "RXCUI:1053428", "RXCUI:1053429", "RXCUI:1053489", "RXCUI:1053490", "RXCUI:1151100", "RXCUI:1151101", "RXCUI:1161792", "RXCUI:1161794", "RXCUI:1161795", "RXCUI:1180556", "RXCUI:1235498", "RXCUI:1235499", "RXCUI:1235500", "RXCUI:1235501", "RXCUI:1235502", "RXCUI:1235503", "RXCUI:1235504", "RXCUI:1310241", "RXCUI:1314884", "RXCUI:1423320", "RXCUI:1423321", "RXCUI:1425974", "RXCUI:1425975", "RXCUI:1425976", "RXCUI:1425977", "RXCUI:1425978", "RXCUI:1489375", "RXCUI:1489376", "RXCUI:1489377", "RXCUI:1489378", "RXCUI:150985", "RXCUI:1539535", "RXCUI:1549855", "RXCUI:204918", "RXCUI:2108561", "RXCUI:2360606", "RXCUI:2360607", "RXCUI:2360608", "RXCUI:2360609", "RXCUI:2360610", "RXCUI:2601721", "RXCUI:2601722", "RXCUI:340584", "RXCUI:379002", "UMLS:C0359299", "UMLS:C1883551", "UMLS:C3857954"]

So really PUBCHEM.COMPOUND:10129877 "Water-O-15" should be the preferred ID!

This might be because:

gaurav commented 9 months ago

Ah, wait, I was looking at an older conflation file -- the latest conflation file does implement CURIE suffix sorting, so PUBCHEM.COMPOUND:962 should be the correct preferred ID for water:

["PUBCHEM.COMPOUND:962", "PUBCHEM.COMPOUND:105142", "PUBCHEM.COMPOUND:10129877", "RXCUI:150985", "RXCUI:204918", "RXCUI:340584", "RXCUI:379002", "RXCUI:1043588", "RXCUI:1045437", "RXCUI:1045439", "RXCUI:1053147", "RXCUI:1053148", "RXCUI:1053172", "RXCUI:1053173", "RXCUI:1053428", "RXCUI:1053429", "RXCUI:1053489", "RXCUI:1053490", "RXCUI:1151100", "RXCUI:1151101", "RXCUI:1161792", "RXCUI:1161794", "RXCUI:1161795", "RXCUI:1180556", "RXCUI:1235498", "RXCUI:1235499", "RXCUI:1235500", "RXCUI:1235501", "RXCUI:1235502", "RXCUI:1235503", "RXCUI:1235504", "RXCUI:1310241", "RXCUI:1314884", "RXCUI:1423320", "RXCUI:1423321", "RXCUI:1424601", "RXCUI:1424602", "RXCUI:1424603", "RXCUI:1424604", "RXCUI:1424605", "RXCUI:1425974", "RXCUI:1425975", "RXCUI:1425976", "RXCUI:1425977", "RXCUI:1425978", "RXCUI:1489375", "RXCUI:1489376", "RXCUI:1489377", "RXCUI:1489378", "RXCUI:1539535", "RXCUI:1549855", "RXCUI:2108561", "RXCUI:2360606", "RXCUI:2360607", "RXCUI:2360608", "RXCUI:2360609", "RXCUI:2360610", "RXCUI:2601721", "RXCUI:2601722", "UMLS:C0359299", "UMLS:C1883551", "UMLS:C3857954"]
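For anyone comparing the two conflation lines above, here is a minimal sketch of what "CURIE suffix sorting" appears to do, judging only from the ordering in the newer file: prefixes keep their relative order, and within a prefix the identifiers are sorted by numeric suffix rather than as strings. The function name `sort_by_curie_suffix` and the fallback for non-numeric suffixes are my own assumptions, not the actual conflation code.

```python
def sort_by_curie_suffix(curies):
    """Sort CURIEs by suffix within each prefix (hypothetical sketch).

    Assumptions: prefixes keep the order in which they first appear in the
    input, numeric suffixes sort as integers, and non-numeric suffixes
    (e.g. UMLS "C0359299") sort as strings after any numeric ones.
    """
    # Rank each prefix by first appearance so prefix order is preserved.
    prefix_rank = {}
    for curie in curies:
        prefix = curie.split(":", 1)[0]
        prefix_rank.setdefault(prefix, len(prefix_rank))

    def key(curie):
        prefix, _, suffix = curie.partition(":")
        if suffix.isdigit():
            # Numeric suffix: compare as an integer, so 962 < 105142.
            return (prefix_rank[prefix], 0, int(suffix), "")
        # Non-numeric suffix: fall back to plain string comparison.
        return (prefix_rank[prefix], 1, 0, suffix)

    return sorted(curies, key=key)


old_order = [
    "PUBCHEM.COMPOUND:10129877",
    "PUBCHEM.COMPOUND:105142",
    "PUBCHEM.COMPOUND:962",
    "RXCUI:1043588",
    "RXCUI:150985",
]
new_order = sort_by_curie_suffix(old_order)
print(new_order[0])
```

With plain string sorting, "10129877" sorts before "962", which matches the older file; sorting the suffix as an integer puts PUBCHEM.COMPOUND:962 first, which matches the newer one.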

I still don't know why it's returning RXCUI:1161795 as the preferred ID, though.

gaurav commented 9 months ago

I was able to fix this by reloading the database, so yes, it appears to be the copying process that is at fault. Chris tells me that the conflation code uses the order of the results in Redis, so it may be that the copying code in https://github.com/helxplatform/translator-devops/pull/768 isn't preserving that order for some reason. One possibility is that we're getting back Redis protocol commands in an unusual order (see https://github.com/sripathikrishnan/redis-rdb-tools#emitting-redis-protocol for more information).
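Since the preferred ID depends on the order of the stored identifiers, one way to test this theory after a restore would be to compare the line from the conflation file against what the restored database returns, position by position. The sketch below assumes you already have both as Python lists; how the restored list is read out of Redis is left open, and `first_order_mismatch` is a hypothetical helper, not part of NodeNormalization.

```python
def first_order_mismatch(expected, actual):
    """Return (index, expected_item, actual_item) for the first position
    where the two lists differ, or None if they are identical.

    A missing item (one list is shorter) is reported as None.
    """
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index, want, got
    if len(expected) != len(actual):
        # One list is a strict prefix of the other.
        index = min(len(expected), len(actual))
        want = expected[index] if index < len(expected) else None
        got = actual[index] if index < len(actual) else None
        return index, want, got
    return None


# Illustrative data only: same members, different order.
from_file = ["PUBCHEM.COMPOUND:962", "PUBCHEM.COMPOUND:105142", "RXCUI:1161795"]
from_restore = ["RXCUI:1161795", "PUBCHEM.COMPOUND:962", "PUBCHEM.COMPOUND:105142"]

print(first_order_mismatch(from_file, from_restore))
print(sorted(from_file) == sorted(from_restore))
```

If the two lists have the same members but `first_order_mismatch` reports a difference, the restore changed the order rather than the contents, which would be consistent with the copying step being at fault.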

gaurav commented 9 months ago

Question to myself: is it true that GeneProtein (which is structured identically to ChemicalDrug) was restored without any problem, or is the order also broken for that file?

gaurav commented 3 months ago

After several restores, we haven't seen this problem return, so it does appear to have been caused by using an incorrect input file. I'm going to go ahead and close this, but will reopen it if the problem recurs.