RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Some issues found that might be associated with Nodesynonymizer from KG2.6.1 #1425

Closed: chunyuma closed this issue 3 months ago

chunyuma commented 3 years ago

Also, based on KG2.6.1c (http://kg2canonicalized.rtx.ai:7474/browser/), which @amykglen just built, I found two problems that might be associated with Nodesynonymizer:

1) There are 35 duplicated preferred curies that have the same name and the same description in KG2.6.1. Here are their names:

       'Leukemia Virus, Bovine': 1,
       'Coronavirus NL63, Human': 1,
       'Coronavirus, Bovine': 1,
       'Hemorrhagic Disease Virus, Epizootic': 1,
       'Calicivirus, Feline': 1,
       'Adenoviruses, Porcine': 1,
       'Chive': 1,
       'furanyl fentanyl': 1,
       'Peste-des-petits-ruminants virus': 1,
       'Tsetse Flies': 1,
       'Proboscidea Mammal': 1,
       'Conus Snail': 1,
       'DEXOXADROL': 1,
       'BUTYLATED HYDROXYANISOLE': 1,
       'Artiodactyla': 1,
       'Aloysia': 1,
       'Setaria Nematode': 1,
       'METHYLMETHIONINE SULFONIUM CHLORIDE': 1,
       'glucose, glycerol, hydroxyethyl starch, perfluorodecalin, perfluorotripropylamine, pluronic F-68, salts, yolk phospholipids drug combination': 1,
       'Liriope Plant': 1,
       'Lawsonia Plant': 1,
       'DORANIDAZOLE': 1,
       'NAPITANE': 1,
       '1-(3,4-Dihydroxy-5-hydroxymethyl-tetrahydro-furan-2-yl)-5-methyl-1H-pyrimidine-2,4-dione': 1,
       'Hyssopus Plant': 1,
       'bilane analogue': 1,
       '2-Amino-5-[1-(1-carboxy-2-methyl-propylcarbamoyl)-2-mercapto-ethylcarbamoyl]-pentanoic acid': 1,
       '2-(3-Ethyl-8-methoxy-1,2,3,4,6,7,12,12b-octahydro-indolo[2,3-a]quinolizin-2-yl)-3-methoxy-acrylic acid methyl ester': 1,
       '1-(3-Carboxy-propionyl)-pyrrolidine-2-carboxylic acid': 1,
       '4-Hydroxy-3,4a,8-trimethyl-3,3a,4,4a,7a,8,9,9a-octahydro-azuleno[6,5-b]furan-2,5-dione': 1,
       '2,6-Dimethyl-5-nitro-4-(2-trifluoromethyl-phenyl)-1,4-dihydro-pyridine-3-carboxylic acid methyl ester': 1,
       '9-[3-Hydroxy-6-(2-hydroxy-ethylidene)-2-methyl-oxepan-2-yl]-2,6-dimethyl-non-2-en-5-one': 1,
       '9-[3-Hydroxy-6-(2-hydroxy-ethylidene)-2-methyl-oxepan-2-yl]-2,3,6-trimethyl-non-3-en-5-one': 1,
       'Heptadeca-8,10-diene-4,6-diyne-1,12-diol': 1,
       '{1-[1-Benzyl-3-(2-tert-butylcarbamoyl-pyrrolidin-1-yl)-2-hydroxy-3-oxo-propylcarbamoyl]-2-carbamoyl-ethyl}-carbamic acid benzyl ester'

This is one example:

n.id n.name n.description
"UMLS:C0006069" "Leukemia Virus, Bovine" "The type species of DELTARETROVIRUS that causes a form of bovine lymphosarcoma (ENZOOTIC BOVINE LEUKOSIS) or persistent lymphocytosis.; UMLS Semantic Type: UMLS_STY:T005"
"MESH:D001909" "Leukemia Virus, Bovine" "The type species of DELTARETROVIRUS that causes a form of bovine lymphosarcoma (ENZOOTIC BOVINE LEUKOSIS) or persistent lymphocytosis.; UMLS Semantic Type: UMLS_STY:T005"

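A minimal sketch of how duplicates like this could be detected outside Neo4j, assuming the nodes have been exported as dictionaries with `id`, `name`, and `description` keys. The function name `find_duplicate_nodes`, the `EXAMPLE:0001` record, and the shortened descriptions are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def find_duplicate_nodes(nodes):
    """Map each (name, description) pair shared by several nodes to their ids."""
    groups = defaultdict(list)
    for node in nodes:
        name = node.get("name")
        if not name:
            continue  # skip unnamed nodes; they cannot be matched by name
        groups[(name, node.get("description"))].append(node["id"])
    # Keep only the pairs that occur on more than one node.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Illustrative records; the descriptions are shortened placeholders and
# EXAMPLE:0001 is a made-up node, not one from KG2.
sample_nodes = [
    {"id": "UMLS:C0006069", "name": "Leukemia Virus, Bovine", "description": "placeholder description"},
    {"id": "MESH:D001909", "name": "Leukemia Virus, Bovine", "description": "placeholder description"},
    {"id": "EXAMPLE:0001", "name": "Hypothetical Node", "description": "another placeholder"},
]

duplicates = find_duplicate_nodes(sample_nodes)
for (name, _description), ids in duplicates.items():
    print(name, ids)
```

Grouping on the (name, description) pair rather than on the name alone matches the report above, where both fields are identical across the two curies.
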
2) There are 1870 curies in KG2.6.1 that have no name even though their synonyms have names:

match (n) where n.name is NULL and n.all_names is not NULL return distinct n.id, n.name, n.all_names
n.id n.name n.all_names
"CHEMBL.COMPOUND:CHEMBL3302426" null ["(3R,5S)-3,5-dimethyl-1-adamantanamine", "2-[(1-benzylpiperidin-4-yl)methyl]-5,6-dimethoxyindan-1-one", "Aricept", "DONEPEZIL", "DONEPEZIL HYDROCHLORIDE", "Donepezil", "Donepezil Hydrochloride", "Donepezil hydrochloride", "Donepezil-containing product", "MEMANTINE", "MEMANTINE HYDROCHLORIDE", "Memantine", "Memantine-containing product", "Namenda", "Namzaric", "SID11111499", "donepezil", "donepezil and memantine", "donepezil hydrochloride", "donepezil, memantine and Ginkgo folium", "memantine", "memantine hydrochloride"]
"CHEMBL.COMPOUND:CHEMBL4303798" null ["(6R,7R)-3-((1H-imidazo[1,2-b]pyridazin-4-ium-1-yl)methyl)-7-((E)-2-(5-amino-1,2,4-thiadiazol-3-yl)-2-(methoxyimino)acetamido)-8-oxo-5-thia-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylate", "CEFOZOPRAN", "Cefozopran", "cefozopran"]
"CHEMBL.COMPOUND:CHEMBL183041" null ["Aptivus", "TIPRANAVIR", "Tipranavir", "Tipranavir-containing product", "tipranavir"]
"CHEMBL.COMPOUND:CHEMBL383921" null ["(2S)-2-[(S)-(2-ethoxyphenoxy)-phenylmethyl]morpholine", "2-[(2-ethoxyphenoxy)-phenylmethyl]morpholine", "ESREBOXETINE", "Esreboxetine", "REBOXETINE", "REBOXETINE MESYLATE", "Reboxetine", "Reboxetine Mesylate", "Reboxetine mesylate", "Reboxetine-containing product", "esreboxetine", "reboxetine", "reboxetine mesylate"]
"CHEMBL.COMPOUND:CHEMBL7252" null ["4-hydroxy-3-(3-oxo-1-phenylbutyl)-1-benzopyran-2-one", "Coumadin", "Jantoven", "WARFARIN", "WARFARIN POTASSIUM", "WARFARIN SODIUM", "Warfarin", "Warfarin Potassium", "Warfarin Sodium", "Warfarin sodium", "Warfarin-containing product", "warfarin", "warfarin potassium", "warfarin sodium"]
"CHEMBL.COMPOUND:CHEMBL4303794" null ["(2R,3S,4R,5R,6R)-5-amino-2-(aminomethyl)-6-[(1R,3S,4R,6S)-4,6-diamino-2-[[(2S,3R,4S,5R)-3,4-dihydroxy-5-(hydroxymethyl)-2-oxolanyl]oxy]-3-hydroxycyclohexyl]oxyoxane-3,4-diol", "RIBOSTAMYCIN", "RIBOSTAMYCIN SULFATE", "Ribostamycin", "SID144204208", "ribostamycin", "ribostamycin sulfate"]

amykglen commented 3 years ago

re: item 1: hm, yeah, I'm not sure why these nodes aren't merged in KG2c.6.3. I verified they're also separate in the synonymizer, so maybe @edeutsch has some insight (e.g., --lookup UMLS:C0006069 and MESH:D001909 for the 'Leukemia Virus, Bovine' example).

re: item 2: I checked and it seems this was already present in KG2c-5-2, so I don't think it's a new issue (that same query returns 2785 records in KG2c-5-2). the KG2c node name is the 'preferred_name' according to the synonymizer - I verified that the synonymizer for some reason seems to say the preferred_name is empty for these nodes. e.g.:

    "id": {
      "SRI_normalizer_category": "biolink:ChemicalSubstance",
      "SRI_normalizer_curie": "CHEMBL.COMPOUND:CHEMBL3302426",
      "SRI_normalizer_name": "CHEMBL3302426",
      "category": "biolink:ChemicalSubstance",
      "identifier": "CHEMBL.COMPOUND:CHEMBL3302426",
      "name": ""
    },

I'd say neither of these issues seems to be a show-stopper for rolling out KG2.6.3, since the first one affects such a small number of nodes and the second appears to be a bug that was already present (and also doesn't affect a huge number of nodes).

chunyuma commented 3 years ago

for item 2, although this was already present in kg2c-5-2, it doesn't make sense that the preferred name is empty while the node has other names that are not empty.

amykglen commented 3 years ago

yep, I agree, and it's something to fix in the synonymizer. I was just pointing out that it doesn't seem to be a recent change that caused it.

amykglen commented 1 year ago

@chunyuma - I believe both of the issues you reported here are fixed in KG2.8.0.1c:

  1. I can't check all of the names you listed because names have changed since you reported this, but the 'Leukemia Virus, Bovine' example is fixed (UMLS:C0006069 and MESH:D001909 are now in the same cluster)
  2. any node that has all_names also has a name (i.e., your query now returns no results)
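
The check in item 2 could also be run as a regression test on exported nodes. This is only a sketch, assuming each node is a dictionary with `id`, `name`, and `all_names` keys; `nodes_missing_name` and the `EXAMPLE:` records are hypothetical, and the first record mirrors a row from the original report rather than the current graph.

```python
def nodes_missing_name(nodes):
    """Return ids of nodes that have entries in all_names but no usable name."""
    return [
        node["id"]
        for node in nodes
        # Assumption: None and "" both count as a missing name here.
        if not node.get("name") and node.get("all_names")
    ]


sample_nodes = [
    {"id": "CHEMBL.COMPOUND:CHEMBL183041", "name": None,
     "all_names": ["Aptivus", "TIPRANAVIR", "Tipranavir"]},
    {"id": "EXAMPLE:0002", "name": "Hypothetical Node", "all_names": ["Hypothetical Node"]},
    {"id": "EXAMPLE:0003", "name": None, "all_names": None},
]

missing = nodes_missing_name(sample_nodes)
print(missing)
```

An empty result corresponds to the query returning no records. Unlike the Cypher query, which tests only for NULL, this sketch also flags empty-string names.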

can we close this issue?

edeutsch commented 3 months ago

I think this is okay to close.