Closed gaurav closed 6 months ago
The problem appears to be that PUBCHEM.COMPOUND:440055 is conflated with RXCUI:2642190, which is missing from NodeNorm for some reason: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=RXCUI:2642190&conflate=true&drug_chemical_conflate=false&description=false
This apparently causes an exception to be thrown when trying to figure out the types of this identifier:
[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
types = [get_ancestors(app, t) for t in types]
File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
types = [get_ancestors(app, t) for t in types]
File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
a = app.state.toolkit.get_ancestors(input_type)
File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
case = guess_casing(s)
File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
if "_" in s:
TypeError: argument of type 'NoneType' is not iterable
It looks like there are 41 cliques on https://nodenormalization-sri.renci.org/ (current NodeNorm Dev) that currently cause this error:
more of them: PUBCHEM.COMPOUND:5352133 PUBCHEM.COMPOUND:5353622 PUBCHEM.COMPOUND:5742832 PUBCHEM.COMPOUND:5702160 PUBCHEM.COMPOUND:162533872 PUBCHEM.COMPOUND:5284447 PUBCHEM.COMPOUND:5479530
Note the oddity of PUBCHEM.COMPOUND:124220636, which should only be conflated with PUBCHEM.COMPOUND:165411920 and UMLS:C5402366, all three of which have type information.
Update: this appears to be caused by the middle ID, but then how come it works when conflation is turned off without problems?
No type information found for 'PUBCHEM.COMPOUND:165411920' with eqids: None.
Update: PUBCHEM.COMPOUND:165411920 is part of the PUBCHEM.COMPOUND:124220636 clique. Perhaps that is confusing the conflation algorithm somehow?
Right, so PUBCHEM.COMPOUND:165411920 really doesn't have a type. The type database only uses the preferred identifier to retrieve the type. The problem is that the conflation contains something other than a preferred identifier. In conflation, it is assuming that those are all preferred, and will therefore all have entries in the right places.
It works when you just go with *20 because it first hits the main index, finds the preferred ID and then uses that to go look up the type.
The upshot of which: Your more careful response to the error is all that can be done in nodenorm, and the real problem is in Babel (oops).
Here are the cliques for those three identifiers from our previous run. As you said, it's only two cliques, since PUBCHEM.COMPOUND:124220636 is the clique leader for PUBCHEM.COMPOUND:165411920.
{"type": "biolink:MolecularMixture", "identifiers": [{"i": "PUBCHEM.COMPOUND:124220636", "l": "Copper dotatate Cu-64", "d": []}, {"i": "PUBCHEM.COMPOUND:165411920", "l": "Detectnet", "d": []}, {"i": "CHEMBL.COMPOUND:CHEMBL4297339", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "UNII:N3858377KC", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "DRUGBANK:DB15873", "d": []}, {"i": "MESH:C000718307", "l": "copper dotatate CU-64", "d": []}, {"i": "MESH:C575629", "l": "64Cu-DOTATATE", "d": []}, {"i": "DrugCentral:5411", "l": "copper dotatate Cu-64", "d": []}, {"i": "HMDB:HMDB0304900", "l": "Copper Cu 64 Dotatate", "d": []}, {"i": "INCHIKEY:IJRLLVFQGCCPPI-NVGRTJHCSA-L", "d": []}, {"i": "UMLS:C3502191", "l": "copper oxodotreotide CU-64", "d": []}, {"i": "RXCUI:2396442", "d": []}]}
{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "UMLS:C5402366", "l": "dodatate", "d": []}, {"i": "RXCUI:2396443", "d": []}]}
The Babel regeneration stalled sometime this morning, but I've restarted it now. Once that's done I'll confirm that it also has the incorrect conflation, but I think I can assume that that will be the case and look for issues in the DrugChemical conflation code in the meantime.
I've modified the code in PR https://github.com/TranslatorSRI/Babel/pull/191 to skip identifiers where the object isn't present in either drug_rxcui_to_clique or chemical_rxcui_to_clique, which are mappings from RXCUIs to clique leaders. This seems to have eliminated the PUBCHEM.COMPOUND identifiers that we don't actually want conflated, but I'm pretty sure that it's skipping too many non-RXCUI mappings that we actually do want in these conflated cliques. Here is a list of the 11,737 subject-object pairs being skipped by this change: in 4,508 cases this is a mapping from an identifier to itself (e.g. PUBCHEM.COMPOUND:6914273 to PUBCHEM.COMPOUND:6914273), while in the other 7,229 cases it's between two different identifiers.
The dumb next move would to remove the RXCUI filters entirely from load_cliques(), so then we can figure out the clique leader for every identifier we want to conflate (at the cost of loading all the chemical identifiers back into memory, but whatever). @cbizon Do you think that makes sense, or can you come up with a smarter way to make sure the DrugChemical file only contains clique leaders?
If we were concerned about keeping memory low, the only other thing I can think of would be to generate the DrugChemical conflation as before, but then do a post-process on it where we load that whole thing into memory and then cycle over the chemical cliques. Each time we read a new clique leader we check the memory DC conflation, and if we find it in there, then we move it over into a new, cleaned conflation, which we write out at the end. But honestly, it seems like overkill. Probably best just to load all the clique leader IDs and be done with it.
I think you're right that the code in the PR is removing too much - the object for some of these is still a clique leader, even if it's not in those maps.
I've been thinking about this issue some, and I wonder if there's a simpler, more comprehensive fix that we can make in NodeNorm itself. The error is the following:
[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
types = [get_ancestors(app, t) for t in types]
File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
types = [get_ancestors(app, t) for t in types]
File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
a = app.state.toolkit.get_ancestors(input_type)
File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
case = guess_casing(s)
File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
if "_" in s:
TypeError: argument of type 'NoneType' is not iterable
Which appears to be triggered by these lines in NodeNorm:
The issue is in the following chunk of code:
So it looks like the one of the values in types
is coming out as None.
What would happen if we changed any Nones in that list to biolink:NamedThing
? Since the conflation code combines types from every identifier in the conflation, that should skip over any IDs that aren't clique leaders, and as long as at least one clique leader has a type, we should end up using that for the entire conflation. And if there are any conflations without a single clique leader, we would report that as having a type of biolink:NamedThing
without causing a 500 error, which would still flag to us that something is wrong.
@cbizon Do you think that would be a better solution than my overkill PR (PR https://github.com/TranslatorSRI/Babel/pull/217)?
I am ok with this solution in that it would work as a short-term, and it's preferable to the Babel solution that removes too much. But I think that the right answer is to fix Babel so that it doesn't make bad conflations (but in a more careful way than 217).
As reported by @EvanDietzMorris, querying PUBCHEM.COMPOUND:440055 without drug_chemical_conflate works fine: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=false&description=false
But querying it with drug_chemical_conflate turned on consistently causes an error during processing: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=true&description=false