TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License
9 stars 6 forks source link

Reproducible error when using drug_chemical_conflate #221

Closed gaurav closed 6 months ago

gaurav commented 8 months ago

As reported by @EvanDietzMorris, querying PUBCHEM.COMPOUND:440055 without drug_chemical_conflate works fine: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=false&description=false

But querying it with drug_chemical_conflate turned on consistently causes an error during processing: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=true&description=false

gaurav commented 8 months ago

The problem appears to be that PUBCHEM.COMPOUND:440055 is conflated with RXCUI:2642190, which is missing from NodeNorm for some reason: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=RXCUI:2642190&conflate=true&drug_chemical_conflate=false&description=false

This apparently causes an exception to be thrown when trying to figure out the types of this identifier:

[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
  File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
    eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
  File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
    a = app.state.toolkit.get_ancestors(input_type)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
    case = guess_casing(s)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
    if "_" in s:
TypeError: argument of type 'NoneType' is not iterable
gaurav commented 8 months ago

It looks like there are 41 cliques on https://nodenormalization-sri.renci.org/ (current NodeNorm Dev) that currently cause this error:

EvanDietzMorris commented 8 months ago

more of them: PUBCHEM.COMPOUND:5352133 PUBCHEM.COMPOUND:5353622 PUBCHEM.COMPOUND:5742832 PUBCHEM.COMPOUND:5702160 PUBCHEM.COMPOUND:162533872 PUBCHEM.COMPOUND:5284447 PUBCHEM.COMPOUND:5479530

gaurav commented 8 months ago

Note the oddity of PUBCHEM.COMPOUND:124220636, which should only be conflated with PUBCHEM.COMPOUND:165411920 and UMLS:C5402366, all three of which have type information.

Update: this appears to be caused by the middle ID, but then how come it works when conflation is turned off without problems?

No type information found for 'PUBCHEM.COMPOUND:165411920' with eqids: None.

Update: PUBCHEM.COMPOUND:165411920 is part of the PUBCHEM.COMPOUND:124220636 clique. Perhaps that is confusing the conflation algorithm somehow?

cbizon commented 8 months ago

Right, so PUBCHEM.COMPOUND:165411920 really doesn't have a type. The type database only uses the preferred identifier to retrieve the type. The problem is that the conflation contains something other than a preferred identifier. In conflation, it is assuming that those are all preferred, and will therefore all have entries in the right places.

It works when you just go with *20 because it first hits the main index, finds the preferred ID and then uses that to go look up the type.

The upshot of which: Your more careful response to the error is all that can be done in nodenorm, and the real problem is in Babel (oops).

gaurav commented 8 months ago

Here are the cliques for those three identifiers from our previous run. As you said, it's only two cliques, since PUBCHEM.COMPOUND:124220636 is the clique leader for PUBCHEM.COMPOUND:165411920.

{"type": "biolink:MolecularMixture", "identifiers": [{"i": "PUBCHEM.COMPOUND:124220636", "l": "Copper dotatate Cu-64", "d": []}, {"i": "PUBCHEM.COMPOUND:165411920", "l": "Detectnet", "d": []}, {"i": "CHEMBL.COMPOUND:CHEMBL4297339", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "UNII:N3858377KC", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "DRUGBANK:DB15873", "d": []}, {"i": "MESH:C000718307", "l": "copper dotatate CU-64", "d": []}, {"i": "MESH:C575629", "l": "64Cu-DOTATATE", "d": []}, {"i": "DrugCentral:5411", "l": "copper dotatate Cu-64", "d": []}, {"i": "HMDB:HMDB0304900", "l": "Copper Cu 64 Dotatate", "d": []}, {"i": "INCHIKEY:IJRLLVFQGCCPPI-NVGRTJHCSA-L", "d": []}, {"i": "UMLS:C3502191", "l": "copper oxodotreotide CU-64", "d": []}, {"i": "RXCUI:2396442", "d": []}]}
{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "UMLS:C5402366", "l": "dodatate", "d": []}, {"i": "RXCUI:2396443", "d": []}]}

The Babel regeneration stalled sometime this morning, but I've restarted it now. Once that's done I'll confirm that it also has the incorrect conflation, but I think I can assume that that will be the case and look for issues in the DrugChemical conflation code in the meantime.

gaurav commented 8 months ago

I've modified the code in PR https://github.com/TranslatorSRI/Babel/pull/191 to skip identifiers where the object isn't present in either drug_rxcui_to_clique or chemical_rxcui_to_clique, which are mappings from RXCUIs to clique leaders. This seems to have eliminated the PUBCHEM.COMPOUND identifiers that we don't actually want conflated, but I'm pretty sure that it's skipping too many non-RXCUI mappings that we actually do want in these conflated cliques. Here is a list of the 11,737 subject-object pairs being skipped by this change: in 4,508 cases this is a mapping from an identifier to itself (e.g. PUBCHEM.COMPOUND:6914273 to PUBCHEM.COMPOUND:6914273), while in the other 7,229 cases it's between two different identifiers.

The dumb next move would to remove the RXCUI filters entirely from load_cliques(), so then we can figure out the clique leader for every identifier we want to conflate (at the cost of loading all the chemical identifiers back into memory, but whatever). @cbizon Do you think that makes sense, or can you come up with a smarter way to make sure the DrugChemical file only contains clique leaders?

cbizon commented 8 months ago

If we were concerned about keeping memory low, the only other thing I can think of would be to generate the DrugChemical conflation as before, but then do a post-process on it where we load that whole thing into memory and then cycle over the chemical cliques. Each time we read a new clique leader we check the memory DC conflation, and if we find it in there, then we move it over into a new, cleaned conflation, which we write out at the end. But honestly, it seems like overkill. Probably best just to load all the clique leader IDs and be done with it.

I think you're right that the code in the PR is removing too much - the object for some of these is still a clique leader, even if it's not in those maps.

gaurav commented 7 months ago

I've been thinking about this issue some, and I wonder if there's a simpler, more comprehensive fix that we can make in NodeNorm itself. The error is the following:

[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
  File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
    eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
  File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
    a = app.state.toolkit.get_ancestors(input_type)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
    case = guess_casing(s)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
    if "_" in s:
TypeError: argument of type 'NoneType' is not iterable

Which appears to be triggered by these lines in NodeNorm:

The issue is in the following chunk of code:

https://github.com/TranslatorSRI/NodeNormalization/blob/68096b2f16e6c2eedb699178ace71cea98dc794f/node_normalizer/normalizer.py#L485-L486

So it looks like the one of the values in types is coming out as None.

What would happen if we changed any Nones in that list to biolink:NamedThing? Since the conflation code combines types from every identifier in the conflation, that should skip over any IDs that aren't clique leaders, and as long as at least one clique leader has a type, we should end up using that for the entire conflation. And if there are any conflations without a single clique leader, we would report that as having a type of biolink:NamedThing without causing a 500 error, which would still flag to us that something is wrong.

@cbizon Do you think that would be a better solution than my overkill PR (PR https://github.com/TranslatorSRI/Babel/pull/217)?

cbizon commented 7 months ago

I am ok with this solution in that it would work as a short-term, and it's preferable to the Babel solution that removes too much. But I think that the right answer is to fix Babel so that it doesn't make bad conflations (but in a more careful way than 217).