NodeSynonymizer overriding DSL types?

dkoslicki commented 4 years ago

Note the following DSL:

add_qnode(name=DOID:8398, id=n00)
add_qnode(type=phenotypic_feature, id=n01)
add_qedge(source_id=n00, target_id=n01, type=has_phenotype, id=e00)
expand(edge_id=e00)
overlay(action=overlay_clinical_info, paired_concept_frequency=true)
resultify()
return(message=true, store=true)

One should expect one source disease and a bunch of phenotypic_features, but running this at arax.rtx.ai, I get a whole ton of diseases mixed in the results too.

eg: result number 1 is transient arthritis with the type disease in the results tab. Curiously though, in the summary tab, the type for transient arthritis is given as phenotypic_feature.

This might be:

A display issue (@isbluis ?)
An issue with how essences are being populated (@edeutsch ?)
An issue with the synonymizer and node "coalescing" (@edeutsch and @amykglen)

edeutsch commented 4 years ago

Almost certainly not 1. Quite possibly 2. Possibly also 3.

the root of the issue as you guessed is probably the NodeSynonymizer. There is terrible conflation of disease and phenotypic_feature. In KG2 we have:

HP:0002758  Osteoarthritis  phenotypic_feature
MESH:D010003    Osteoarthritis  named_thing
NCIT:C3293  Osteoarthritis  disease
EFO:0002506 osteoarthritis  disease
MONDO:0005178   osteoarthritis  disease
DOID:8398   osteoarthritis  disease
... and much much more...

The question that came up before (#861) is how aggressive do we want the NodeSynonymizer to be? Is there really a difference between osteoarthritis the disease vs. osteoarthritis the phenotypic_feature? I would argue no. But then queries that ask for phenotypic_features of a disease will go off the rails when they are considered synonyms. The root cause is conflation of diseases and phenotypes, but I don't know how to solve it.

isbluis commented 4 years ago

Just to add to @edeutsch's reply, the UI is quite basic in that it just displays whatever is returned by the various API calls.

In the case of transient arthritis, it is designated as a disease in the node definition of the knowledge_graph:

{
        "description": "None",
        "id": "MONDO:0002204",
        "name": "transient arthritis",
        "node_attributes": [],
        "qnode_ids": [
          "n01"
        ],
        "symbol": null,
        "type": [
          "disease"
        ],
        "uri": "http://purl.obolibrary.org/obo/HP_0003040"
      },

It is also designated as a disease by the entity API call, which populates the type in the list mechanism (e.g. when you click on the [+A] link that adds essences to a list) :

{
  "curie": "MONDO:0002204",
  "name": "transient arthritis",
  "type": "disease"
}

But the summary tab is populated by the row_data attribute within each result, and in this case it is designated as a _phenotypicfeature :

{
      "confidence": 0.555,
      "description": "No description available",
      "edge_bindings": [
        {
          "kg_id": "KG1:3724097",
          "qg_id": "e00"
        }
      ],
      "essence": "transient arthritis",
      "essence_type": "phenotypic_feature",
      "id": "https://arax.rtx.ai/api/rtx/v1/result/108472",
      "node_bindings": [
        {
          "kg_id": "MONDO:0012894",
          "qg_id": "n00"
        },
        {
          "kg_id": "MONDO:0002204",
          "qg_id": "n01"
        }
      ],
      "reasoner_id": "ARAX",
      "result_graph": null,
      "result_group": null,
      "result_group_similarity_score": null,
      "result_type": "individual query answer",
      "row_data": [
        0.555,
        "transient arthritis",
        "phenotypic_feature"
      ],
      "score": null,
      "score_direction": null,
      "score_name": null
    },

If it helps, you can typically view the json responses by examining the Dev Info tab.

amykglen commented 4 years ago

re: 3) - when expand coalesces nodes, it currently makes the coalesced nodes' type be the 'preferred type' for that synonym group according to the NodeSynonymizer (via get_canonicalized_curies()). so due to what @edeutsch noted about conflation of disease and phenotypic features, I think sometimes the 'preferred type' is 'disease', even though the original node was a 'phenotypic_feature'.

amykglen commented 4 years ago

an alternative to using the single 'preferred type' would be to assign coalesced nodes the entire list of types for that synonym group. (e.g., [disease, phenotypic_feature, named_thing].) if expand did that, then every returned n01 node's type would contain 'phenotypic_feature' (among other things) in @dkoslicki's query.

edeutsch commented 4 years ago

It's possible that the main issue lies with HP (in this case, but also in many more I think):

HP:0002758  Osteoarthritis  phenotypic_feature

I would argue that osteoarthritis is not a phenotypic feature. It is a disease or a diagnosis or a condition, but not a phenotypic feature. "Inflammation of joints" seems like a phenotypic feature. One that might be caused by disease osteoarthritis but could be caused by other things. Maybe @jaredroach or others have a more informed opinion than I do on this point.

@saramsey I wonder if there is perhaps some finer grained disambiguation in HP between diseases and phenotypic features that we are missing during ETL?

If not, perhaps one rule we could apply as a bandaid (perhaps in NodeSynonymizer or perhaps in KG2 from the start or as a later patch) is: IF an HP node has a name that is identical to other nodes from different sources (like MONDO or DOID) that are diseases, then switch its type to 'disease' instead of 'phenotypic_feature'. That might go a long way toward fixing these conflation problems.

As a general note, as I mentioned in #861, this is a pervasive issue that may hurt us. We should collect examples such as this, and maybe we can can find a rule to apply a patch to KG2 that alleviates the problem a bit.

dkoslicki commented 4 years ago

an alternative to using the single 'preferred type' would be to assign coalesced nodes the entire list of types for that synonym group. (e.g., [disease, phenotypic_feature, named_thing].) if expand did that, then every returned n01 node's type would contain 'phenotypic_feature' (among other things) in @dkoslicki's query.

This returning of a list of types might be a good workaround in the interim until we figure out #861 (as that will be no easy fix). If a user sees that phenotypic_feature shows up along-side other types, I think they would be happy (and it would be informative to know that the bioentity can be interpreted in multiple ways)

amykglen commented 4 years ago

I like that idea. If we go that route, would you be able to add the list of types to get_canonical_curies' output, @edeutsch? (I believe the type list is returned with get_normalizer_results(), but it'd be great if it was included in the speedy get_canonical_curies().)

edeutsch commented 4 years ago

I certainly could, but I'm not understanding what that would gain us. I would probably make a separate method call because it would definitely slow things down. And a lot of coalesced concepts have a whole raft of puzzling concepts.

But I'm wondering about an alternate perhaps easier solution: if you have a whole bunch of nodes that map to a certain query_graph qnode, why not just force them to be the type of the query_node? Would that simplify things? I think that's actually what happens in the summary table. All items there are phenotypic_feature because that's what the query_graph.qnode says it should be.

Insulin comes back with the following types:

    "type": [
      "protein",
      "named_thing",
      "chemical_substance",
      "anatomical_entity",
      "disease",
      "gene",
      "activity",
      "drug",
      "substance",
      "gene_set",
      "phenotypic_feature",
      "metabolite"
    ]

admittedly an extreme example.

Anyway, if y'all still think that providing the whole list is preferable to just forcing to the qnode.type, then I'll do it!

amykglen commented 4 years ago

true - I'm not thinking of a real downside to that...

(though I think I will still need a way to get full type lists from the NodeSynonymizer for the canonicalized KG2 build process, as FET will need those lists to be stored on KG2C nodes - but I can request that elsewhere.)

jaredroach commented 4 years ago

Indeed, this is a fundamental semiotic conundrum that medical education has spent centuries grappling with. I suspect most doctors have difficulties understanding the difference between disease and phenotype. That difference is drilled into them in a structured format in the first year of medical school. And absolutely 'osteoarthritis" is a disease. It is definitely not a phenotype. A red joint is a phenotype. It is absolutely wrong for HPO to call 'osteoarthritis' a 'phenotype'. Settled, right? But wait. Anyone talked to a geneticist recently? What is the phenotype of the genetic defect at the OMIM #165720 (OSTEOARTHRITIS SUSCEPTIBILITY 1; OS1) locus? It is osteoarthritis, because phenotypes in genetics can be diseases.

How do we solve this (I'm thinking of a song from the Sound of Music here). It is one example of an extremely difficult NLP problem. What does the user truly mean when they use the word 'phenotype'? Are they a doctor or a geneticist or both? How good was their education? Are they a stickler for semantics?

Practically speaking, for purposes of Team Expander, my problem is that some useful edges happen to connect to nodes of type 'phenotype' and some to type 'disease'. There doesn't seem to be any particular important difference other than the database the edges were extracted from and the metadata associated with those edges in those particular databases. My main argument for grouping everything together is to avoid false negatives when adding edges during expansion.

dkoslicki commented 4 years ago

New proposal: over-ride the KG and results so the types are what is specified in the DSL (like is currently done in the summary tab of the UI)

amykglen commented 4 years ago

alright - just implemented this in expand and pushed it to master. (all expand tests are passing as well as the entire 'fast' test suite for all modules.)

amykglen commented 4 years ago

confirmed this is working as expected on /test. (all n01 nodes have a type of phenotypic_feature in the results for the original query reported.)

RTXteam / RTX

NodeSynonymizer overriding DSL types? #987