Closed dkoslicki closed 4 years ago
Almost certainly not 1. Quite possibly 2. Possibly also 3.
The root of the issue, as you guessed, is probably the NodeSynonymizer. There is terrible conflation of disease and phenotypic_feature. In KG2 we have:
HP:0002758 Osteoarthritis phenotypic_feature
MESH:D010003 Osteoarthritis named_thing
NCIT:C3293 Osteoarthritis disease
EFO:0002506 osteoarthritis disease
MONDO:0005178 osteoarthritis disease
DOID:8398 osteoarthritis disease
... and much much more...
The question that came up before (#861) is how aggressive do we want the NodeSynonymizer to be? Is there really a difference between osteoarthritis the disease vs. osteoarthritis the phenotypic_feature? I would argue no. But then queries that ask for phenotypic_features of a disease will go off the rails when they are considered synonyms. The root cause is conflation of diseases and phenotypes, but I don't know how to solve it.
Just to add to @edeutsch's reply, the UI is quite basic in that it just displays whatever is returned by the various API calls.
In the case of transient arthritis, it is designated as a disease in the node definition of the knowledge_graph:
{
"description": "None",
"id": "MONDO:0002204",
"name": "transient arthritis",
"node_attributes": [],
"qnode_ids": [
"n01"
],
"symbol": null,
"type": [
"disease"
],
"uri": "http://purl.obolibrary.org/obo/HP_0003040"
},
It is also designated as a disease by the entity API call, which populates the type in the list mechanism (e.g. when you click on the [+A] link that adds essences to a list):
{
"curie": "MONDO:0002204",
"name": "transient arthritis",
"type": "disease"
}
But the summary tab is populated by the row_data attribute within each result, and in this case it is designated as a phenotypic_feature:
{
"confidence": 0.555,
"description": "No description available",
"edge_bindings": [
{
"kg_id": "KG1:3724097",
"qg_id": "e00"
}
],
"essence": "transient arthritis",
"essence_type": "phenotypic_feature",
"id": "https://arax.rtx.ai/api/rtx/v1/result/108472",
"node_bindings": [
{
"kg_id": "MONDO:0012894",
"qg_id": "n00"
},
{
"kg_id": "MONDO:0002204",
"qg_id": "n01"
}
],
"reasoner_id": "ARAX",
"result_graph": null,
"result_group": null,
"result_group_similarity_score": null,
"result_type": "individual query answer",
"row_data": [
0.555,
"transient arthritis",
"phenotypic_feature"
],
"score": null,
"score_direction": null,
"score_name": null
},
If it helps, you can typically view the JSON responses by examining the Dev Info tab.
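To collect more examples of this mismatch, something like the following could be run over a saved response. This is only a rough sketch: it assumes the response is a dict with a knowledge_graph containing a nodes list and a top-level results list, shaped like the snippets above, and find_type_mismatches is a hypothetical helper, not part of ARAX.

```python
def find_type_mismatches(message):
    """List results whose essence_type is missing from the essence node's type list.

    Assumes (hypothetically) that `message` has the shape
    {"knowledge_graph": {"nodes": [...]}, "results": [...]}.
    """
    nodes_by_id = {node["id"]: node for node in message["knowledge_graph"]["nodes"]}
    mismatches = []
    for result in message["results"]:
        essence_type = result.get("essence_type")
        for binding in result.get("node_bindings", []):
            node = nodes_by_id.get(binding["kg_id"])
            # Only compare against the node that is the result's essence.
            if node is None or node.get("name") != result.get("essence"):
                continue
            if essence_type not in node.get("type", []):
                mismatches.append((node["id"], node["type"], essence_type))
    return mismatches
```

For the transient arthritis result above, this would report MONDO:0002204 with KG type ["disease"] against essence_type phenotypic_feature.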
re: 3) - when expand coalesces nodes, it currently makes the coalesced nodes' type be the 'preferred type' for that synonym group according to the NodeSynonymizer (via get_canonicalized_curies()). so due to what @edeutsch noted about conflation of disease and phenotypic features, I think sometimes the 'preferred type' is 'disease', even though the original node was a 'phenotypic_feature'.
an alternative to using the single 'preferred type' would be to assign coalesced nodes the entire list of types for that synonym group (e.g., [disease, phenotypic_feature, named_thing]). if expand did that, then every returned n01 node's type would contain 'phenotypic_feature' (among other things) in @dkoslicki's query.
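A minimal sketch of the difference between the two options, assuming the synonym group's preferred type and full type list are already in hand. The function name and signature are hypothetical and do not reflect how expand or the NodeSynonymizer are actually written.

```python
def coalesce_node_types(preferred_type, group_types, use_full_list):
    """Return the type list to store on a coalesced node.

    preferred_type: the single 'preferred type' for the synonym group.
    group_types:    every type seen in the synonym group.
    use_full_list:  False reproduces the current behavior (preferred type only).
    """
    if not use_full_list:
        return [preferred_type]
    # Keep the preferred type first, then the remaining types without duplicates.
    ordered = [preferred_type]
    for node_type in group_types:
        if node_type not in ordered:
            ordered.append(node_type)
    return ordered
```

With the full list, a check such as "phenotypic_feature" in node_types succeeds even when the preferred type is disease.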
It's possible that the main issue lies with HP (in this case, but also in many more I think):
HP:0002758 Osteoarthritis phenotypic_feature
I would argue that osteoarthritis is not a phenotypic feature. It is a disease or a diagnosis or a condition, but not a phenotypic feature. "Inflammation of joints" seems like a phenotypic feature. One that might be caused by disease osteoarthritis but could be caused by other things. Maybe @jaredroach or others have a more informed opinion than I do on this point.
@saramsey I wonder if there is perhaps some finer grained disambiguation in HP between diseases and phenotypic features that we are missing during ETL?
If not, perhaps one rule we could apply as a bandaid (perhaps in NodeSynonymizer or perhaps in KG2 from the start or as a later patch) is: IF an HP node has a name that is identical to other nodes from different sources (like MONDO or DOID) that are diseases, then switch its type to 'disease' instead of 'phenotypic_feature'. That might go a long way toward fixing these conflation problems.
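Roughly, that bandaid rule might look like the sketch below. It assumes each node is a dict with id, name, and a single type string, and it matches names case-insensitively (an assumption on my part, since HP uses "Osteoarthritis" while MONDO uses "osteoarthritis"). retype_hp_nodes is a hypothetical name, not an existing KG2 or NodeSynonymizer function.

```python
def retype_hp_nodes(nodes):
    """Switch HP phenotypic_feature nodes to 'disease' when a non-HP disease shares the name.

    Assumes (hypothetically) nodes look like
    {"id": "HP:0002758", "name": "Osteoarthritis", "type": "phenotypic_feature"}.
    Returns a new list; the input nodes are not modified.
    """
    # Names of diseases that come from sources other than HP (case-insensitive).
    disease_names = {
        node["name"].lower()
        for node in nodes
        if node["type"] == "disease" and not node["id"].startswith("HP:")
    }
    patched = []
    for node in nodes:
        is_hp_phenotype = node["id"].startswith("HP:") and node["type"] == "phenotypic_feature"
        if is_hp_phenotype and node["name"].lower() in disease_names:
            node = {**node, "type": "disease"}
        patched.append(node)
    return patched
```

HP nodes with no same-named disease elsewhere keep their phenotypic_feature type, so genuine phenotypes are left alone.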
As a general note, as I mentioned in #861, this is a pervasive issue that may hurt us. We should collect examples such as this, and maybe we can find a rule to apply a patch to KG2 that alleviates the problem a bit.
an alternative to using the single 'preferred type' would be to assign coalesced nodes the entire list of types for that synonym group (e.g., [disease, phenotypic_feature, named_thing]). if expand did that, then every returned n01 node's type would contain 'phenotypic_feature' (among other things) in @dkoslicki's query.
Returning a list of types might be a good workaround in the interim until we figure out #861 (as that will be no easy fix). If a user sees that phenotypic_feature shows up alongside other types, I think they would be happy (and it would be informative to know that the bioentity can be interpreted in multiple ways).
I like that idea. If we go that route, would you be able to add the list of types to get_canonical_curies' output, @edeutsch? (I believe the type list is returned with get_normalizer_results(), but it'd be great if it was included in the speedy get_canonical_curies().)
I certainly could, but I'm not understanding what that would gain us. I would probably make a separate method call because it would definitely slow things down. And a lot of coalesced concepts have a whole raft of puzzling concepts.
But I'm wondering about an alternate perhaps easier solution: if you have a whole bunch of nodes that map to a certain query_graph qnode, why not just force them to be the type of the query_node? Would that simplify things? I think that's actually what happens in the summary table. All items there are phenotypic_feature because that's what the query_graph.qnode says it should be.
Insulin comes back with the following types:
"type": [
"protein",
"named_thing",
"chemical_substance",
"anatomical_entity",
"disease",
"gene",
"activity",
"drug",
"substance",
"gene_set",
"phenotypic_feature",
"metabolite"
]
admittedly an extreme example.
Anyway, if y'all still think that providing the whole list is preferable to just forcing to the qnode.type, then I'll do it!
true - I'm not thinking of a real downside to that...
(though I think I will still need a way to get full type lists from the NodeSynonymizer for the canonicalized KG2 build process, as FET will need those lists to be stored on KG2C nodes - but I can request that elsewhere.)
Indeed, this is a fundamental semiotic conundrum that medical education has spent centuries grappling with. I suspect most doctors have difficulties understanding the difference between disease and phenotype. That difference is drilled into them in a structured format in the first year of medical school. And absolutely 'osteoarthritis' is a disease. It is definitely not a phenotype. A red joint is a phenotype. It is absolutely wrong for HPO to call 'osteoarthritis' a 'phenotype'. Settled, right? But wait. Anyone talked to a geneticist recently? What is the phenotype of the genetic defect at the OMIM #165720 (OSTEOARTHRITIS SUSCEPTIBILITY 1; OS1) locus? It is osteoarthritis, because phenotypes in genetics can be diseases.
How do we solve this (I'm thinking of a song from the Sound of Music here). It is one example of an extremely difficult NLP problem. What does the user truly mean when they use the word 'phenotype'? Are they a doctor or a geneticist or both? How good was their education? Are they a stickler for semantics?
Practically speaking, for purposes of Team Expander, my problem is that some useful edges happen to connect to nodes of type 'phenotype' and some to type 'disease'. There doesn't seem to be any particular important difference other than the database the edges were extracted from and the metadata associated with those edges in those particular databases. My main argument for grouping everything together is to avoid false negatives when adding edges during expansion.
New proposal: override the KG and results so the types are what is specified in the DSL (as is currently done in the summary tab of the UI).
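For illustration, the override could be sketched as below. This is not the actual expand code: it assumes the query graph is a dict with a nodes list whose entries carry id and type, and it relies on the qnode_ids field seen on knowledge graph nodes earlier in the thread. override_node_types is a hypothetical name.

```python
def override_node_types(knowledge_graph, query_graph):
    """Force each KG node's type to the type(s) of the qnode(s) it fulfills.

    Assumes (hypothetically) query_graph = {"nodes": [{"id": "n01", "type": "phenotypic_feature"}, ...]}
    and that KG nodes carry a "qnode_ids" list. Modifies knowledge_graph in place.
    """
    qnode_types = {
        qnode["id"]: qnode["type"]
        for qnode in query_graph["nodes"]
        if qnode.get("type")
    }
    for node in knowledge_graph["nodes"]:
        forced = [qnode_types[qid] for qid in node.get("qnode_ids", []) if qid in qnode_types]
        # Leave the node alone if none of its qnodes specify a type.
        if forced:
            node["type"] = sorted(set(forced))
    return knowledge_graph
```

Nodes bound to a qnode without a type keep whatever type the KG gave them.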
alright - just implemented this in expand and pushed it to master. (all expand tests are passing as well as the entire 'fast' test suite for all modules.)
confirmed this is working as expected on /test. (all n01 nodes have a type of phenotypic_feature in the results for the original query reported.)
Note the following DSL:
One should expect one source disease and a bunch of phenotypic_features, but running this at arax.rtx.ai, I get a whole ton of diseases mixed in the results too. E.g., result number 1 is transient arthritis with the type disease in the results tab. Curiously though, in the summary tab, the type for transient arthritis is given as phenotypic_feature. This might be: