Closed jh111 closed 2 years ago
@karafecho
@jh111 : What is the edge that COHD is using?
Sorry, to clarify, COHD is still using the biolink:correlated_with predicate, but what we're doing is shoving more information into the edge attributes. Originally, we would use the non-standard query_options in the TRAPI query to indicate which type of association metric to calculate. Now, we just calculate all the association metrics (chi-square, relative frequency, and observed-expected frequency ratio) and return them all on the edge attributes. Example below.
However, we haven't determined exactly what attribute_type_ids to use yet, so we just have some placeholders for now until we receive additional guidance. Since the attribute_type_ids are currently not unique or very specific, we will need to figure these out before they can be useful.
Using the relative_frequency_subject or relative_frequency_object attributes (original_attribute_name), we can find drugs that have a high proportion of patients with a given disease. Currently, the client would have to filter through the edges to find the high values, but in the future, edge constraints could potentially be used to do this in COHD.
"edges": {
  "ke000000": {
    "attributes": [
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:p_value",
        "description": "Chi-square p-value, unadjusted. http://cohd.io/about.html",
        "original_attribute_name": "p-value",
        "value": 1.1969745232677346e-126,
        "value_type_id": "EDAM:data_1669",
        "value_url": "http://edamontology.org/data_1669"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:p_value",
        "description": "Chi-square p-value, Bonferroni adjusted by number of pairs of concepts. http://cohd.io/about.html",
        "original_attribute_name": "p-value adjusted",
        "value": 6.998590340094117e-122,
        "value_type_id": "EDAM:data_1669",
        "value_url": "http://edamontology.org/data_1669"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_evidence",
        "description": "Observed-expected frequency ratio. http://cohd.io/about.html",
        "original_attribute_name": "ln_ratio",
        "value": 3.653252276470785,
        "value_type_id": "EDAM:data_1772"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_confidence_level",
        "description": "Observed-expected frequency ratio 0.99% confidence interval",
        "original_attribute_name": "ln_ratio_confidence_interval",
        "value": [
          2.3314964364884654,
          4.312497905355049
        ],
        "value_type_id": "EDAM:data_0951"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_evidence",
        "description": "Relative frequency, relative to the subject node. http://cohd.io/about.html",
        "original_attribute_name": "relative_frequency_subject",
        "value": 0.1,
        "value_type_id": "EDAM:data_1772"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_confidence_level",
        "description": "Relative frequency (subject) 0.99% confidence interval",
        "original_attribute_name": "relative_freq_subject_confidence_interval",
        "value": [
          0.020833333333333332,
          0.26126126126126126
        ],
        "value_type_id": "EDAM:data_0951"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_evidence",
        "description": "Relative frequency, relative to the object node. http://cohd.io/about.html",
        "original_attribute_name": "relative_frequency_object",
        "value": 1.5,
        "value_type_id": "EDAM:data_1772"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_confidence_level",
        "description": "Relative frequency (object) 0.99% confidence interval",
        "original_attribute_name": "relative_freq_object_confidence_interval",
        "value": [
          0.18181818181818182,
          14.5
        ],
        "value_type_id": "EDAM:data_0951"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_count",
        "description": "Observed concept count between the pair of subject and object nodes",
        "original_attribute_name": "concept_pair_count",
        "value": 15,
        "value_type_id": "EDAM:data_0006"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_count",
        "description": "Observed concept count of the subject node",
        "original_attribute_name": "concept_count_subject",
        "value": 150,
        "value_type_id": "EDAM:data_0006"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:has_count",
        "description": "Observed concept count of the object node",
        "original_attribute_name": "concept_count_object",
        "value": 10,
        "value_type_id": "EDAM:data_0006"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "EDAM:operation_3438",
        "description": "Calculated expected count of concept pair. For ln_ratio. http://cohd.io/about.html",
        "original_attribute_name": "expected_count",
        "value": 0.38860103626943004,
        "value_type_id": "EDAM:operation_3438"
      },
      {
        "attribute_source": "COHD",
        "attribute_type_id": "biolink:provided_by",
        "description": "Dataset ID within COHD",
        "original_attribute_name": "dataset_id",
        "value": 3,
        "value_type_id": "EDAM:data_1048"
      }
    ],
    "object": "UMLS:C4039003",
    "predicate": "biolink:correlated_with",
    "subject": "MONDO:0021113"
  }
}
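As a sanity check, the relative-frequency and ln_ratio values in the example can be reproduced from the count attributes. A minimal sketch: the total patient count is a back-calculated assumption (it does not appear in the example), and the expected-count formula is the usual contingency-table one, which is an inference rather than something confirmed above.

```python
import math

# Counts copied from the example edge above.
concept_pair_count = 15
concept_count_subject = 150
concept_count_object = 10
total_count = 3860  # assumed dataset size, back-calculated from expected_count

# Relative frequency: fraction of patients with one concept who also have the other.
relative_frequency_subject = concept_pair_count / concept_count_subject  # 0.1
relative_frequency_object = concept_pair_count / concept_count_object    # 1.5

# Observed-expected frequency ratio (ln_ratio), assuming
# expected count = subject_count * object_count / total_count.
expected_count = concept_count_subject * concept_count_object / total_count
ln_ratio = math.log(concept_pair_count / expected_count)  # ~3.6533
```

Both relative frequencies and the ln_ratio come out equal to the attribute values shown in the edge.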
Looping in @xu-hao...
ICEES is also using the predicate biolink:correlated_with and treating edge attributes similarly to COHD. ICEES returns frequencies, chi-square statistics, and p-values (with and without corrections for multiple comparisons) for drug-disease associations, which could be used to rank the associations.
One limitation of ICEES is the finite set of diseases that are represented. For instance, we currently do not expose data on patients with multiple sclerosis. So, for ICEES to be able to respond to a Workflow C query, the choice of disease will be important.
FYI, Matt said he'll start working on edge attributes next week and asked us to update our examples in this spreadsheet, in case you all have new attributes you want to add there as well.
I am on vacation this week, but now that the source retrieval provenance modeling work is wrapping up, I am set to tackle evidence-related edge metadata when I return next week. My plan is to use the attribute examples in the 'edge attributes' sheet of the 'attribute_types' spreadsheet that the TRAPI team collected a couple of months back as the initial set of requirements to try to support: https://docs.google.com/spreadsheets/d/1-ilDWePMLniA9Tha5J6HHHFylO5w6lZ9FFfA2Mp5oro/edit#gid=0. I have already begun mapping many of the entries in this sheet to proposed Biolink properties (as you can see in my curation/notes in columns G-K). I see that the rows entered by COHD (starting at row 104) cover much of what you shared above, but there are some differences. If you would be so kind as to update the COHD rows to reflect the current set of edge attributes you report above, that would be great. (Columns A-F are those that the KPs are to fill in; columns G and after are my own notes/curation.) Specifically, if you have defined new attribute_types not in the sheet, add a row for them. And if there are attributes in the sheet you no longer use, strike through the text (Alt+Shift+5).
Adding @rtroper.
Will be discussed tomorrow in Clinical Data committee @karafecho
Here's a quick sample of what information may come in if we look at the relative frequency data coming from COHD. The only "strong-ish" hit from COHD is Cromolyn Sodium 20 MG/ML Oral Solution. This metric is essentially telling us that among patients observed with Cromolyn Sodium 20 MG/ML Oral Solution, 20% of them have Ehlers-Danlos. Although the odd thing is that this finding may be specific to this dose/formulation of Cromolyn, as Cromolyn in general has a 1.4% relative frequency.
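To make the "filter through the edges" step concrete, here is a hypothetical client-side sketch. The edge data below is fabricated to mirror the Cromolyn numbers, and strong_edges and the 0.1 cutoff are illustrative choices, not part of COHD's API:

```python
# Hypothetical client-side filter: keep edges whose COHD
# relative_frequency_subject attribute exceeds a threshold.
def strong_edges(edges, threshold=0.1):
    hits = {}
    for edge_id, edge in edges.items():
        for attr in edge.get("attributes", []):
            if (attr.get("original_attribute_name") == "relative_frequency_subject"
                    and attr.get("value", 0) >= threshold):
                hits[edge_id] = attr["value"]
    return hits

# Minimal fabricated edge set for illustration.
edges = {
    "ke000000": {"attributes": [
        {"original_attribute_name": "relative_frequency_subject", "value": 0.2}
    ]},
    "ke000001": {"attributes": [
        {"original_attribute_name": "relative_frequency_subject", "value": 0.014}
    ]},
}
print(strong_edges(edges))  # {'ke000000': 0.2}
```

With the default 0.1 cutoff, only the 20% edge survives; lowering the threshold would also pull in the weaker 1.4% association.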
For reference, here are some results that would come in for the basic biolink:correlated_with query (the score is the log ratio column).
It's long, but what about something like has_real_world_evidence_of_association_with or has_clinical_evidence_of_association_with?
@rtroper I slightly prefer the has_real_world_evidence variant over has_clinical_evidence since this allows expansion to other forms of RWE in the future.
However, @jh111, were you hoping that TextMiner could use the same predicate? If so, then we'd probably have to drop RWE.
If we don't need to share a predicate with TextMiner, what if we use either an abstract parent predicate has_real_world_evidence or a mixin for real_world_evidence and then just have biolink:associated_with, biolink:positively_associated_with, and biolink:negatively_associated_with under that? It would make the predicate much shorter, but perhaps makes it non-obvious that it's specific to RWE unless people are already familiar with it. After scrolling through the model, it looks like Biolink doesn't shy away from long names, so perhaps the full biolink:has_real_world_evidence_of_association_with is better.
All clinical KPs could return on the same predicate, but we can also potentially ask the data modeling team to mint new association slots that we can use for edge attributes which distinguish the data from the various clinical KPs. For example, something like biolink:has_chi_square_p_value, biolink:has_ln_ratio, biolink:has_relative_frequency, and biolink:has_logistic_regression_coefficient.
For the record, there was consensus in our 7/16 clinical WG meeting to go with has_real_world_evidence_of_association_with. @jh111 since you weren't in the meeting, let us know if you have any objections. We plan on inviting Matt Brush to the next clinical WG meeting to talk about adding this predicate to the biolink model.
+1
Note that Matt B. has confirmed attendance at our next clinical data committee meeting on 7/23. I plan to invite Sierra M., too.
Hi @rtroper @karafecho @jh111 @mikebada @mbrush :) - We discussed this in the predicates working group today too. We were hoping to surface our discussions at the DM call next week (or the following week).
In particular, we've been operating under the following guidelines to classify something as a predicate vs. as an edge qualifier (aka: edge property): 1) Predicates would be the "what you know" about an association. 2) Qualifiers of the edge (aka: edge properties, aka association slots) would be the "how you know" about an association.
So in this case, we would be prompted to make an "associated with" predicate (though we need to have a clear definition for "associated with" if it is different than "related to" and different than "correlated with"), and an epistemic qualifier "supported by real-world evidence" (and another, "supported by clinical evidence" if needed).
The hierarchy of predicates for this example would be something like this:
Associated with (needs definition):
    Correlated with (statistical dependence):
        Negatively correlated with (mixin for multiple parentages)
        Positively correlated with (mixin)
    Negatively associated with:
        Negatively correlated with (mixin for multiple parentages)
    Positively associated with:
        Positively correlated with (mixin)
Any of these predicates could have the epistemic qualifier 'supported by real-world evidence, etc.
Would you be willing to come up with definitions for "associated with" (as well as "negatively associated with" and "positively associated with")?
Looping in @CaseyTa and @xu-hao ...
Thanks @sierra-moxon! I think going with an additional epistemic qualifier on the existing biolink:correlated_with or on a new biolink:associated_with predicate could work for COHD.
How would this look in TRAPI? Would this be an edge attribute with something like attribute_type_id biolink:supported_by_real-world_evidence and value True? If so, then we would likely need to have TRAPI edge constraints implemented by all KPs in order to target a query towards clinical data KPs.
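If that's the direction, a query edge might carry an attribute constraint along these lines. This is only a sketch: the constraint shape follows TRAPI's AttributeConstraint (id, name, operator, value), but the CURIE biolink:supported_by_real_world_evidence is just the proposed name, not an existing Biolink term, and the exact QEdge field holding constraints has varied across TRAPI versions.

```python
# Hypothetical TRAPI query edge targeting clinical KPs via an attribute
# constraint; the qualifier CURIE is a proposal, not yet minted in Biolink.
query_edge = {
    "subject": "n0",
    "object": "n1",
    "predicates": ["biolink:correlated_with"],
    "attribute_constraints": [
        {
            "id": "biolink:supported_by_real_world_evidence",
            "name": "supported by real world evidence",
            "operator": "==",
            "value": True,
        }
    ],
}
```

A KP that implements constraints would then return only edges whose matching attribute evaluates to True, which is exactly the targeting behavior discussed above.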
Is there / could there be an epistemic qualifier to capture the notion of "predicted by a machine learning model" also? Or would you recommend encoding this in some other manner? This could be of interest for OpenPredict (predicting drug treats disease relationships from a variety of vector embedding sources) and EHR Risk Provider. Can multiple qualifiers be True at the same time? For example, both "predicted by a machine learning model" and "supported by RWE" if the ML prediction was performed on real-world data?
Regarding the definition of has_real_world_evidence_of_association_with, I don't have a precise definition, but in my mind, I was viewing it somewhere between related_to and correlated_with: below related_to because this edge is supported by real-world data, and above correlated_with since correlated_with doesn't very accurately capture the semantics from EHR Risk Provider's logistic regression model. I'm not sure if that's how others were also interpreting it, so please chime in.
Instead of a new associated_with predicate, we could potentially mint a new predicate that's more meaningful for EHR Risk Provider while ICEES and COHD continue to use correlated_with. We could potentially target a query to all clinical KPs by using related_to in conjunction with the supported_by_real-world_evidence qualifier, but again, this risks pulling in data from many KPs unless all KPs implement constraints in time. I'm very open to other suggestions.
I also like the idea of using additional qualifiers on the base edge predicate. This seems like a good way to avoid the proliferation of new predicates for every nuanced variation one can think of. I have similar questions to Casey: Would this be implemented as an additional edge attribute, and would KPs and ARAs be able to implement/support this in the relatively near future?
As for defining a new, fairly generic edge predicate like associated_with or using an existing predicate like related_to, I might lean, for now, toward just using related_to. This is very broad, but with the qualifier, it might be okay. Then again, there's the point that Casey made about how soon KPs could implement support for the qualifier.
For now, I don't have a definition of associated_with that would clearly distinguish it from related_to, but I'll give it some thought. I'll also think about other possible predicates.
@CaseyTa @rtroper these are great questions! We are definitely walking a bit of a line here between a boolean qualifier that just says 'supported by real-world evidence' and actually quantifying that evidence (possibly with p-values, etc.). Are these two use cases (the one where we want to report the statistical edge properties like p-value, and the one where we want to classify an edge as 'real-world' vs. 'clinical') independent? Will be good to hash through these in the meetings this week. :)
I agree with the comments posted by Casey and Ryan.
A couple of additional points:
In terms of a definition of associated_with, I also don't have a clear definition, just a mental model that is very similar to Casey's.
In my mind, the evidence on statistical properties is independent of the qualifier supported_by_real_world_evidence.
Clin Data Committee discussing 7/23, please update after this discussion. Sierra and Matt will be attending this as well
We've already discussed having a predicted qualifier, so I think creating and using a subsumed predicted_by_machine_learning_model qualifier (and perhaps other subtypes) would be fine if people would find that useful...
If these qualifiers are implemented as booleans, as @CaseyTa brought up, we could name these, e.g., is_predicted, is_predicted_by_machine_learning_model, is_supported_by_real_world_evidence, which could be assigned values of true as needed.
We've also briefly discussed folding these qualifiers into the larger suite of EPC attributes.
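Rendered as plain edge attributes, a KP response edge carrying one of these boolean qualifiers might look like the sketch below. The is_supported_by_real_world_evidence CURIE and the has_rwe helper are hypothetical, following the naming proposed above:

```python
# Hypothetical boolean qualifier rendered as a TRAPI edge attribute;
# the attribute_type_id uses the proposed (not yet minted) Biolink name.
rwe_qualifier = {
    "attribute_type_id": "biolink:is_supported_by_real_world_evidence",
    "value": True,
    "attribute_source": "COHD",
}
edge = {"predicate": "biolink:correlated_with", "attributes": [rwe_qualifier]}

def has_rwe(edge):
    """Check whether an edge carries the RWE qualifier set to True."""
    return any(
        a["attribute_type_id"] == "biolink:is_supported_by_real_world_evidence"
        and a["value"] is True
        for a in edge.get("attributes", [])
    )
```

A client (or an ARA) could apply has_rwe to classify edges as real-world-evidence-backed even before KPs implement query-side constraints.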
More discussions to follow at the Clin Data Committee, not done
This issue was discussed with Sierra M. and Matt B. during last week's clinical data committee meeting. Will also be discussed during today's DM call and tomorrow's clinical data committee meeting. Sierra will be putting forward a proposal with several options for resolving this issue.
Proposal (hierarchy below):
related_to
    correlated_with
    has_real_world_evidence_of_association_with
TODO: refine the description of this predicate so that it conveys temporary status and how we should use it to return/minimize actionable results for workflow C and B.
Note that the Clinical Data Committee reached consensus on adoption of the new predicate/hierarchy by way of informal vote during today's meeting, during which time only a subset of committee members was present. However, given the timeline for the December demo, as well as the fact that this is a temporary fix, I think that an informal vote is sufficient, but please let me know if others disagree.
has real world evidence of association with:
    is_a: related to
    description: >-
      this suggests the person has the disease in combination with other
      triples that use this predicate
    in_subset:
added in biolink model release 2.2.1 (for temporary use - will be refactored in later releases).
The predicate needed for workflow C to run is dependent on the qualifiers that the DM team is working on. The decision for now is a stopgap predicate that will work for this case. This predicate has been added to Biolink; however, these predicates are in the view version of BL, so we need to move to the new version, BL 2.2.1. Note that this version has some other changes. It is possible to issue BL 2.1.1 and have these predicates be part of that if we are worried about 2.2.1 breaking things. @vgardner-renci to add this topic to the Architecture agenda.
Please note that this issue applies to both Workflow B and Workflow C, even though it originated with Workflow C. Also, please note that we need to resolve the definition of the new predicate, in addition to the versioning issues.
@sierra-moxon: I have a couple of questions.
- I'm not sure I like the proposed description, as it suggests a specific analytic approach to define the assertion, but it does not provide sufficient details to be useful, imho. I think the original intent was to emphasize the "real-world evidence", so how about "this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc" or (shorter version) "this means that the assertion was derived from clinical data such as EHR data, survey data, etc".
- Will Biolink 2.2.1 be in use consortium-wide for the Sept/Dec demo?
Thanks!
Hi @karafecho :) - yep happy to go with whichever description the group provides. We discussed 2.2.1 in the DM call today and decided to go up to 2.2.1 in the demos. From now on, however, everyone will stay on 2.2.x -- any bug fixes needed for the demo will go in the 2.2.x release space (can go into more detail offline if you want :)).
@sierra-moxon : The committee voted on a definition for biolink:has_real_world_evidence_of_association_with and landed on this one: "this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc" Thanks!
Note that this issue can be closed after you pull the definition.
"this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc"
Thanks for the definition. This makes it more clear to me what the term 'real world evidence' is being used to represent, and points to a path forward for modeling this using qualifiers / edge properties in future iterations. For now, a couple quick questions / suggestions:
Thanks for your comments, Matt.
WRT (1), yes, you are correct that this is an OR situation.
WRT (2), I think we can stick with clinical data, as the committee had no concerns about that being too restrictive, plus the definition includes "etc", so I think we're fine.
Thanks Kara. re (2), as written the "etc" implies other types of clinical data besides those explicitly listed. Not types of data besides clinical data. If you all agree that this is OK, that is fine.
Just pointing it out b/c I do think we will want this idea of 'real word-evidence based associations' to cover correlations/predictions based on non-clinical instance data from things like omics studies or high-throughput cell-based analyses of drug response - which I believe the Multiomics Team and other sources are/will provide. Happy to leave this be until after the demo, but I am curious what Mutiomics KP folks have to say here (@gloriachin, @jh111).
Thanks for asking, Matt. I agree with Kara; it is fine to note clinical data for now. Only the Multiomics EHR KP is using this at this time. Multiomics Big GIM and Multiomics Wellness are using other predicates. This is a temporary predicate that will be better addressed by multiple qualifiers in the future, so there's no way to address all issues with this predicate at this time. Note: we'll remove it in the future once we have qualifiers, and after we close this, we may want to add another postponed issue to deprecate this predicate.
Small detail: it should be the originally suggested "statistical and machine learning models", given that these overlap. For example, logistic regression can be both.
Let's just stick with the definition that the committee approved, as documented in the meeting minutes. The OR vs. AND issue will become a moot point after we move to a more elegant modeling solution.
Thanks, all!
@karafecho is this ready to close?
Yes, I believe so.
Note, since this new predicate is in v2.2 and KPs/ARAs have just moved to v2.1, it appears we'll want to wait to update our relevant queries for workflows B and C. I already changed query C.1 to use the new predicate (the biolink version difference was an oversight on my part) and now we're getting 0 results. I'll have to revert to related_to for now. Anyone have an idea of timeframe for moving to biolink v2.2?
Ryan: As we discussed during today's mini-hackathon, I believe all ARAs are fine with moving to Biolink 2.2.1 (or at least adopting the new predicate), so we should be able to stick with the original plan for Workflows B and C. Thanks!
Workflow C is using a temporary workaround: querying an explicit subset of one or more KPs. This works, but it adds extra steps and complexity to the demo and to communication about the demo. It's better to query a predicate and get results from any Translator KP that supports that predicate.
We need to find or add an appropriate biolink predicate and add it to the KPs.
Multiomics EHR KP uses supervised machine learning and creates two types of edges.