NCATSTranslator / minihackathons

Workflow C needs a predicate #47

Closed jh111 closed 2 years ago

jh111 commented 3 years ago

Workflow C is using a temporary workaround: querying an explicit subset of one or more KPs. This works, but it adds extra steps and complexity to the demo and to communication about the demo. It's better to query a predicate and get results from any Translator KP that supports that predicate.

We need to find or add an appropriate biolink predicate and add it to the KPs.

Multiomics EHR KP uses supervised machine learning and creates two types of edges.

jh111 commented 3 years ago

@karafecho

karafecho commented 3 years ago

@jh111 : What is the edge that COHD is using?

CaseyTa commented 3 years ago

Sorry, to clarify, COHD is still using the biolink:correlated_with predicate, but what we're doing is shoving more information into the edge attributes. Originally, we would use the non-standard query_options in the TRAPI query to indicate which type of association metric to calculate. Now, we just calculate all the association metrics (chi-square, relative frequency, and observed-expected frequency ratio) and return them all on the edge attributes. Example below.

However, we haven't determined exactly what attribute_type_ids to use yet, so we just have some placeholders in for now until we receive additional guidance. Since the attribute_type_ids are currently not unique or very specific, we will need to figure these out before they can be useful.

Using the relative_frequency_subject or relative_frequency_object attributes (original_attribute_name), we can find drugs that have a high proportion of patients with a given disease. Currently, the client would have to filter through the edges to find the high values, but in the future, edge constraints could potentially be used to do this in COHD.

            "edges": {
                "ke000000": {
                    "attributes": [
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:p_value",
                            "description": "Chi-square p-value, unadjusted. http://cohd.io/about.html",
                            "original_attribute_name": "p-value",
                            "value": 1.1969745232677346e-126,
                            "value_type_id": "EDAM:data_1669",
                            "value_url": "http://edamontology.org/data_1669"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:p_value",
                            "description": "Chi-square p-value, Bonferonni adjusted by number of pairs of concepts. http://cohd.io/about.html",
                            "original_attribute_name": "p-value adjusted",
                            "value": 6.998590340094117e-122,
                            "value_type_id": "EDAM:data_1669",
                            "value_url": "http://edamontology.org/data_1669"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_evidence",
                            "description": "Observed-expected frequency ratio. http://cohd.io/about.html",
                            "original_attribute_name": "ln_ratio",
                            "value": 3.653252276470785,
                            "value_type_id": "EDAM:data_1772"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_confidence_level",
                            "description": "Observed-expected frequency ratio 0.99% confidence interval",
                            "original_attribute_name": "ln_ratio_confidence_interval",
                            "value": [
                                2.3314964364884654,
                                4.312497905355049
                            ],
                            "value_type_id": "EDAM:data_0951"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_evidence",
                            "description": "Relative frequency, relative to the subject node. http://cohd.io/about.html",
                            "original_attribute_name": "relative_frequency_subject",
                            "value": 0.1,
                            "value_type_id": "EDAM:data_1772"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_confidence_level",
                            "description": "Relative frequency (subject) 0.99% confidence interval",
                            "original_attribute_name": "relative_freq_subject_confidence_interval",
                            "value": [
                                0.020833333333333332,
                                0.26126126126126126
                            ],
                            "value_type_id": "EDAM:data_0951"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_evidence",
                            "description": "Relative frequency, relative to the object node. http://cohd.io/about.html",
                            "original_attribute_name": "relative_frequency_object",
                            "value": 1.5,
                            "value_type_id": "EDAM:data_1772"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_confidence_level",
                            "description": "Relative frequency (object) 0.99% confidence interval",
                            "original_attribute_name": "relative_freq_object_confidence_interval",
                            "value": [
                                0.18181818181818182,
                                14.5
                            ],
                            "value_type_id": "EDAM:data_0951"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_count",
                            "description": "Observed concept count between the pair of subject and object nodes",
                            "original_attribute_name": "concept_pair_count",
                            "value": 15,
                            "value_type_id": "EDAM:data_0006"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_count",
                            "description": "Observed concept count of the subject node",
                            "original_attribute_name": "concept_count_subject",
                            "value": 150,
                            "value_type_id": "EDAM:data_0006"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:has_count",
                            "description": "Observed concept count of the object node",
                            "original_attribute_name": "concept_count_object",
                            "value": 10,
                            "value_type_id": "EDAM:data_0006"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "EDAM:operation_3438",
                            "description": "Calculated expected count of concept pair. For ln_ratio. http://cohd.io/about.html",
                            "original_attribute_name": "expected_count",
                            "value": 0.38860103626943004,
                            "value_type_id": "EDAM:operation_3438"
                        },
                        {
                            "attribute_source": "COHD",
                            "attribute_type_id": "biolink:provided_by",
                            "description": "Dataset ID within COHD",
                            "original_attribute_name": "dataset_id",
                            "value": 3,
                            "value_type_id": "EDAM:data_1048"
                        }
                    ],
                    "object": "UMLS:C4039003",
                    "predicate": "biolink:correlated_with",
                    "subject": "MONDO:0021113"
                }
            }
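
The client-side filtering described above could look roughly like the following. This is only a sketch: the function name `edges_with_high_attribute`, the 0.05 threshold, and the second edge `ke000001` are made up for illustration, and it assumes the edges have the same shape as the example above.

```python
from typing import Any, Dict, List


def edges_with_high_attribute(
    edges: Dict[str, Dict[str, Any]],
    attribute_name: str,
    threshold: float,
) -> List[str]:
    """Return IDs of edges whose named attribute has a numeric value >= threshold."""
    hits = []
    for edge_id, edge in edges.items():
        for attr in edge.get("attributes", []):
            value = attr.get("value")
            if (
                attr.get("original_attribute_name") == attribute_name
                and isinstance(value, (int, float))
                and not isinstance(value, bool)  # bool is a subclass of int
                and value >= threshold
            ):
                hits.append(edge_id)
                break  # one matching attribute is enough for this edge
    return hits


# Trimmed-down copy of the edge above, plus one hypothetical low-value edge.
example_edges = {
    "ke000000": {
        "attributes": [
            {"original_attribute_name": "relative_frequency_subject", "value": 0.1},
            {"original_attribute_name": "concept_pair_count", "value": 15},
        ],
        "predicate": "biolink:correlated_with",
    },
    "ke000001": {  # hypothetical edge, not returned by COHD
        "attributes": [
            {"original_attribute_name": "relative_frequency_subject", "value": 0.001},
        ],
        "predicate": "biolink:correlated_with",
    },
}

print(edges_with_high_attribute(example_edges, "relative_frequency_subject", 0.05))
```

Matching on original_attribute_name rather than attribute_type_id is deliberate here, since the attribute_type_ids are still placeholders and several attributes share the same one.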

karafecho commented 3 years ago

Looping in @xu-hao...

ICEES is also using the predicate biolink:correlated_with and treating edge attributes similar to COHD. ICEES returns frequencies, chi square statistics, and p values (with and without corrections for multiple comparison) for drug-disease associations, which could be used to rank the associations.

One limitation of ICEES is the finite set of diseases that are represented. For instance, we currently do not expose data on patients with multiple sclerosis. So, for ICEES to be able to respond to a Workflow C query, the choice of disease will be important.

CaseyTa commented 3 years ago

FYI, Matt said he'll start working on edge attributes next week and asked us to update our examples in this spreadsheet. Just FYI in case you all have new attributes you want to update in the spreadsheet also.

I am on vacation this week, but now that the source retrieval provenance modeling work is wrapping up, I am set to tackle evidence-related edge metadata when I return next week. My plan is to use the attribute examples in the 'edge attributes' sheet of the 'attribute_types' spreadsheet that the TRAPI team collected a couple of months back as the initial set of requirements to try and support. https://docs.google.com/spreadsheets/d/1-ilDWePMLniA9Tha5J6HHHFylO5w6lZ9FFfA2Mp5oro/edit#gid=0. I have already begun mapping many of the entries in this sheet to proposed biolink properties (as you can see in my curation/notes in columns G-K). I see that the rows entered by COHD (starting at row 104) cover much of what you shared above, but there are some differences. If you would be so kind as to update the COHD rows to reflect the current set of edge attributes you report above, that would be great. (Columns A-F are those that the KPs are to fill in. Columns G and after are my own notes/curation.) Specifically, if you all have defined new attribute_types not in the sheet, add a row for them. And if there are attributes in the sheet you no longer use, strike through the text (Alt+Shift+5).

jh111 commented 3 years ago

Adding @rtroper.

vgardner-renci commented 3 years ago

Will be discussed tomorrow in Clinical Data committee @karafecho

CaseyTa commented 3 years ago

Here's a quick sample of what information may come in if we look at the relative frequency data coming from COHD. The only "strong-ish" hit from COHD is Cromolyn Sodium 20 MG/ML Oral Solution. This metric is essentially telling us that, among patients observed with Cromolyn Sodium 20 MG/ML Oral Solution, 20% have Ehlers-Danlos. The odd thing is that this finding may be specific to this dose/formulation of Cromolyn, as Cromolyn in general has a 1.4% relative frequency.

image

CaseyTa commented 3 years ago

For reference, here are some results that would come in for the basic biolink:correlated_with query (the score is the log ratio column)

image

rtroper commented 3 years ago

It's long, but what about something like: has_real_world_evidence_of_association_with or has_clinical_evidence_of_association_with?

CaseyTa commented 3 years ago

@rtroper I slightly prefer the has_real_world_evidence variant over has_clinical_evidence since this allows expansion to other forms of RWE in the future.

However, @jh111, were you hoping that TextMiner could use the same predicate? If so, then we'd probably have to drop RWE.

If we don't need to share a predicate with TextMiner, what if we use either an abstract parent predicate has_real_world_evidence or a mixin for real_world_evidence and then just have biolink:associated_with, biolink:positively_associated_with, and biolink:negatively_associated_with under that? That would make the predicate much shorter, but perhaps makes it non-obvious that it's specific to RWE unless people are already familiar with it. After scrolling through the model, it looks like Biolink doesn't shy away from long names, so perhaps the full biolink:has_real_world_evidence_of_association_with is better.

All clinical KPs could return on the same predicate, but we can also potentially ask the data modeling team to mint new association slots that we can use for edge attributes which distinguish the data from the various clinical KPs. For example, something like biolink:has_chi_square_p_value, biolink:has_ln_ratio, biolink:has_relative_frequency, and biolink:has_logistic_regression_coefficient.

rtroper commented 3 years ago

For the record, there was consensus in our 7/16 clinical WG meeting to go with has_real_world_evidence_of_association_with. @jh111 since you weren't in the meeting, let us know if you have any objections. We plan on inviting Matt Brush to the next clinical WG meeting to talk about adding this predicate to the biolink model.

karafecho commented 3 years ago

+1

Note that Matt B. has confirmed attendance at our next clinical data committee meeting on 7/23. I plan to invite Sierra M., too.

sierra-moxon commented 3 years ago

Hi @rtroper @karafecho @jh111 @mikebada @mbrush :) - We discussed this in the predicates working group today too. We were hoping to surface our discussions at the DM call next week (or the following week).

In particular, we've been operating under the following guidelines to classify something as a predicate vs. as an edge qualifier (aka: edge property): 1) Predicates would be the "what you know" about an association. 2) Qualifiers of the edge (aka: edge properties, aka association slots) would be the "how you know" about an association.

So in this case, we would be prompted to make an "associated with" predicate (though we need to have a clear definition for "associated with" if it is different than "related to" and different than "correlated with"), and an epistemic qualifier "supported by real-world evidence" (and another, "supported by clinical evidence" if needed).

The hierarchy of predicates for this example would be something like this:

    Associated with (needs definition):
       Correlated with (statistical dependence):
          Negatively correlated with (mixin for multiple parentages)
          Positively correlated with (mixin)
       Negatively associated with:
          Negatively correlated with (mixin for multiple parentages)
       Positively associated with:
          Positively correlated with (mixin)

Any of these predicates could have the epistemic qualifier 'supported by real-world evidence', etc.

Would you be willing to come up with definitions for "associated with" (as well as "negatively associated with" and "positively associated with")?

related to
correlated with

karafecho commented 3 years ago

Looping in @CaseyTa and @xu-hao ...

CaseyTa commented 3 years ago

Thanks @sierra-moxon! I think going with an additional epistemic qualifier on the existing biolink:correlated_with or on a new biolink:associated_with predicate could work for COHD.

How would this look in TRAPI? Would this be an edge_attribute with something like attribute_type_id "biolink:supported_by_real-world_evidence" and value True? If so, then we would likely need to have TRAPI edge constraints implemented by all KPs in order to target a query towards clinical data KPs.

Is there / could there be an epistemic qualifier to capture the notion of "predicted by a machine learning model" also? Or would you recommend encoding this in some other manner? This could be of interest for OpenPredict (predicting drug treats disease relationships from a variety of vector embedding sources) and EHR Risk Provider. Can multiple qualifiers be True at the same time? For example, both "predicted by a machine learning model" and "supported by RWE" if the ML prediction was performed on real-world data?

Regarding the definition of has_real_world_evidence_of_association_with, I don't have a precise definition, but in my mind, I was viewing it somewhere between related_to and correlated_with: below related_to because this edge is supported by real-world data, and above correlated_with since correlated_with doesn't very accurately capture the semantics from EHR Risk Provider's logistic regression model. I'm not sure if that's how others were also interpreting it, so please chime in.

Instead of a new associated_with predicate, we could potentially mint a new predicate that's more meaningful for EHR Risk Provider while ICEES and COHD continue to use correlated_with. We could potentially target a query to all clinical KPs by using related_to in conjunction with the supported_by_real-world_evidence qualifier, but again, this risks pulling in data from many KPs unless all KPs implement constraints in time. I'm very open to other suggestions.
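
To make the qualifier idea concrete, here is a rough sketch of how a client could filter on a boolean qualifier if KPs do not implement constraints in time. Both attribute IDs below are placeholders taken from this discussion, not confirmed Biolink terms, and the helper names and sample edges are made up. It assumes the qualifier would be reported as an ordinary edge attribute with a value of True.

```python
from typing import Any, Dict

# Placeholder IDs from this thread; neither is a confirmed Biolink term.
RWE = "biolink:supported_by_real-world_evidence"
ML = "biolink:predicted_by_machine_learning_model"


def has_true_qualifier(edge: Dict[str, Any], qualifier_id: str) -> bool:
    """True if the edge has an attribute with this type ID whose value is True."""
    return any(
        attr.get("attribute_type_id") == qualifier_id and attr.get("value") is True
        for attr in edge.get("attributes", [])
    )


def keep_qualified(
    edges: Dict[str, Dict[str, Any]], qualifier_id: str
) -> Dict[str, Dict[str, Any]]:
    """Keep only the edges that carry the given qualifier set to True."""
    return {
        edge_id: edge
        for edge_id, edge in edges.items()
        if has_true_qualifier(edge, qualifier_id)
    }


# Hypothetical results of a broad related_to query.
sample_edges = {
    "e0": {
        "predicate": "biolink:related_to",
        "attributes": [
            # Both qualifiers True at once: an ML prediction made on real-world data.
            {"attribute_type_id": RWE, "value": True},
            {"attribute_type_id": ML, "value": True},
        ],
    },
    "e1": {"predicate": "biolink:related_to", "attributes": []},
}

print(sorted(keep_qualified(sample_edges, RWE)))
```

The downside is the one noted above: the broad query still pulls back every related_to edge before the client throws most of them away.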

rtroper commented 3 years ago

I also like the idea of using additional qualifiers on the base edge predicate. This seems like a good way to avoid the proliferation of new predicates for every nuanced variation one can think of. I have similar questions to Casey: Would this be implemented as an additional edge attribute, and would KPs and ARAs be able to implement/support this in the relatively near future?

As for defining a new, fairly generic edge predicate like associated_with or using an existing predicate like related_to, I might lean, for now, toward just using related_to. This is very broad, but with the qualifier, it might be okay. Then again, there's the point that Casey made about how soon KPs could implement support for the qualifier.

For now, I don't have a definition of associated_with that would clearly distinguish it from related_to, but I'll give it some thought. I'll also think about other possible predicates.

sierra-moxon commented 3 years ago

@CaseyTa @rtroper these are great questions! We are definitely walking a bit of a line here between a boolean qualifier that just says 'supported by real-world evidence' and actually quantifying that evidence (possibly with p-values, etc). Are these two use cases (the one where we want to report the statistical edge properties like p-value, and the one where we want to classify an edge as 'real-world' vs. 'clinical') independent? Will be good to hash through these in the meetings this week. :)

karafecho commented 3 years ago

I agree with the comments posted by Casey and Ryan.

A couple of additional points:

  1. Is there / could there be an epistemic qualifier to capture the notion of "predicted by a machine learning model" also? Or would you recommend encoding this in some other manner? This could be of interest for OpenPredict (predicting drug treats disease relationships from a variety of vector embedding sources) and EHR Risk Provider.
  1. In terms of a definition of associated_with, I also don't have a clear definition, just a mental model that is very similar to Casey's.

  2. In my mind, the evidence on statistical properties is independent of the qualifier supported_by_real_world_evidence.

tursynay commented 3 years ago

Clin Data Committee discussing 7/23, please update after this discussion. Sierra and Matt will be attending this as well

mikebada commented 3 years ago

We've already discussed having a predicted qualifier, so I think creating and using a subsumed predicted_by_machine_learning_model qualifier (and perhaps other subtypes) would be fine if people would find that useful...

If these qualifiers are implemented as booleans, as @CaseyTa brought up, we could name these, e.g., is_predicted, is_predicted_by_machine_learning_model, is_supported_by_real_world_evidence, which could be assigned values of true as needed.

We've also briefly discussed folding these qualifiers into the larger suite of EPC attributes.

tursynay commented 3 years ago

More discussions to follow at the Clin Data Committee, not done

karafecho commented 3 years ago

This issue was discussed with Sierra M. and Matt B. during last week's clinical data committee meeting. Will also be discussed during today's DM call and tomorrow's clinical data committee meeting. Sierra will be putting forward a proposal with several options for resolving this issue.

sierra-moxon commented 2 years ago

Proposal (hierarchy below):

    related_to
       correlated_with
       has_real_world_evidence_of_association_with

TODO: refine the description of this predicate so that it conveys temporary status and how we should use it to return/minimize actionable results for workflow C and B.
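
As a small illustration of what the proposed hierarchy implies for queries, the sketch below walks the parent/child links. The `PARENT` table is copied from the proposal above; the `descendants` helper is made up, and it is an assumption that a KP or ARA answering a related_to query would expand it to child predicates in this way.

```python
from typing import Dict, Optional, Set

# Child -> parent, copied from the proposed hierarchy above.
PARENT: Dict[str, Optional[str]] = {
    "related_to": None,
    "correlated_with": "related_to",
    "has_real_world_evidence_of_association_with": "related_to",
}


def descendants(predicate: str) -> Set[str]:
    """Return the predicate itself plus everything below it in PARENT."""
    found = {predicate}
    changed = True
    while changed:  # repeat until no new children are discovered
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


print(sorted(descendants("related_to")))
print(sorted(descendants("correlated_with")))
```

Under that assumption, a query on related_to would reach both children, while a query on has_real_world_evidence_of_association_with would reach only edges asserted with the new predicate.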

karafecho commented 2 years ago

Note that the Clinical Data Committee reached consensus on adoption of the new predicate/hierarchy by way of informal vote during today's meeting, during which time only a subset of committee members was present. However, given the timeline for the December demo, as well as the fact that this is a temporary fix, I think that an informal vote is sufficient, but please let me know if others disagree.

sierra-moxon commented 2 years ago

    has real world evidence of association with:
      is_a: related to
      description: >-
        this suggests the person has the disease in combination with other triples that use this predicate
      in_subset:

added in biolink model release 2.2.1 (for temporary use - will be refactored in later releases).

karafecho commented 2 years ago

@sierra-moxon: I have a couple of questions.

  1. I'm not sure I like the proposed description, as it suggests a specific analytic approach to define the assertion, but it does not provide sufficient details to be useful, imho. I think the original intent was to emphasize the "real-world evidence", so how about "this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc" or (shorter version) "this means that the assertion was derived from clinical data such as EHR data, survey data, etc".
  2. Will Biolink 2.2.1 be in use consortium-wide for the Sept/Dec demo?

Thanks!

tursynay commented 2 years ago

predicate needed for the workflow C to run is dependent on the qualifiers that the DM team is working on. Decision for now is a stopgap predicate that will work for this case. This predicate has been added to Biolink. However these predicates are in the view version of BL. So we need to move to the new version of BL 2.2.1. Note, this version has some other changes. Possible to issue BL 2.1.1 and have these predicates be part of that if we are worried about 2.2.1 breaking things. @vgardner-renci to add this topic to the Architecture agenda.

karafecho commented 2 years ago

Please note that this issue applies to both Workflow B and Workflow C, even though it originated with Workflow C. Also, please note that we need to resolve the definition of the new predicate, in addition to the versioning issues.

sierra-moxon commented 2 years ago

Hi @karafecho :) - yep happy to go with whichever description the group provides. We discussed 2.2.1 in the DM call today and decided to go up to 2.2.1 in the demos. From now on, however, everyone will stay on 2.2.x -- any bug fixes needed for the demo will go in the 2.2.x release space (can go into more detail offline if you want :)).

karafecho commented 2 years ago

@sierra-moxon : The committee voted on a definition for biolink:has_real_world_evidence_of_association_with and landed on this one "this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc" Thanks!

Note that this issue can be closed after you pull the definition.

mbrush commented 2 years ago

"this means that the assertion was derived by applying statistical and machine learning models to clinical data such as EHR data, survey data, etc"

Thanks for the definition. This makes it more clear to me what the term 'real world evidence' is being used to represent, and points to a path forward for modeling this using qualifiers / edge properties in future iterations. For now, a couple quick questions / suggestions:

  1. Replace "statistical and machine learning models" with "statistical methods or machine learning models" (I assume this is an OR situation, not AND)
  2. Do you want to limit the supporting data to clinical data? Does this preclude the predicate from being used for some of the Multiomics associations that are based on omics / HTP screening data rather than clinical/EHR data?

karafecho commented 2 years ago

Thanks for your comments, Matt.

WRT (1), yes, you are correct that this is an OR situation.

WRT (2), I think we can stick with clinical data, as the committee had no concerns about that being too restrictive, plus the definition includes "etc", so I think we're fine.

mbrush commented 2 years ago

Thanks Kara. re (2), as written the "etc" implies other types of clinical data besides those explicitly listed. Not types of data besides clinical data. If you all agree that this is OK, that is fine.

Just pointing it out b/c I do think we will want this idea of 'real-world-evidence-based associations' to cover correlations/predictions based on non-clinical instance data from things like omics studies or high-throughput cell-based analyses of drug response, which I believe the Multiomics Team and other sources are providing or will provide. Happy to leave this be until after the demo, but I am curious what Multiomics KP folks have to say here (@gloriachin, @jh111).

jh111 commented 2 years ago

Thanks for asking, Matt. I agree with Kara; it is fine to note clinical data for now. Only Multiomics EHR KP is using this at this time. Multiomics Big GIM and Multiomics Wellness are using other predicates. This is a temporary predicate that will be better addressed by multiple qualifiers in the future, so there's no one way to address all issues with this predicate at this time. Note: we'll remove it in the future once we have qualifiers, and after we close this, we may want to add another postponed issue to deprecate this predicate.

Small detail: It should be the originally suggested statistical and machine learning models, given these both overlap. For example, logistic regression can be both.

karafecho commented 2 years ago

Let's just stick with the definition that the committee approved, as documented in the meeting minutes. The OR vs AND issue will become a moot point after we move to a more elegant modeling solution.

Thanks, all!

vgardner-renci commented 2 years ago

@karafecho is this ready to close?

karafecho commented 2 years ago

Yes, I believe so.

rtroper commented 2 years ago

Note, since this new predicate is in v2.2 and KPs/ARAs have just moved to v2.1, it appears we'll want to wait to update our relevant queries for workflows B and C. I already changed query C.1 to use the new predicate (the biolink version difference was an oversight on my part) and now we're getting 0 results. I'll have to revert to related_to for now. Anyone have an idea of timeframe for moving to biolink v2.2?
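
For anyone following along, the predicate swap amounts to something like the sketch below. This is not the actual query C.1: the `one_hop_query` helper, the node categories, and the disease ID (borrowed from the COHD example earlier in this thread) are placeholders, and the query graph layout assumes the TRAPI 1.1-style `ids` / `categories` / `predicates` fields.

```python
import json
from typing import Any, Dict

NEW_PREDICATE = "biolink:has_real_world_evidence_of_association_with"
FALLBACK_PREDICATE = "biolink:related_to"


def one_hop_query(disease_id: str, predicate: str) -> Dict[str, Any]:
    """Build a one-hop disease-to-drug query that differs only in its predicate."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [disease_id], "categories": ["biolink:Disease"]},
                    "n1": {"categories": ["biolink:Drug"]},
                },
                "edges": {
                    "e0": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [predicate],
                    },
                },
            }
        }
    }


# Placeholder disease ID; swap the predicate once KPs/ARAs support the new one.
query = one_hop_query("MONDO:0021113", FALLBACK_PREDICATE)
print(json.dumps(query, indent=2))
```

Keeping the predicate as a single argument means the only change needed after the move to Biolink 2.2 is passing `NEW_PREDICATE` instead of `FALLBACK_PREDICATE`.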

karafecho commented 2 years ago

Ryan: As we discussed during today's mini-hackathon, I believe all ARAs are fine with moving to Biolink 2.2.1 (or at least adopting the new predicate), so we should be able to stick with the original plan for Workflows B and C. Thanks!