biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

supporting study results for COHD #1074

Closed CaseyTa closed 2 years ago

CaseyTa commented 2 years ago

Is your feature request related to a problem? Please describe. Matt B. helped us to develop a 2-level supporting study result structure to model COHD data, but most of the proposed elements are not defined in Biolink yet. Google doc describing the model. Matt's examples (see the example set at the bottom)

What working group (or team) did this request originate from? Clinical Data Services (COHD)

Note: This is relevant for members of NCATS Translator.

Describe the solution you'd like The following may need to be defined in Biolink:

Additional information to support this request (optional) We've recently implemented this model on COHD's dev endpoint at https://cohd.io/api/query and we'd like to push these changes out to the ITRB environments soon (ideally before the Aug 29 code freeze). From our perspective, there is no rush to have these defined in Biolink immediately, since reasoners appear to be able to use the data already (e.g., the ARAX UI is able to display the results). We are open to guidance on how to move forward with this.

Sample query that can be sent to COHD to see results:

{
    "message": {
        "query_graph": {
            "nodes": {
                "subj": {
                    "ids": ["MONDO:0009061"]
                },
                "obj": {
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "subj",
                    "object": "obj",
                    "predicates": ["biolink:has_real_world_evidence_of_association_with"]
                }
            }
        }
    }
}
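
For anyone who wants to try the sample query from a script, here is a minimal sketch using only the Python standard library. The endpoint URL is the one given in this issue; the helper names `build_query` and `send_query` are made up for illustration, and the sketch assumes the endpoint accepts a JSON POST and returns JSON.

```python
import json
import urllib.request

# Dev endpoint named in this issue; it may have moved since.
COHD_TRAPI_URL = "https://cohd.io/api/query"


def build_query(subject_id: str, object_category: str, predicate: str) -> dict:
    """Build a one-hop query graph shaped like the sample query."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "subj": {"ids": [subject_id]},
                    "obj": {"categories": [object_category]},
                },
                "edges": {
                    "e0": {
                        "subject": "subj",
                        "object": "obj",
                        "predicates": [predicate],
                    }
                },
            }
        }
    }


def send_query(query: dict, url: str = COHD_TRAPI_URL, timeout: float = 60.0) -> dict:
    """POST the query as JSON and decode the JSON response (network call)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


query = build_query(
    "MONDO:0009061",
    "biolink:DiseaseOrPhenotypicFeature",
    "biolink:has_real_world_evidence_of_association_with",
)
# response = send_query(query)  # uncomment to actually call the endpoint
```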

Sample attributes from a single edge:

                    "attributes": [
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:original_knowledge_source",
                            "description": "The COHD KP defines associations between biomedical concepts based on statistical analysis of clinical/EHR data.",
                            "value": "infores:cohd",
                            "value_type_id": "biolink:InformationResource",
                            "value_url": "http://cohd.io/api/query"
                        },
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:supporting_dataset",
                            "description": "Dataset ID within COHD",
                            "original_attribute_name": "dataset_id",
                            "value": "COHD:dataset_1",
                            "value_type_id": "EDAM:data_1048"
                        },
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:supporting_study_result",
                            "attributes": [
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:concept_pair_count",
                                    "description": "Observed concept count between the pair of subject and object nodes",
                                    "original_attribute_name": "concept_pair_count",
                                    "value": 449,
                                    "value_type_id": "EDAM:data_0006"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:concept_count_subject",
                                    "description": "Observed concept count of the subject node",
                                    "original_attribute_name": "concept_count_subject",
                                    "value": 968,
                                    "value_type_id": "EDAM:data_0006"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:concept_count_object",
                                    "description": "Observed concept count of the object node",
                                    "original_attribute_name": "concept_count_object",
                                    "value": 556,
                                    "value_type_id": "EDAM:data_0006"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:supporting_dataset",
                                    "description": "Dataset ID within COHD",
                                    "original_attribute_name": "dataset_id",
                                    "value": "COHD:dataset_1",
                                    "value_type_id": "EDAM:data_1048"
                                }
                            ],
                            "description": "A study result describing the initial count of concepts",
                            "value": null,
                            "value_type_id": "biolink:ConceptCountAnalysisResult"
                        },
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:supporting_study_result",
                            "attributes": [
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:unadjusted_p-value",
                                    "description": "Chi-square p-value, unadjusted. http://cohd.io/about.html",
                                    "original_attribute_name": "p-value",
                                    "value": 1e-12,
                                    "value_type_id": "EDAM:data_1669",
                                    "value_url": "http://edamontology.org/data_1669"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:bonferonni_adjusted_p-value",
                                    "description": "Chi-square p-value, Bonferonni adjusted by number of pairs of concepts. http://cohd.io/about.html",
                                    "original_attribute_name": "p-value adjusted",
                                    "value": 1e-12,
                                    "value_type_id": "EDAM:data_1669",
                                    "value_url": "http://edamontology.org/data_1669"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:supporting_dataset",
                                    "description": "Dataset ID within COHD",
                                    "original_attribute_name": "dataset_id",
                                    "value": "COHD:dataset_1",
                                    "value_type_id": "EDAM:data_1048"
                                }
                            ],
                            "description": "A study result describing a chi-squared analysis on a single pair of concepts",
                            "value": null,
                            "value_type_id": "biolink:ChiSquaredAnalysisResult"
                        },
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:supporting_study_result",
                            "attributes": [
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:expected_count",
                                    "description": "Calculated expected count of concept pair. For ln_ratio. http://cohd.io/about.html",
                                    "original_attribute_name": "expected_count",
                                    "value": 0.3006024806317585,
                                    "value_type_id": "EDAM:operation_3438"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:ln_ratio",
                                    "description": "Observed-expected frequency ratio. http://cohd.io/about.html",
                                    "original_attribute_name": "ln_ratio",
                                    "value": 7.308989437171575,
                                    "value_type_id": "EDAM:data_1772"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:ln_ratio_99_confidence_interval",
                                    "description": "Observed-expected frequency ratio 0.99% confidence interval",
                                    "original_attribute_name": "ln_ratio_confidence_interval",
                                    "value": [
                                        7.1447659245560216,
                                        7.455795361004793
                                    ],
                                    "value_type_id": "EDAM:data_0951"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:supporting_dataset",
                                    "description": "Dataset ID within COHD",
                                    "original_attribute_name": "dataset_id",
                                    "value": "COHD:dataset_1",
                                    "value_type_id": "EDAM:data_1048"
                                }
                            ],
                            "description": "A study result describing an observed-expected frequency anaylsis on a single pair of concepts",
                            "value": null,
                            "value_type_id": "biolink:Observed-ExpectedFrequencyAnalysisResult"
                        },
                        {
                            "attribute_source": "infores:cohd",
                            "attribute_type_id": "biolink:supporting_study_result",
                            "attributes": [
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:relative_frequency_subject",
                                    "description": "Relative frequency, relative to the subject node. http://cohd.io/about.html",
                                    "original_attribute_name": "relative_frequency_subject",
                                    "value": 0.46384297520661155,
                                    "value_type_id": "EDAM:data_1772"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:relative_freq_subject_confidence_interval",
                                    "description": "Relative frequency (subject) 0.99% confidence interval",
                                    "original_attribute_name": "relative_freq_subject_confidence_interval",
                                    "value": [
                                        0.3765490943755958,
                                        0.5680539932508436
                                    ],
                                    "value_type_id": "EDAM:data_0951"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:relative_frequency_object",
                                    "description": "Relative frequency, relative to the object node. http://cohd.io/about.html",
                                    "original_attribute_name": "relative_frequency_object",
                                    "value": 0.8075539568345323,
                                    "value_type_id": "EDAM:data_1772"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:relative_freq_object_confidence_interval",
                                    "description": "Relative frequency (object) 0.99% confidence interval",
                                    "original_attribute_name": "relative_freq_object_confidence_interval",
                                    "value": [
                                        0.63915857605178,
                                        1.0181451612903225
                                    ],
                                    "value_type_id": "EDAM:data_0951"
                                },
                                {
                                    "attribute_source": "infores:cohd",
                                    "attribute_type_id": "biolink:supporting_dataset",
                                    "description": "Dataset ID within COHD",
                                    "original_attribute_name": "dataset_id",
                                    "value": "COHD:dataset_1",
                                    "value_type_id": "EDAM:data_1048"
                                }
                            ],
                            "description": "A study result describing a relative frequency anaylsis on a single pair of concepts",
                            "value": null,
                            "value_type_id": "biolink:RelativeFrequencyAnalysisResult"
                        }
                    ]
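
To make the 2-level structure concrete, here is a rough sketch of how a consumer might walk it: each top-level `biolink:supporting_study_result` attribute is keyed by its `value_type_id`, and its nested `attributes` are flattened into a name-to-value map. The function name `study_results` is hypothetical, and the sample data is an abbreviated copy of the attributes above.

```python
from typing import Any

STUDY_RESULT = "biolink:supporting_study_result"


def study_results(edge_attributes: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Map each study result's value_type_id to {nested attribute_type_id: value}."""
    results: dict[str, dict[str, Any]] = {}
    for attribute in edge_attributes:
        if attribute.get("attribute_type_id") != STUDY_RESULT:
            continue  # skip top-level attributes such as the knowledge source
        nested = attribute.get("attributes") or []
        results[attribute["value_type_id"]] = {
            sub["attribute_type_id"]: sub["value"] for sub in nested
        }
    return results


# Abbreviated copy of the sample attributes (two of the four study results).
sample = [
    {"attribute_type_id": "biolink:supporting_dataset", "value": "COHD:dataset_1"},
    {
        "attribute_type_id": STUDY_RESULT,
        "value": None,
        "value_type_id": "biolink:ConceptCountAnalysisResult",
        "attributes": [
            {"attribute_type_id": "biolink:concept_pair_count", "value": 449},
            {"attribute_type_id": "biolink:concept_count_subject", "value": 968},
            {"attribute_type_id": "biolink:concept_count_object", "value": 556},
        ],
    },
    {
        "attribute_type_id": STUDY_RESULT,
        "value": None,
        "value_type_id": "biolink:ChiSquaredAnalysisResult",
        "attributes": [
            {"attribute_type_id": "biolink:unadjusted_p-value", "value": 1e-12},
        ],
    },
]

by_type = study_results(sample)
print(by_type["biolink:ConceptCountAnalysisResult"]["biolink:concept_pair_count"])  # 449
```

Note that this keeps only one result per `value_type_id`; an edge carrying several results of the same type would need a list per key instead.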

Also, Text Mining Provider uses a similar structure and may need similar support. See the Google doc linked above, and Bill's slides from the KP of the Month presentation.

Tag relevant members for discussion @mbrush @bill-baumgartner

sierra-moxon commented 2 years ago

Some questions/comments I had while implementing this alongside the Text Mining Result and the supporting data properties in this PR:

@CaseyTa - would you mind reviewing the list above and just adding a "check" to all those that you agree with? I would also be very happy for updated descriptions/other review. :)

CaseyTa commented 2 years ago

Thanks @sierra-moxon!

I don't think I have the ability to check off the task list, so I'll comment below

Thanks, again!

mbrush commented 2 years ago

"supporting dataset" could be an alias of "supporting data source"? Yes, although I'm okay with just using "supporting data source" if that's already available

I'd be careful here. We make a distinction between data 'sources' (things that have infores ids, and are resources from which things are retrieved), and data 'sets' (simply a set of data, not a larger system or resource that provides a data set, record, etc.). I think the data sets referenced in COHD Associations are simply datasets, and I have seen identifiers for them in your data. These datasets may be served by some larger information resource, but you want to reference a specific data set, if I understand correctly.

Also, consider that in the upcoming refactor of source retrieval provenance, it is likely that knowledge source properties like 'has supporting data source' will be deprecated. Instead, we will capture source provenance using a dedicated Attribute-like object, which will have a field to capture the role of a given source using a term from an enumeration (e.g. 'aggregator source', 'supporting data source'). So we will likely no longer use edge properties like 'has aggregator source' or 'has supporting data source' in our representation of knowledge / data sources.
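
As a rough illustration only, the Attribute-like provenance object described above might look something like the following. Every name here (`SourceRole`, `SourceProvenance`, `sources_with_role`, and the example infores ids) is hypothetical; only the two role labels come from the comment, and the real refactor may define the object and its enumeration differently.

```python
from dataclasses import dataclass
from enum import Enum


class SourceRole(Enum):
    # The two role labels quoted in the comment; the real enumeration may differ.
    AGGREGATOR_SOURCE = "aggregator source"
    SUPPORTING_DATA_SOURCE = "supporting data source"


@dataclass(frozen=True)
class SourceProvenance:  # hypothetical class name
    source_id: str    # an infores identifier
    role: SourceRole  # role of this source, replacing a dedicated edge property


def sources_with_role(records: list[SourceProvenance], role: SourceRole) -> list[str]:
    """Return the ids of all sources that play the given role."""
    return [record.source_id for record in records if record.role is role]


# Placeholder infores ids, for illustration only.
records = [
    SourceProvenance("infores:example-aggregator", SourceRole.AGGREGATOR_SOURCE),
    SourceProvenance("infores:example-database", SourceRole.SUPPORTING_DATA_SOURCE),
]
print(sources_with_role(records, SourceRole.SUPPORTING_DATA_SOURCE))
```

The point of the design is that the role becomes a value on the provenance record rather than a separate property per role.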

CaseyTa commented 2 years ago

Thanks, @mbrush. Yes, we were using supporting dataset within COHD to reference the various datasets that can be analyzed within the COHD KP (information resource). Your recommendation to keep these two definitions distinct makes sense for our use case.

sierra-moxon commented 2 years ago

With regard to 'has supporting data source': do you mean this predicate (it specifies the relationship between the association and the result class)?

  has supporting study result:
    is_a: related to at instance level
    description: >-
      connects an association to an instance of supporting study result

This was taken from this diagram. Is there a better predicate now?

[Image: Screen Shot 2022-08-26 at 2 13 11 PM]

mbrush commented 2 years ago

@sierra-moxon no, I mean the 'supporting data source' property that is part of the retrieval provenance set of edge properties (see here). This property links an Association to an Infores.

This is not the same as a 'supporting dataset'. I think we should also have a 'has supporting dataset' edge property to capture a dataset that provides data supporting an Association or a StudyResult.

The 'has_supporting_study_result' property is different from both of these, and is used to connect an Association to an instance of a Study Result (as the definition says).
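
One way to picture the three properties side by side is the sketch below. The class and field names are hypothetical and are not Biolink definitions; the only rule it encodes is the one stated in this thread, that a supporting data source is an infores id while a supporting dataset is an identifier for a specific set of data. The ids in the example are borrowed from the sample above purely as placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class StudyResult:
    result_type: str
    # 'has supporting dataset' may also apply to a StudyResult.
    supporting_datasets: list[str] = field(default_factory=list)


@dataclass
class Association:
    # 'supporting data source': links the Association to an infores id.
    supporting_data_sources: list[str] = field(default_factory=list)
    # 'has supporting dataset': links to a specific dataset identifier.
    supporting_datasets: list[str] = field(default_factory=list)
    # 'has_supporting_study_result': links to StudyResult instances.
    supporting_study_results: list[StudyResult] = field(default_factory=list)

    def __post_init__(self) -> None:
        for source in self.supporting_data_sources:
            if not source.startswith("infores:"):
                raise ValueError(f"{source!r} is not an infores id")


association = Association(
    supporting_data_sources=["infores:cohd"],
    supporting_datasets=["COHD:dataset_1"],
    supporting_study_results=[
        StudyResult("biolink:ChiSquaredAnalysisResult", ["COHD:dataset_1"]),
    ],
)
```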