NCATSTranslator / minihackathons

MIT License

PMIDs in knowledge_source_attribute in Improving workflow A.1 #240

Open vgardner-renci opened 2 years ago

vgardner-renci commented 2 years ago

https://github.com/NCATSTranslator/minihackathons/blob/main/2021-12_demo/workflowA/A.1_RHOBTB2.json

https://arax.ncats.io/?source=ARS&id=a1c2f30e-8bfc-4407-ad5c-7d43d70f11eb

brettasmi commented 2 years ago

@sierra-moxon @mbrush, during the minihackathon this morning, I was directed to ask you about this.

I know we had a long discussion on Slack about this, but I guess that I haven't gotten this right just yet. I've attached the attributes field from one of the edges of one of the responses in question; the pmid is in the first attribute in the array:

"attributes": [
            {
              "attribute_source": "infores:spoke",
              "attribute_type_id": "biolink:primary_knowledge_source",
              "original_attribute_name": "pubmed",
              "value": "pmid:23779130",
              "value_type_id": "biolink:Article",
              "value_url": "https://pubmed.ncbi.nlm.nih.gov/23779130"
            },
            {
              "attribute_source": "infores:spoke",
              "attribute_type_id": "biolink:aggregator_knowledge_source",
              "original_attribute_name": "source",
              "value": "infores:civic",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:spoke",
              "attribute_type_id": "biolink:aggregator_knowledge_source",
              "value": "infores:spoke",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:improving-agent",
              "attribute_type_id": "biolink:aggregator_knowledge_source",
              "value": "infores:improving-agent",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "tier",
              "value": "D"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "variant",
              "value": "F877L"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "clin_sig",
              "value": "Resistance"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:iri",
              "original_attribute_name": "url",
              "value": "https://civic.genome.wustl.edu/links/evidence_items/447"
            }
          ]

If it's not clear, the provenance is imProving Agent <- SPOKE <- CIVIC <- pmid:23779130.

Would you mind commenting your thoughts on this?

cbizon commented 2 years ago

Specifically, I was under the impression that PMIDs were not knowledge sources, but handled some separate way. But I am not sure...

sierra-moxon commented 2 years ago

From the helpdesk this morning: CIVIC gives compound-to-gene edges. SPOKE aggregates this information from a publication associated with the CIVIC compound-to-gene assertion. The publication would be evidence for the entire compound-to-gene edge, not evidence for a specific attribute of the edge.

CIVIC might be the 'primary knowledge source' and it could have a 'publications' attribute that showed the publication evidence that CIVIC/SPOKE provided. @mbrush and @sierra-moxon will document this here.
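
A minimal Python sketch of that split, assuming the attributes are plain dicts shaped like the array posted above. The helper name `retype_pmid_sources` and the choice to promote `infores:civic` to primary are illustrative assumptions, not part of any Translator library:

```python
def retype_pmid_sources(attributes, primary="infores:civic"):
    """Return a copy of the attribute list in which PMID-valued knowledge-source
    attributes become publications and `primary` becomes the primary source.
    (Hypothetical helper, for illustration only.)"""
    fixed = []
    for attr in attributes:
        attr = dict(attr)  # copy so the input list is left untouched
        type_id = attr.get("attribute_type_id", "")
        value = str(attr.get("value", ""))
        if type_id.endswith("knowledge_source"):
            if value.lower().startswith("pmid:"):
                # A publication is evidence for the edge, not an information resource.
                attr["attribute_type_id"] = "biolink:publications"
            elif value == primary:
                # Assumption: the resource that curated the publication is the primary source.
                attr["attribute_type_id"] = "biolink:primary_knowledge_source"
        fixed.append(attr)
    return fixed


edge_attrs = [
    {"attribute_type_id": "biolink:primary_knowledge_source", "value": "pmid:23779130"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source", "value": "infores:civic"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source", "value": "infores:spoke"},
]
result = retype_pmid_sources(edge_attrs)
```

Here the PMID ends up under `biolink:publications`, CIVIC becomes the primary knowledge source, and SPOKE stays an aggregator.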

sierra-moxon commented 2 years ago

This example is saying that attribute_source is "who curated the knowledge source property". In all cases, the knowledge_sources were provided by improving-agent, not spoke or civic.

"Improving agent is the one who said civic is a primary knowledge source"
"Improving agent is the one who said spoke is an aggregator knowledge source"
"Improving agent is the one who said improving agent is an aggregator knowledge source"
"CIVIC said this publication is evidence for this entire edge"
"CIVIC said that tier, variant and clin_sig are evidence for this edge"


"attributes": [
            {
              "attribute_source": "infores:improving-agent",
              "attribute_type_id": "biolink:primary_knowledge_source",
              "value": "infores:civic",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:publications",
              "original_attribute_name": "pubmed",
              "value": "pmid:23779130",
              "value_type_id": "biolink:Article",
              "value_url": "https://pubmed.ncbi.nlm.nih.gov/23779130"
            },
            {
              "attribute_source": "infores:improving-agent",
              "attribute_type_id": "biolink:aggregator_knowledge_source",
              "value": "infores:spoke",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:improving-agent",
              "attribute_type_id": "biolink:aggregator_knowledge_source",
              "value": "infores:improving-agent",
              "value_type_id": "biolink:InformationResource"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "tier",
              "value": "D"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "variant",
              "value": "F877L"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:qualifiers",
              "original_attribute_name": "clin_sig",
              "value": "Resistance"
            },
            {
              "attribute_source": "infores:civic",
              "attribute_type_id": "biolink:iri",
              "original_attribute_name": "url",
              "value": "https://civic.genome.wustl.edu/links/evidence_items/447"
            }
          ]

brettasmi commented 2 years ago

Thanks @sierra-moxon for your replies, which have generated some new questions :)

  1. You wrote:

    SPOKE aggregates this information from a publication associated with the CIVIC compound-to-gene assertion.

This isn't quite right. SPOKE ingests CIViC, which has an evidence section for each of its assertions. The publication falls in this evidence section, e.g. https://civic.genome.wustl.edu/links/evidence_items/447

Does that change your interpretation at all?

  2. CIVIC might be the 'primary knowledge source'

As far as I know, CIVIC is "crowd-sourced" and doesn't generate knowledge except through curation, thus our use of "aggregator knowledge source." I'm not an expert on the data sources and their creation of the data so I may be wrong about their process. That said, I'm still a little fuzzy on understanding the definitions of the different knowledge source types. Does crowd-sourcing literature on variants mean aggregation or primary knowledge generation?

  3. In your example, you wrote:

    In all cases, the knowledge_sources were provided by improving-agent, not spoke or civic.

All of these attributes (except for the final imProving Agent aggregator) are found in SPOKE and would be returned by SPOKE-KP, thus they do come from SPOKE. Should the ARA overwrite these attribute sources, or was this just a simplification of your example?

Thanks!

sierra-moxon commented 2 years ago

    Thanks @sierra-moxon for your replies, which have generated some new questions :)

      1. You wrote:

        SPOKE aggregates this information from a publication associated with the CIVIC compound-to-gene assertion.

    This isn't quite right. SPOKE ingests CIViC, which has an evidence section for each of its assertions. The publication falls in this evidence section, e.g. https://civic.genome.wustl.edu/links/evidence_items/447

    Does that change your interpretation at all?

Yes, that makes sense, thanks for the link! :) Karthik wasn't sure if the publication was added via spoke or civic, but he took a look at the code and said that it was likely civic. So, in the example above, we used civic as the "attribute_source" for the actual publication 'attribute.'

      2. CIVIC might be the 'primary knowledge source'

    As far as I know, CIVIC is "crowd-sourced" and doesn't generate knowledge except through curation, thus our use of "aggregator knowledge source." I'm not an expert on the data sources and their creation of the data so I may be wrong about their process. That said, I'm still a little fuzzy on understanding the definitions of the different knowledge source types. Does crowd-sourcing literature on variants mean aggregation or primary knowledge generation?

This is a good question. I am leaning towards "yes" -- perhaps we should also consider 'original knowledge source.' Since the curator was the one that attached this edge to the publication, I think that justifies the source that did that curation as a primary or original knowledge source (I probably wouldn't try to capture the kind of curation, crowdsourcing, etc. in this attribute), at least according to our definitions of the 'original knowledge source' attribute in the model:

  primary knowledge source:
    is_a: knowledge source
    description: >-
      The most upstream source of the knowledge expressed in an Association that an implementer can identify (may or may not be the 'original' source).
    range: information resource
    multivalued: false

and original knowledge source:

  original knowledge source:
    is_a: primary knowledge source
    description: >-
      The Information Resource that created the original record of the knowledge expressed
      in an Association (e.g. via curation of the knowledge from the literature, or
      generation of the knowledge de novo through computation, reasoning, inference over
      data).
    range: information resource
    multivalued: false

      3. In your example, you wrote:

        In all cases, the knowledge_sources were provided by improving-agent, not spoke or civic.

    All of these attributes (except for the final imProving Agent aggregator) are found in SPOKE and would be returned by SPOKE-KP, thus they do come from SPOKE. Should the ARA overwrite these attribute sources, or was this just a simplification of your example?

I think attribute_source is declaring: “who curated the knowledge source property itself”. If SPOKE has edge properties for 'aggregator knowledge source' already, then attribute_source would be SPOKE, yep. Else, if the only organization that is adding the '[x] knowledge source' property is improving-agent, then I think the example above captures it.
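
That rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: attributes are plain dicts, and `knowledge_source_attribute` is a hypothetical helper, not an existing Translator or Biolink API:

```python
def knowledge_source_attribute(type_id, value, upstream_attrs, reporter):
    """Build one '[x] knowledge source' attribute (hypothetical helper).

    upstream_attrs: attributes already present on the edge returned by the KP.
    reporter: infores CURIE of the resource adding the property if it is missing.
    """
    for attr in upstream_attrs:
        if attr.get("attribute_type_id") == type_id and attr.get("value") == value:
            # The upstream resource already made this claim, so keep its attribute_source.
            return dict(attr)
    # Nobody upstream made the claim; the reporter is the one asserting it.
    return {
        "attribute_source": reporter,
        "attribute_type_id": type_id,
        "value": value,
        "value_type_id": "biolink:InformationResource",
    }


AGGREGATOR = "biolink:aggregator_knowledge_source"
spoke_attrs = [{
    "attribute_source": "infores:spoke",
    "attribute_type_id": AGGREGATOR,
    "value": "infores:spoke",
    "value_type_id": "biolink:InformationResource",
}]

kept = knowledge_source_attribute(
    AGGREGATOR, "infores:spoke", spoke_attrs, "infores:improving-agent")
added = knowledge_source_attribute(
    AGGREGATOR, "infores:improving-agent", spoke_attrs, "infores:improving-agent")
```

In this sketch `kept` retains `infores:spoke` as its attribute_source because SPOKE supplied the property itself, while `added` is attributed to improving-agent because only the ARA asserts it.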

    Thanks!

Happy to meet about this -- we can add it as an agenda item on the DM call (or in the mini-hackathon on Thursday)? :)

@mbrush

brettasmi commented 2 years ago

This is to be discussed at either the minihackathon or the data modeling call today.

mbrush commented 2 years ago

Hi @brettasmi @sierra-moxon

I agree with Sierra's responses, and the latest set of attributes looks right to me for the most part. One question and a few comments:

  1. Question: What are the plain language semantics of the assertions represented in Edges based on CIViC records? And what SPO structure is used to capture this? I gather from above that they relate a subject gene to an object chemical/drug? Using what predicate?

  2. Confirming that publications are not considered 'information resources' in the sense defined in Biolink, and thus do not belong in the 'knowledge source' properties. At present, there is a single biolink:publications property with broad semantics - that captures publications related in any way to an edge. We may refine the modeling here to be more precise - but for now this is the right property to use to say that a publication provided evidence for an assertion, or provided the assertion itself.

  3. I agree with the assessment that CIViC is the original knowledge source in this case - clearly fits with the definition of this property Sierra provided above.

  4. There is a proposed property called source_record_url that might be a better fit than biolink:iri, to capture the url of the source CIViC record. It is pending review and defined here

  5. Re: the attributes using the biolink:qualifiers property to capture clinical significance, variant, and tier - as we finish defining our approach to the use of qualifiers and modifiers to capture richer association semantics, we will define more precise qualifier properties to capture these pieces of information, and the contribution they make to the knowledge asserted in an edge. I'd love to dig into the semantics of what these qualifiers hold on the next EPC call so we can refine the modeling here.
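
Point 2 above can be checked mechanically. A minimal Python sketch, assuming attributes are plain dicts and that every knowledge-source value should be an `infores:` CURIE; the function name `find_non_infores_sources` is hypothetical:

```python
KNOWLEDGE_SOURCE_TYPES = {
    "biolink:primary_knowledge_source",
    "biolink:original_knowledge_source",
    "biolink:aggregator_knowledge_source",
}


def find_non_infores_sources(attributes):
    """Return (index, value) pairs for knowledge-source attributes whose value
    is not an infores CURIE, e.g. a PMID that belongs in biolink:publications."""
    problems = []
    for index, attr in enumerate(attributes):
        if attr.get("attribute_type_id") in KNOWLEDGE_SOURCE_TYPES:
            value = str(attr.get("value", ""))
            if not value.startswith("infores:"):
                problems.append((index, value))
    return problems


original_attrs = [
    {"attribute_type_id": "biolink:primary_knowledge_source", "value": "pmid:23779130"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source", "value": "infores:civic"},
    {"attribute_type_id": "biolink:qualifiers", "value": "Resistance"},
]
problems = find_non_infores_sources(original_attrs)
```

Run against the first attribute array in this thread, a check like this would flag only the PMID entry; qualifiers and other non-source attributes are ignored.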

brettasmi commented 2 years ago

@mbrush to answer your question

Question: What are the plain language semantics of the assertions represented in Edges based on CIViC records? And what SPO structure is used to capture this? I gather from above that they relate a subject gene to an object chemical/drug? Using what predicate?

In plain language, this might be written as "This medication affects the expression of a mutant gene," where "mutant gene" refers to a specific variant that is specified in another attribute in the array above. This is currently being returned as SmallMolecule - process_regulates_process -> Gene which thus far is an uncaught bug, so thanks for making me check :). It should be SmallMolecule - entity_regulates_entity -> Gene or something more specific in the regulates hierarchy.

Thank you for your other comments!

brettasmi commented 2 years ago

Fixed.

Thanks to @sierra-moxon and @mbrush for all of the guidance!

mbrush commented 2 years ago

In plain language, this might be written as "This medication affects the expression of a mutant gene,"

Hi @brettasmi (also tagging @sierra-moxon) - I'd like to revisit the semantics you assign to the CIViC data in your associations, as I think it may not accurately represent the statements made in this knowledgebase. Specifically, the statement made in the record you linked above (https://civicdb.org/events/genes/67/summary/variants/175/summary/evidence/447/summary#evidence) describes a variant's association with resistance or sensitivity to treatment with a medication/drug. From this, you might infer a Gene to Drug edge where the semantics are "Gene X has variants associated with sensitivity/resistance to Drug Y". But I don't think you would infer that "The medication affects the expression of a mutant gene", as you seem to be doing above.

The timing for revisiting this is perfect, as we are in ongoing discussions with the Multiomics Provider about their need to represent associations that also describe variant impact on drug resistance/sensitivity. But their associations are based strictly on HTP cell line screening experiments - as opposed to the statements in CIViC that make similar assertions, but are based mainly on clinical trials data.

This is a very interesting combination of use cases that will highlight some useful requirements for modeling association semantics, and EPC metadata! Hoping we can discuss it on a call soon.

brettasmi commented 2 years ago

@mbrush

Thanks for your comments. I am not an expert in SPOKE's data and commented beyond my expertise on the plain English meaning of the CIVIC data, sorry.

Indeed, SPOKE's curators define those relationships as "Compound-AFFECTS-MutantGene" which is a bit higher level and does not specify anything about expression. However, when we had to map our internal relationships to the existing predicates in biolink, we chose "regulates" as the closest possible predicate at the time.

It sounds like a re-evaluation is necessary and the time is right to get something more accurate into biolink. I will defer to @karthiksoman, who is much better with the modeling of SPOKE than I am.

I suggest we close this issue and open a new one (here or somewhere more appropriate), as this discussion is now a bit out of scope from the PMID in source provenance attributes.