NCATSTranslator / Text-Mining-Provider-Roadmap

Roadmap and issue tracking for the NCATS Translator Text Mining Provider

Proposal for EPC representation for the text-mined association KG #78

Open bill-baumgartner opened 3 years ago

bill-baumgartner commented 3 years ago

For each text-mined Biolink association, we would like to provide relevant EPC data including:

The goal of this issue is to discuss how to represent the EPC data using the Attribute object defined in the TRAPI specification.

An initial proposal for Attribute representation is available in this document.

The proposal in this issue builds on the original and specifically addresses the need to group EPC data into individual packets, each containing a sentence and its related information, so that multiple EPC packets can be associated with a single assertion.

Data for a text-mined assertion

"graph": {
          "nodes": [
            {
              "id": "n0",
              "type": "biolink:ChemicalSubstance",
              "curie": "CHEBI:3215"      # bupivacaine
            },
            {
              "id": "n1",
              "type": "biolink:GeneOrGeneProduct",
              "curie": "PR:000031567"    # LRRC3B 
            }
          ],
          "edges": [
            {
              "id": "e0",
              "source_id": "n0",
              "target_id": "n1",
              "type": "biolink:negatively_regulates_entity_to_entity"
            }
          ]
     }
# This assertion is supported by two sentences in the literature
      {
        'publication': 'PMID:29085514', 
        'score': '0.99956816', 
        'sentence': 'The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.', 
        'subject_spans': 'start: 31, end: 42', 
        'object_spans': 'start: 104, end: 110', 
        'provided_by': 'TMProvider'
      }

      {
        'publication': 'PMID:12345678', 
        'score': '0.876', 
        'sentence': 'This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.', 
        'subject_spans': 'start: 42, end: 53', 
        'object_spans': 'start: 75, end: 81', 
        'provided_by': 'TMProvider'
      }
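
As a sanity check on the span offsets in the packets above, here is a minimal Python sketch that slices each sentence by its spans. It assumes the offsets are zero-based, end-exclusive, and counted in Unicode code points, which is consistent with the example data but is not stated in the issue. The helper names `parse_span` and `mention` are hypothetical.

```python
import re
from typing import Tuple


def parse_span(span: str) -> Tuple[int, int]:
    """Parse a 'start: N, end: M' string into (start, end) integers."""
    match = re.fullmatch(r"start: (\d+), end: (\d+)", span.strip())
    if match is None:
        raise ValueError(f"unrecognized span format: {span!r}")
    return int(match.group(1)), int(match.group(2))


def mention(sentence: str, span: str) -> str:
    """Return the text covered by `span` (assumed zero-based, end-exclusive)."""
    start, end = parse_span(span)
    return sentence[start:end]


# First evidence packet from the example above (fields copied verbatim).
packet = {
    "publication": "PMID:29085514",
    "sentence": (
        "The administration of 50 µg/ml bupivacaine promoted maximum breast "
        "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
    ),
    "subject_spans": "start: 31, end: 42",
    "object_spans": "start: 104, end: 110",
}

subject_mention = mention(packet["sentence"], packet["subject_spans"])
object_mention = mention(packet["sentence"], packet["object_spans"])
print(subject_mention, object_mention)  # bupivacaine LRRC3B
```

Under these assumptions the subject span recovers the chemical mention and the object span recovers the gene mention, so a consumer could use the same check to validate packets before display.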

Proposed Attribute representation

The proposed Attribute representation models this assertion as a single edge between bupivacaine and LRRC3B with two accompanying Attributes representing the EPC data. Nested Attributes are used to allow each packet of sentence information to be self-contained. Also demonstrated are attributes representing a confidence score for the concept recognition of each node (concept), and an aggregate confidence score computed for each edge.

nodes:
  - id: CHEBI:3215
    category: biolink:ChemicalSubstance
    name: "bupivacaine"
    attributes:
        - attribute_type_id: SEPIO:0000168  # confidence_score
          attribute_from_source: "has confidence score"
          value: 0.7578
          value_type_id: biolink:ConfidenceLevel
          value_type_from_source: "confidence score"
          value_source: TMProvider

  - id: PR:000031567
    category: biolink:GeneOrGeneProduct
    name: "LRRC3B"
    attributes:
        - attribute_type_id: SEPIO:0000168  # confidence_score
          attribute_from_source: "has confidence score"
          value: 0.5467
          value_type_id: biolink:ConfidenceLevel
          value_type_from_source: "confidence score"
          value_source: TMProvider

edges: 
  - id: tmkp.Association001
    category: biolink:ChemicalToGeneAssociation
    subject: CHEBI:3215          # bupivacaine
    predicate: biolink:negatively_regulates_entity_to_entity
    object: PR:000031567       # LRRC3B 
    attributes:

    - attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
      attribute_from_source: "source publication"    # what the source might have called the relationship
      value: PMID:29085514
      value_type_id: biolink:Publication          # here a biolink term is used to type the value.
      value_type_from_source: "PMID"
      value_source: TMProvider
      attributes:
          - attribute_type_id: SIO:000028  # has part
            value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
            value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'
            value_type_from_source: sentence text
            attributes:
                 - attribute_type_id: SIO:000028  # has part
                   value: '31|42'
                   value_type_id: SIO:001056 # character position
                   value_type_from_source: subject span
                 - attribute_type_id: SIO:000028  # has part
                   value: '104|110'
                   value_type_id: SIO:001056 # character position
                   value_type_from_source: object span
                 - attribute_type_id: SEPIO:0000440  # has_supporting_evidence
                   value: 0.99956816
                   value_type_id: EDAM:data_1772     # score
                   value_type_from_source: sentence confidence score
                   value_source: TMProvider BERT model v0.1

    - attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
      attribute_from_source: "source publication"    # what the source might have called the relationship
      value: PMID:12345678
      value_type_id: biolink:Publication          # here a biolink term is used to type the value.
      value_type_from_source: "PMID"
      value_source: TMProvider
      attributes:
          - attribute_type_id: SIO:000028  # has part
            value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
            value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'
            value_type_from_source: sentence text
            attributes:
                 - attribute_type_id: SIO:000028  # has part
                   value: '42|53'
                   value_type_id: SIO:001056 # character position
                   value_type_from_source: subject span
                 - attribute_type_id: SIO:000028  # has part
                   value: '75|81'
                   value_type_id: SIO:001056 # character position
                   value_type_from_source: object span
                 - attribute_type_id: SEPIO:0000440  # has_supporting_evidence
                   value: 0.876
                   value_type_id: EDAM:data_1772     # score
                   value_type_from_source: sentence confidence score
                   value_source: TMProvider BERT model v0.1

    - attribute_type_id: SEPIO:0000168  # confidence_score
      attribute_from_source: "has aggregate confidence score"
      value: 0.64711234
      value_type_id: biolink:ConfidenceLevel
      value_type_from_source: "aggregate confidence score"
      value_source: TMProvider
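
To make the nesting concrete, the following Python sketch builds one self-contained EPC packet as nested attribute dictionaries mirroring the YAML above. It is only an illustration: the function names `span_attribute` and `evidence_attribute` and the input packet layout (spans as integer pairs) are assumptions, not part of TRAPI or of the Text Mining Provider code.

```python
from typing import Dict, Tuple


def span_attribute(span: Tuple[int, int], label: str) -> Dict:
    """Build a 'has part' attribute holding a character span as 'start|end'."""
    return {
        "attribute_type_id": "SIO:000028",      # has part
        "value": f"{span[0]}|{span[1]}",
        "value_type_id": "SIO:001056",          # character position
        "value_type_from_source": label,
    }


def evidence_attribute(packet: Dict) -> Dict:
    """Nest sentence, spans, and score under a single publication attribute."""
    sentence_attr = {
        "attribute_type_id": "SIO:000028",      # has part
        "value": packet["sentence"],
        "value_type_id": "EDAM:data_3671",      # text
        "value_type_from_source": "sentence text",
        "attributes": [
            span_attribute(packet["subject_span"], "subject span"),
            span_attribute(packet["object_span"], "object span"),
            {
                "attribute_type_id": "SEPIO:0000440",  # has_supporting_evidence
                "value": packet["score"],
                "value_type_id": "EDAM:data_1772",     # score
                "value_type_from_source": "sentence confidence score",
                "value_source": "TMProvider BERT model v0.1",
            },
        ],
    }
    return {
        "attribute_type_id": "SEPIO:0000438",   # has_supporting_evidence_from_source
        "attribute_from_source": "source publication",
        "value": packet["publication"],
        "value_type_id": "biolink:Publication",
        "value_type_from_source": "PMID",
        "value_source": "TMProvider",
        "attributes": [sentence_attr],
    }


# Hypothetical input layout for the second example sentence.
packet = {
    "publication": "PMID:12345678",
    "sentence": "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.",
    "subject_span": (42, 53),
    "object_span": (75, 81),
    "score": 0.876,
}
attribute = evidence_attribute(packet)
```

Because each packet becomes one top-level edge attribute, adding a further supporting sentence means appending one more such dictionary to the edge's `attributes` list.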

bill-baumgartner commented 3 years ago

For comparison purposes, shown below is an alternative approach that uses no nesting of Attributes and instead uses arrays to specify attribute values. For a given EPC packet, the sentence, score, subject and object spans, and PMID are connected only implicitly: they share the same array index.

Note: This is the current output format used by the Service Provider to serve up the Text Mining Provider text-mined Biolink association KG.

edges:
  - id: 9445e98f72ada21aa572559e303e4d5ac414650f
    predicate: biolink:negatively_regulates
    subject: CHEBI:3215          # bupivacaine
    object: PR:000031567       # LRRC3B
    attributes:
      - type: biolink:provided_by
        name: provided_by
        value: Text Mining KP
      - type: bts:api
        name: api
        value: Text Mining Targeted Association API
      - type: bts:score
        name: score
        value: 
          - 0.99956816
          - 0.876
      - type: bts:sentence
        name: sentence
        value: 
          - "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
          - "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
      - type: bts:subject_spans
        name: subject_spans
        value: 
          - "31|42"
          - "42|53"
      - type: bts:object_spans
        name: object_spans
        value: 
          - "104|110"
          - "75|81"
      - type: bts:publications
        name: publications
        value: 
          - PMID:29085514
          - PMID:12345678
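
For a consumer of the array-based format above, reassembling the EPC packets means zipping the parallel arrays by index. The Python sketch below shows one way this might be done; the function name `regroup_packets` is hypothetical, and the length check reflects the assumption that every array must have one entry per supporting sentence.

```python
from typing import Dict, List

# Attribute names, taken from the example above, whose values are parallel arrays.
PACKET_FIELDS = ["publications", "sentence", "score", "subject_spans", "object_spans"]


def regroup_packets(attributes: List[Dict]) -> List[Dict]:
    """Rebuild per-sentence packets from index-aligned attribute arrays."""
    by_name = {attr["name"]: attr["value"] for attr in attributes}
    columns = [by_name[field] for field in PACKET_FIELDS]
    # The format links values only by position, so mismatched lengths are an error.
    if len({len(column) for column in columns}) != 1:
        raise ValueError("attribute arrays have different lengths")
    return [dict(zip(PACKET_FIELDS, row)) for row in zip(*columns)]


edge_attributes = [
    {"type": "biolink:provided_by", "name": "provided_by", "value": "Text Mining KP"},
    {"type": "bts:score", "name": "score", "value": [0.99956816, 0.876]},
    {"type": "bts:sentence", "name": "sentence", "value": [
        "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.",
        "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.",
    ]},
    {"type": "bts:subject_spans", "name": "subject_spans", "value": ["31|42", "42|53"]},
    {"type": "bts:object_spans", "name": "object_spans", "value": ["104|110", "75|81"]},
    {"type": "bts:publications", "name": "publications", "value": ["PMID:29085514", "PMID:12345678"]},
]

packets = regroup_packets(edge_attributes)
```

The sketch also highlights the fragility of this format: if any one array is reordered or truncated, the packets silently pair the wrong sentence with the wrong publication unless a check like the one above is in place.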