BaderLab / saber

Saber is a deep-learning based tool for information extraction in the biomedical domain. Pull requests are welcome! Note: this is a work in progress. Many things are broken, and the codebase is not stable.
https://baderlab.github.io/saber/
MIT License

Add grounding for major entity types using EXTRACT 2.0 API #119

Closed JohnGiorgi closed 5 years ago

JohnGiorgi commented 5 years ago

This pull request implements grounding/entity linking for the major entity classes (Chemicals/Drugs, Disease/Disorder, Species/Living beings, and Proteins/Genes) using the EXTRACT 2.0 API. This is used in place of the grounding system we had previously, which only worked for protein/gene entities.

I tried to model the output format used by REACH as closely as possible. Grounding adds a new field, xrefs, to each item in ents in the output JSON returned by Saber.annotate(). For example:

Without grounding

{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
  "ents": [
    {
      "start": 23,
      "end": 27,
      "text": "Hdm2",
      "label": "PRGE"
    },
    {
      "start": 31,
      "end": 34,
      "text": "MK2",
      "label": "PRGE"
    },
    {
      "start": 66,
      "end": 69,
      "text": "p53",
      "label": "PRGE"
    }
  ]
}
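For anyone consuming this output, here is a minimal sketch of how the start and end fields can be read. It assumes they are character offsets into text with an exclusive end, which is what the example above suggests. The helper name check_spans is hypothetical and not part of Saber.

```python
import json

# Sample copied from the "Without grounding" output above (trimmed to two entities).
annotation = json.loads("""
{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
  "ents": [
    {"start": 23, "end": 27, "text": "Hdm2", "label": "PRGE"},
    {"start": 66, "end": 69, "text": "p53", "label": "PRGE"}
  ]
}
""")


def check_spans(annotation):
    """Return the entities whose text[start:end] slice differs from their `text` field.

    Assumption: offsets are 0-based character positions and `end` is exclusive.
    """
    text = annotation["text"]
    return [
        ent for ent in annotation["ents"]
        if text[ent["start"]:ent["end"]] != ent["text"]
    ]


# An empty list means every span lines up with its surface text.
mismatches = check_spans(annotation)
print(mismatches)
```

A check like this is cheap to run before and after enabling grounding, since grounding should only add xrefs and leave the spans untouched.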

With grounding

{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
  "ents": [
    {
      "start": 23,
      "end": 27,
      "text": "Hdm2",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000258149",
          "organism-id": "9606"
        },
        {
          "namespace": "STRING",
          "id": "ENSP00000410769",
          "organism-id": "9606"
        }
      ]
    },
    {
      "start": 31,
      "end": 34,
      "text": "MK2",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000356070",
          "organism-id": "9606"
        },
        {
          "namespace": "STRING",
          "id": "ENSP00000433109",
          "organism-id": "9606"
        }
      ]
    },
    {
      "start": 66,
      "end": 69,
      "text": "p53",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000269305",
          "organism-id": "9606"
        }
      ]
    }
  ]
}

Here, namespace is the external resource the entity is grounded to, and id is the entity's unique identifier in that resource. The organism-id field appears only on PRGE entities and currently defaults to 9606 (the NCBI Taxonomy ID for human). The EXTRACT API docs state that a feature is in development that will automatically detect the organism-id for each protein/gene mention. If and when that is implemented, I will work it into Saber.
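As an illustration of how the xrefs field might be consumed downstream, here is a minimal sketch that flattens the grounded output into one row per cross-reference. It assumes the output has already been parsed into a Python dict (for example with json.loads, if it comes back as a string). The helper name collect_xrefs is hypothetical and not part of Saber.

```python
def collect_xrefs(annotation):
    """Flatten grounded output into (entity text, namespace, id, organism-id) tuples.

    Entities without an `xrefs` field (grounding off, or no match found) are
    skipped. `organism-id` is None for cross-references that do not carry it,
    since the description above says it only appears on PRGE entities.
    """
    rows = []
    for ent in annotation.get("ents", []):
        for xref in ent.get("xrefs", []):
            rows.append(
                (ent["text"], xref["namespace"], xref["id"], xref.get("organism-id"))
            )
    return rows


# Sample built from the "With grounding" output above, plus one ungrounded entity.
sample = {
    "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
    "ents": [
        {"start": 31, "end": 34, "text": "MK2", "label": "PRGE"},
        {
            "start": 66,
            "end": 69,
            "text": "p53",
            "label": "PRGE",
            "xrefs": [
                {"namespace": "STRING", "id": "ENSP00000269305", "organism-id": "9606"}
            ],
        },
    ],
}

for row in collect_xrefs(sample):
    print(row)
```

Using .get() for xrefs and organism-id keeps the same code working whether or not grounding was enabled for a given call.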

Usage

Usage is the same as before. Grounding is off by default because it makes annotation slightly slower. To enable it, pass ground=True to annotate():

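from saber import Saber

saber = Saber()
# Load the pre-trained protein/gene (PRGE) model
saber.load('PRGE')
# Annotate the text; ground=True adds an xrefs field to each entity
saber.annotate('The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.', ground=True)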

See the docs for more info.

Issues Closed

Closes #23.

JohnGiorgi commented 5 years ago

TODO

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 319


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| saber/utils/app_utils.py | 0 | 1 | 0.0% |
| saber/saber.py | 7 | 12 | 58.33% |
| Total: | 90 | 96 | 93.75% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| saber/saber.py | 3 | 71.22% |
| Total: | 3 | |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 299: | 0.0% |
| Covered Lines: | 1799 |
| Relevant Lines: | 2225 |

💛 - Coveralls