Add grounding for major entity types using EXTRACT 2.0 API

JohnGiorgi commented 5 years ago

This pull request implements grounding/entity linking for the major entity classes (Chemicals/Drugs, Disease/Disorder, Species/Living beings, and Proteins/Genes) using the EXTRACT 2.0 API. It replaces the previous grounding system, which only worked for protein/gene entities.

I tried to model the output format used by REACH as closely as possible. Grounding adds a new field, xrefs, to each item in ents in the output JSON returned by Saber.annotate(). For example:

Without grounding

{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
  "ents": [
    {
      "start": 23,
      "end": 27,
      "text": "Hdm2",
      "label": "PRGE"
    },
    {
      "start": 31,
      "end": 34,
      "text": "MK2",
      "label": "PRGE"
    },
    {
      "start": 66,
      "end": 69,
      "text": "p53",
      "label": "PRGE"
    }
  ]
}

With grounding

{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
  "ents": [
    {
      "start": 23,
      "end": 27,
      "text": "Hdm2",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000258149",
          "organism-id": "9606"
        },
        {
          "namespace": "STRING",
          "id": "ENSP00000410769",
          "organism-id": "9606"
        }
      ]
    },
    {
      "start": 31,
      "end": 34,
      "text": "MK2",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000356070",
          "organism-id": "9606"
        },
        {
          "namespace": "STRING",
          "id": "ENSP00000433109",
          "organism-id": "9606"
        }
      ]
    },
    {
      "start": 66,
      "end": 69,
      "text": "p53",
      "label": "PRGE",
      "xrefs": [
        {
          "namespace": "STRING",
          "id": "ENSP00000269305",
          "organism-id": "9606"
        }
      ]
    }
  ]
}

Here, namespace is the external resource the entity is grounded to, and id is the unique identifier within that resource. organism-id is present only for PRGE entities and currently always defaults to 9606 (the NCBI Taxonomy identifier for human). The EXTRACT API docs state that a feature is in development that will automatically detect the organism-id for each protein/gene mention. When/if that is implemented, I will work it into Saber.
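As a rough illustration of how a caller might consume the new field, the sketch below walks a dict shaped like the "With grounding" output above and collects the cross-references per mention. The helper name collect_xrefs is hypothetical and not part of Saber; the sketch also assumes that xrefs is simply missing when an entity is not grounded, as in the "Without grounding" example.

```python
# Data copied from the "With grounding" example above (two of the three entities).
annotation = {
    "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.",
    "ents": [
        {
            "start": 23, "end": 27, "text": "Hdm2", "label": "PRGE",
            "xrefs": [
                {"namespace": "STRING", "id": "ENSP00000258149", "organism-id": "9606"},
                {"namespace": "STRING", "id": "ENSP00000410769", "organism-id": "9606"},
            ],
        },
        {
            "start": 66, "end": 69, "text": "p53", "label": "PRGE",
            "xrefs": [
                {"namespace": "STRING", "id": "ENSP00000269305", "organism-id": "9606"},
            ],
        },
    ],
}


def collect_xrefs(annotation):
    """Map each mention's text to a list of (namespace, id, organism-id) tuples.

    Hypothetical helper for illustration only; not part of Saber.
    """
    grounded = {}
    for ent in annotation["ents"]:
        # Assumption: `xrefs` is absent when the entity was not grounded.
        refs = ent.get("xrefs", [])
        grounded.setdefault(ent["text"], []).extend(
            # `organism-id` only appears on PRGE entities, so use .get()
            (ref["namespace"], ref["id"], ref.get("organism-id")) for ref in refs
        )
    return grounded


grounded = collect_xrefs(annotation)
for mention, refs in grounded.items():
    print(mention, refs)
```

Note that start and end are character offsets into text with an exclusive end, so annotation["text"][23:27] recovers "Hdm2".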

Usage

Usage is the same as before. Grounding is off by default because it makes annotation slightly slower; pass ground=True to enable it:

from saber import Saber

saber = Saber()
saber.load('PRGE')
saber.annotate('The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.', ground=True)

See the docs for more info.

Issues Closed

Closes #23.

JohnGiorgi commented 5 years ago

TODO

[x] Figure out namespace field.
[x] Rewrite this so that it makes one request per entity type rather than one request per entity. That should dramatically speed things up.
[x] Add a try/except block to allow annotation to proceed even if grounding fails.
[x] Lint the code.
[x] Sanity check that I didn't break any existing features. Write more unit tests.
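
The batching and failure-handling items above could look roughly like the sketch below. It is not the code in this PR: ground_by_label and the ground_fn callable are hypothetical names, and ground_fn stands in for whatever performs the remote lookup, so no EXTRACT request format is assumed here.

```python
from collections import defaultdict


def ground_by_label(ents, ground_fn):
    """Attach `xrefs` to entities using one `ground_fn` call per entity label.

    `ground_fn(label, texts)` is a hypothetical callable standing in for the
    remote lookup. It is assumed to return a dict mapping each mention text
    to a list of xref dicts.
    """
    # Group entities so that each label needs only a single lookup call.
    by_label = defaultdict(list)
    for ent in ents:
        by_label[ent["label"]].append(ent)

    for label, group in by_label.items():
        # De-duplicate mentions so repeated text is only looked up once.
        texts = sorted({ent["text"] for ent in group})
        try:
            results = ground_fn(label, texts)
        except Exception:
            # Grounding is best-effort: leave this group ungrounded and move on.
            # Real code would likely catch narrower, network-related errors.
            continue
        for ent in group:
            if ent["text"] in results:
                ent["xrefs"] = results[ent["text"]]
    return ents


# Stub lookup used only to show the call pattern; the returned id is a placeholder.
calls = []


def fake_lookup(label, texts):
    calls.append((label, texts))
    return {text: [{"namespace": "STRING", "id": "placeholder"}] for text in texts}


ents = [
    {"start": 23, "end": 27, "text": "Hdm2", "label": "PRGE"},
    {"start": 31, "end": 34, "text": "MK2", "label": "PRGE"},
    {"start": 66, "end": 69, "text": "p53", "label": "PRGE"},
]
ground_by_label(ents, fake_lookup)
print(len(calls))  # number of lookup calls made for the three PRGE entities
```

Passing the lookup in as a callable keeps the grouping logic testable without network access, and a failed lookup leaves the affected entities exactly as they were before grounding.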

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 319

90 of 96 (93.75%) changed or added relevant lines in 11 files are covered.
3 unchanged lines in 1 file lost coverage.
Overall coverage remained the same at 80.854%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
saber/utils/app_utils.py	0	1	0.0%
saber/saber.py	7	12	58.33%
Total:	90	96	93.75%

Files with Coverage Reduction	New Missed Lines	%
saber/saber.py	3	71.22%
Total:	3

Totals
Change from base Build 299:	0.0%
Covered Lines:	1799
Relevant Lines:	2225

BaderLab / saber

Add grounding for major entity types using EXTRACT 2.0 API #119
