Create new API for pubtator3

Website: https://www.ncbi.nlm.nih.gov/research/pubtator3/ FTP: https://www.ncbi.nlm.nih.gov/research/pubtator3/ (We are most interested in relation2pubtator3.gz)

Pubtator3 is the latest iteration of pubtator from Zhiyong Lu's group at NCBI. It covers all 35+ million abstracts in PubMed and nearly 6 million full-text articles in the PMC Text Mining subset, resulting in 1.6 billion entity annotations and 33 million extracted relations (8.8 million unique pairs of entities).

Let's try to use the same structure as we did for the semmeddb API, e.g., https://biothings.transltr.io/semmeddb/association/C0040077-STIMULATES-C0076591

{
  "_id": "C0040077-STIMULATES-C0076591",
  "_version": 1,
  "object": {
    "name": "thymidylate synthase-dihydrofolate reductase",
    "novelty": 1,
    "semantic_type_abbreviation": "gngm",
    "semantic_type_name": "Gene or Genome",
    "umls": "C0076591"
  },
  "pmid_count": 1,
  "predicate": "STIMULATES",
  "predication": [
    {
      "object_score": 1000,
      "object_text": "dhfr-ts",
      "pmid": 7479765,
      "predication_id": 107061205,
      "sentence": "Survival and replication of dhfr-ts- in macrophages in vitro were dependent upon thymidine, with parasites differentiating into amastigotes prior to destruction. dhfr-ts- parasites persisted in BALB/c mice for up to 2 months, declining with a half-life of 2-3 days.",
      "sentence_id": 87288544,
      "subject_score": 1000,
      "subject_text": "thymidine"
    }
  ],
  "predication_count": 1,
  "subject": {
    "name": "Thymidine",
    "novelty": 1,
    "semantic_type_abbreviation": "bacs",
    "semantic_type_name": "Biologically Active Substance",
    "umls": "C0040077"
  }
}

NOTE that pubtator 3 also has an API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api, but their usage restrictions mean we should just set up our own...

We perhaps need a better place to document this best-practice, but per the guidelines at https://github.com/biothings/biothings_explorer/blob/main/docs/README-contributing-new-data-source.md, let's add a link to the parser code and a link to an API call with an example record to this issue. @ctrl-schaff can I ask you to handle this please?

Sure, no problem.

For this plugin the parsing code can be found at https://github.com/biothings/pending.api/tree/master/plugins/pubtator3

Generated API call: https://biothings.ci.transltr.io/pubtator3/association/11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534

Generated Result:

{
  "_id": "11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534",
  "_version": 1,
  "object": {
    "identifier": {
      "key": "MESH",
      "value": "D008579"
    },
    "semantic_type_name": "Disease"
  },
  "pmid": 11270550,
  "pmid_count": 1,
  "predicate": "ASSOCIATE",
  "predication_count": 1,
  "subject": {
    "identifier": {
      "key": null,
      "value": "57534"
    },
    "semantic_type_name": "Gene"
  }
}

This plugin is currently deployed on the CI environment so feel free to test it there for more data samples.

Chunlei and I already discussed modifying this structure so we can eliminate the PMID value from the _id field. The data provided by pubtator contains a fair number of duplicates, which is why I included the PMID in the _id field in the first place: it meant far fewer entries were ignored while parsing. This also exposed a discrepancy between our sqlite3 and mongodb merging backends, which I'm currently fixing. I'll update the structure of this plugin once the fix is ready to test with both backends. If you have any other suggestions or issues with the data structure please let me know @andrewsu

There's a layer of aggregation that needs to be added to the parser. Consider this set of records linking D007037 to D008713: https://biothings.ci.transltr.io/pubtator3/query?q=object.identifier.value:D008713%20AND%20subject.identifier.value:D007037&facets=predicate

There are 433 total records joining these terms: 383 using the cause predicate, 49 using the treat predicate, and 1 using the associate predicate. So these 433 original records in pubtator3 should be collapsed into three records in our API, one per predicate, with roughly this structure:

    "hits": [
        {
            "_id": "Chemical|MESH:D008713-CAUSE-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [729631,17161219,20808432,15820614,...],
            "pmid_count": 381,
            "predicate": "CAUSE",
            "predication_count": 383,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        },
        {
            "_id": "Chemical|MESH:D008713-TREAT-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [26214210,26799350,23337033,2552340,...],
            "pmid_count": 49,
            "predicate": "TREAT",
            "predication_count": 49,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        },
        {
            "_id": "Chemical|MESH:D008713-ASSOCIATE-Disease|MESH:D007037",
            "_score": 16.60503,
            "object": {
                "identifier": {
                    "key": "MESH",
                    "value": "D008713"
                },
                "semantic_type_name": "Chemical"
            },
            "pmid": [37931916],
            "pmid_count": 1,
            "predicate": "ASSOCIATE",
            "predication_count": 1,
            "subject": {
                "identifier": {
                    "key": "MESH",
                    "value": "D007037"
                },
                "semantic_type_name": "Disease"
            }
        }
    ]

Note that predication_count is the number of original records sharing the same subject.identifier.value - predicate - object.identifier.value triple, pmid is the list of unique PMIDs across those predications, and pmid_count is the length of that list. Let me know if you have any questions!
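To make the aggregation concrete, here is a minimal Python sketch, not the actual parser code. It assumes each input row is a `(pmid, predicate, subject, object)` tuple already split from a line of the relations file; the function name `aggregate_relations` and the exact output fields are illustrative only.

```python
from collections import defaultdict


def aggregate_relations(rows):
    """Collapse per-PMID rows into one record per (subject, predicate, object) triple.

    `rows` is an iterable of (pmid, predicate, subject, object) tuples.
    This input shape is an assumption made for illustration.
    """
    grouped = defaultdict(lambda: {"pmids": set(), "count": 0})
    for pmid, predicate, subject, obj in rows:
        key = (subject, predicate.upper(), obj)
        grouped[key]["pmids"].add(int(pmid))  # unique PMIDs only
        grouped[key]["count"] += 1            # every original record counts

    for (subject, predicate, obj), group in grouped.items():
        yield {
            "_id": f"{subject}-{predicate}-{obj}",
            "predicate": predicate,
            "pmid": sorted(group["pmids"]),
            "pmid_count": len(group["pmids"]),
            "predication_count": group["count"],
        }
```

Because PMIDs are collected in a set while records are counted individually, pmid_count can be smaller than predication_count, as in the CAUSE example above (381 vs. 383).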

... and adding two other tweaks to the parser. As always, let me know if any clarifications are needed...

1. add `name` field for diseases, chemicals, and genes

All chemicals use MESH IDs. Those can be resolved to names using mychem.info, e.g., https://mychem.info/v1/query?q=umls.mesh:C579720. The name can be drawn from this list of fields in the JSON (in order of priority):

chebi.name
chembl.pref_name
drugbank.name
unii.display_name
umls.name

Diseases either use MESH (e.g., D015179 which can be searched using https://mydisease.info/v1/query?q=disease_ontology.xrefs.mesh:D015179%20OR%20umls.mesh.preferred:D015179%20OR%20ctd.mesh:D015179%20OR%20mondo.xrefs.mesh:D015179) or OMIM (e.g., 610251 which can be searched by https://mydisease.info/v1/query?q=mondo.xrefs.omim:610251%20OR%20hpo.omim:610251%20OR%20ctd.omim:610251). The name can be drawn from this list of fields in the JSON:

mondo.label
disease_ontology.name
hpo.disease_name

Genes are always specified by the NCBI Gene ID, resolved using https://mygene.info/v3/gene/1017. The name can be drawn from this list:

symbol
name
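The name lookup described above could be sketched roughly as follows. This is only an illustration: it assumes `doc` is a single hit already fetched from mychem.info, mydisease.info, or mygene.info, that the listed order is the priority order for all three entity types, and that a field may come back as a list (in which case the first element is used). The helper names `get_nested` and `pick_name` are hypothetical.

```python
# Field lists taken from this issue, in the order they are listed.
CHEMICAL_NAME_FIELDS = [
    "chebi.name",
    "chembl.pref_name",
    "drugbank.name",
    "unii.display_name",
    "umls.name",
]
DISEASE_NAME_FIELDS = ["mondo.label", "disease_ontology.name", "hpo.disease_name"]
GENE_NAME_FIELDS = ["symbol", "name"]


def get_nested(doc, dotted_path):
    """Follow a dotted path through nested dicts, taking the first item of any list."""
    current = doc
    for part in dotted_path.split("."):
        if isinstance(current, list):
            current = current[0] if current else None
        if not isinstance(current, dict):
            return None
        current = current.get(part)
    if isinstance(current, list):
        current = current[0] if current else None
    return current


def pick_name(doc, fields):
    """Return the first non-empty string found among `fields`, or None."""
    for path in fields:
        value = get_nested(doc, path)
        if isinstance(value, str) and value:
            return value
    return None
```

For a gene hit such as `{"symbol": "CDK2", "name": "cyclin dependent kinase 2"}`, `pick_name(hit, GENE_NAME_FIELDS)` would return the symbol because it is listed first.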

2. ignore `CorrespondingGene` identifiers

Consider this set of relations from the pubtator3 relations file:

$ gzip -cd relation2pubtator3.gz | grep 33847607
33847607        associate       DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524   Disease|MESH:D053713
33847607        associate       Disease|MESH:D053713    Gene|4524
33847607        cause   DNAMutation|RS#:1801133;HGVS:c.677C>T;CorrespondingGene:4524    Disease|MESH:D053713

The corresponding records are here: https://biothings.ci.transltr.io/pubtator3/query?q=pmid:33847607, one of which is pasted below:

{
  "_id": "33847607-DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524-ASSOCIATE-Disease|MESH:D053713",
  "_score": 1,
  "object": [
    {
      "identifier": {
        "key": "RS#",
        "value": "1801131"
      },
      "semantic_type_name": "DNAMutation"
    },
    {
      "identifier": {
        "key": "HGVS",
        "value": "c.1298A>C"
      },
      "semantic_type_name": "DNAMutation"
    },
    {
      "identifier": {
        "key": "CorrespondingGene",
        "value": "4524"
      },
      "semantic_type_name": "DNAMutation"
    }
  ],
  "pmid": 33847607,
  "pmid_count": 1,
  "predicate": "ASSOCIATE",
  "predication_count": 1,
  "subject": {
    "identifier": {
      "key": "MESH",
      "value": "D053713"
    },
    "semantic_type_name": "Disease"
  }
}

The identifier for CorrespondingGene:4524 should be removed, since there is already a separate record linking Gene|4524 to Disease|MESH:D053713. (Spot checking a few other examples, this redundancy appears to be universally true.)
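A rough sketch of how the parser could drop these identifiers, assuming entity strings always have the `Type|key:value;key:value` shape seen in the relations file above (with a bare value such as `Gene|4524` getting a null key, as in the current records). The function name `parse_entity` and the returned `identifiers` list are illustrative, not the plugin's actual structure.

```python
IGNORED_KEYS = {"CorrespondingGene"}


def parse_entity(entity):
    """Split an entity string into its type and identifiers, skipping ignored keys."""
    semantic_type, _, raw_ids = entity.partition("|")
    identifiers = []
    for token in raw_ids.split(";"):
        if not token:
            continue
        # Split on the first colon only, so values like "c.1298A>C" stay intact.
        key, sep, value = token.partition(":")
        if not sep:
            # Bare identifier with no prefix, e.g. "4524" in "Gene|4524".
            key, value = None, token
        if key in IGNORED_KEYS:
            continue
        identifiers.append({"key": key, "value": value})
    return {"semantic_type_name": semantic_type, "identifiers": identifiers}
```

With this, `DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524` would keep only the RS# and HGVS identifiers.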
