Open andrewsu opened 7 months ago
We perhaps need a better place to document this best practice, but per the guidelines at https://github.com/biothings/biothings_explorer/blob/main/docs/README-contributing-new-data-source.md, let's add a link to the parser code and a link to an API call with an example record to this issue. @ctrl-schaff can I ask you to handle this please?
Sure no problem
For this plugin the parsing code can be found at https://github.com/biothings/pending.api/tree/master/plugins/pubtator3
Generated API call: https://biothings.ci.transltr.io/pubtator3/association/11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534
Generated Result:
{
"_id": "11270550-Disease|MESH:D008579-ASSOCIATE-Gene|57534",
"_version": 1,
"object": {
"identifier": {
"key": "MESH",
"value": "D008579"
},
"semantic_type_name": "Disease"
},
"pmid": 11270550,
"pmid_count": 1,
"predicate": "ASSOCIATE",
"predication_count": 1,
"subject": {
"identifier": {
"key": null,
"value": "57534"
},
"semantic_type_name": "Gene"
}
}
This plugin is currently deployed on the CI environment so feel free to test it there for more data samples.
Chunlei and I already discussed modifying this structure so we can eliminate the PMID value from the _id field. The internal data provided by pubtator has a fair number of duplicates, which is why I included the PMID in the _id field in the first place: that way we skip far fewer entries while parsing. This exposed an inconsistency between our sqlite3 and mongodb merging backends, which I'm currently fixing. I'll modify the structure of this plugin once I have it ready to test with both. If you have any other suggestions or issues with the data structure please let me know @andrewsu
There's a layer of aggregation that needs to be added to the parser. Consider this set of records linking D007037 to D008713: https://biothings.ci.transltr.io/pubtator3/query?q=object.identifier.value:D008713%20AND%20subject.identifier.value:D007037&facets=predicate
There are 433 total records joining these terms: 383 using the cause predicate, 49 using the treat predicate, and 1 using the associate predicate. So these 433 original records in pubtator3 should be collapsed into three records in our API with roughly this structure:
"hits": [
{
"_id": "Chemical|MESH:D008713-CAUSE-Disease|MESH:D007037",
"_score": 16.60503,
"object": {
"identifier": {
"key": "MESH",
"value": "D008713"
},
"semantic_type_name": "Chemical"
},
"pmid": [729631,17161219,20808432,15820614,...],
"pmid_count": 381,
"predicate": "CAUSE",
"predication_count": 383,
"subject": {
"identifier": {
"key": "MESH",
"value": "D007037"
},
"semantic_type_name": "Disease"
}
},
{
"_id": "Chemical|MESH:D008713-TREAT-Disease|MESH:D007037",
"_score": 16.60503,
"object": {
"identifier": {
"key": "MESH",
"value": "D008713"
},
"semantic_type_name": "Chemical"
},
"pmid": [26214210,26799350,23337033,2552340,...],
"pmid_count": 49,
"predicate": "TREAT",
"predication_count": 49,
"subject": {
"identifier": {
"key": "MESH",
"value": "D007037"
},
"semantic_type_name": "Disease"
}
},
{
"_id": "Chemical|MESH:D008713-ASSOCIATE-Disease|MESH:D007037",
"_score": 16.60503,
"object": {
"identifier": {
"key": "MESH",
"value": "D008713"
},
"semantic_type_name": "Chemical"
},
"pmid": [37931916],
"pmid_count": 1,
"predicate": "ASSOCIATE",
"predication_count": 1,
"subject": {
"identifier": {
"key": "MESH",
"value": "D007037"
},
"semantic_type_name": "Disease"
}
}
]
Note that predication_count refers to the number of original records with the same subject.identifier.value - predicate - object.identifier.value triple, pmid is the list of unique PMIDs across those predications, and pmid_count is the length of that list. Let me know if you have any questions!
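The collapsing step described above could look roughly like the following. This is only a sketch under stated assumptions, not the plugin's actual parser: the function name aggregate_relations and the tuple layout (pmid, predicate, first entity, second entity) are hypothetical, and the repeated row in the sample data is invented to show how predication_count and pmid_count can differ.

```python
from collections import defaultdict


def aggregate_relations(rows):
    """Collapse (pmid, predicate, first_entity, second_entity) rows into one
    record per (first_entity, predicate, second_entity) triple.

    Hypothetical helper for illustration; nested subject/object fields are
    omitted to keep the sketch short.
    """
    grouped = defaultdict(list)
    for pmid, predicate, first, second in rows:
        grouped[(first, predicate.upper(), second)].append(int(pmid))

    records = []
    for (first, predicate, second), pmids in grouped.items():
        unique_pmids = sorted(set(pmids))
        records.append({
            "_id": f"{first}-{predicate}-{second}",  # no PMID in the _id
            "predicate": predicate,
            "pmid": unique_pmids,                    # unique PMIDs only
            "pmid_count": len(unique_pmids),
            "predication_count": len(pmids),         # every original row counts
        })
    return records


# Sample rows; the duplicated 729631 row is made up for illustration.
rows = [
    (729631, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (729631, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (17161219, "cause", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
    (26214210, "treat", "Chemical|MESH:D008713", "Disease|MESH:D007037"),
]

for record in aggregate_relations(rows):
    print(record["_id"], record["pmid_count"], record["predication_count"])
```

Grouping on the full triple rather than on the PMID is what lets the PMID drop out of the _id field while still keeping every supporting PMID on the merged record.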
... and adding two other tweaks to the parser. As always, let me know if any clarifications are needed...
name field for diseases, chemicals, and genes
All chemicals use MESH IDs. Those can be resolved to names using mychem.info, e.g., https://mychem.info/v1/query?q=umls.mesh:C579720. The name can be drawn from this list of fields in the JSON (in order of priority):
Diseases use either MESH (e.g., D015179, which can be searched using https://mydisease.info/v1/query?q=disease_ontology.xrefs.mesh:D015179%20OR%20umls.mesh.preferred:D015179%20OR%20ctd.mesh:D015179%20OR%20mondo.xrefs.mesh:D015179) or OMIM (e.g., 610251, which can be searched using https://mydisease.info/v1/query?q=mondo.xrefs.omim:610251%20OR%20hpo.omim:610251%20OR%20ctd.omim:610251). The name can be drawn from this list of fields in the JSON:
Genes are always specified by the NCBI Gene ID, resolved using, e.g., https://mygene.info/v3/gene/1017. The name can be drawn from this list:
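For reference, the three lookup URLs quoted above can be generated from an identifier as in the sketch below. The function names are hypothetical, the URL patterns are copied from the examples in this comment, and the sketch only builds the URLs: it makes no requests and does not encode which response field supplies the name.

```python
from urllib.parse import quote

# Query fields taken from the example URLs in this comment.
DISEASE_FIELDS = {
    "MESH": [
        "disease_ontology.xrefs.mesh",
        "umls.mesh.preferred",
        "ctd.mesh",
        "mondo.xrefs.mesh",
    ],
    "OMIM": ["mondo.xrefs.omim", "hpo.omim", "ctd.omim"],
}


def chemical_lookup_url(mesh_id):
    """mychem.info query for a chemical MESH ID."""
    return f"https://mychem.info/v1/query?q=umls.mesh:{mesh_id}"


def disease_lookup_url(key, value):
    """mydisease.info query for a MESH or OMIM disease identifier."""
    if key not in DISEASE_FIELDS:
        raise ValueError(f"unsupported disease identifier type: {key}")
    query = " OR ".join(f"{field}:{value}" for field in DISEASE_FIELDS[key])
    # Keep ':' literal and encode spaces as %20, matching the URLs above.
    return "https://mydisease.info/v1/query?q=" + quote(query, safe=":")


def gene_lookup_url(ncbi_gene_id):
    """mygene.info record URL for an NCBI Gene ID."""
    return f"https://mygene.info/v3/gene/{ncbi_gene_id}"


print(chemical_lookup_url("C579720"))
print(disease_lookup_url("OMIM", "610251"))
print(gene_lookup_url(1017))
```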
CorrespondingGene identifiers
Consider this set of relations from the pubtator3 relations file:
$ gzip -cd relation2pubtator3.gz | grep 33847607
33847607 associate DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524 Disease|MESH:D053713
33847607 associate Disease|MESH:D053713 Gene|4524
33847607 cause DNAMutation|RS#:1801133;HGVS:c.677C>T;CorrespondingGene:4524 Disease|MESH:D053713
The corresponding records are here: https://biothings.ci.transltr.io/pubtator3/query?q=pmid:33847607, one of which is pasted below:
{
"_id": "33847607-DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524-ASSOCIATE-Disease|MESH:D053713",
"_score": 1,
"object": [
{
"identifier": {
"key": "RS#",
"value": "1801131"
},
"semantic_type_name": "DNAMutation"
},
{
"identifier": {
"key": "HGVS",
"value": "c.1298A>C"
},
"semantic_type_name": "DNAMutation"
},
{
"identifier": {
"key": "CorrespondingGene",
"value": "4524"
},
"semantic_type_name": "DNAMutation"
}
],
"pmid": 33847607,
"pmid_count": 1,
"predicate": "ASSOCIATE",
"predication_count": 1,
"subject": {
"identifier": {
"key": "MESH",
"value": "D053713"
},
"semantic_type_name": "Disease"
}
}
The identifier for CorrespondingGene:4524 should be removed, since there is already a separate record linking Gene|4524 to Disease|MESH:D053713. (Spot-checking a few other examples, this redundancy appears to hold universally.)
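One way the parser might drop these identifiers is sketched below. It assumes each relation line has four whitespace-separated columns (PMID, predicate, and two entities) and that entity strings contain no spaces. The function names are hypothetical and this is not the plugin's existing code.

```python
def parse_entity(entity):
    """Split 'Type|KEY:value;KEY:value' into (type, [(key, value), ...]).

    A bare identifier such as 'Gene|4524' gets key None, mirroring the
    null key the API returns for genes.
    """
    semantic_type, _, id_part = entity.partition("|")
    identifiers = []
    for token in id_part.split(";"):
        key, sep, value = token.partition(":")
        if not sep:
            key, value = None, token
        identifiers.append((key, value))
    return semantic_type, identifiers


def drop_corresponding_gene(identifiers):
    """Remove CorrespondingGene entries, which duplicate a separate Gene relation."""
    return [(k, v) for k, v in identifiers if k != "CorrespondingGene"]


def parse_relation_line(line):
    """Parse one relation line into (pmid, predicate, first, second)."""
    pmid, predicate, first, second = line.split()
    parsed = []
    for entity in (first, second):
        semantic_type, identifiers = parse_entity(entity)
        parsed.append((semantic_type, drop_corresponding_gene(identifiers)))
    return int(pmid), predicate, parsed[0], parsed[1]


line = (
    "33847607 associate "
    "DNAMutation|RS#:1801131;HGVS:c.1298A>C;CorrespondingGene:4524 "
    "Disease|MESH:D053713"
)
print(parse_relation_line(line))
```

Filtering at parse time means the CorrespondingGene value never reaches the _id or the identifier list, so the DNAMutation record keeps only its RS# and HGVS identifiers.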
Website: https://www.ncbi.nlm.nih.gov/research/pubtator3/ FTP: https://www.ncbi.nlm.nih.gov/research/pubtator3/ (We are most interested in relation2pubtator3.gz)
Pubtator3 is the latest iteration of pubtator from Zhiyong Lu's group at NCBI. It includes an analysis of all 35+ million abstracts in PubMed and nearly 6 million full-text articles in the PMC Text Mining subset, resulting in 1.6 billion entity annotations and 33 million extracted relations (8.8 million unique pairs of entities).
Let's try to use the same structure as we did for the semmeddb API, e.g., https://biothings.transltr.io/semmeddb/association/C0040077-STIMULATES-C0076591
NOTE that pubtator 3 also has an API at https://www.ncbi.nlm.nih.gov/research/pubtator3/api, but their usage restrictions mean we should just set up our own...