biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Load HMDB data for protein-associated metabolites #110

Closed andrewsu closed 8 months ago

andrewsu commented 3 years ago

"The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body" (from https://hmdb.ca/). HMDB contains links between proteins and the metabolites they are associated with. For example, the HMDB record for Homogentisate 1,2-dioxygenase (UniProtKB:Q93099) is HMDBP00842, and https://hmdb.ca/proteins/HMDBP00842/metabolite_protein_links shows the metabolites associated with this protein. These relationships can also be downloaded from the HMDB downloads page, and specifically the "All proteins" file.

This issue tracks the loading of these protein-associated metabolites to mygene.info.

Related to https://github.com/NCATSTranslator/testing/issues/49

andrewsu commented 3 years ago

Let's create this as a standalone "pending" API for now. Also, let's create this as an "association"-style API, where each document describes a triple (subject/object/predicate). This aligns with how we structured the semmeddb API described in this comment https://github.com/biothings/pending.api/issues/30#issuecomment-904319224.

NikkiBytes commented 3 years ago

Working example of the association structure for metabolite HMDB data below...

For an example of the data file and one protein, see here. This is the file I am currently extracting from to get the structure below.

The protein data is in a nested .xml file, and the association IDs are found under the metabolite_reference tag and the metabolite_associations tag. The former contains a reference to the pmid, while metabolite_associations only contains the id and name.
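A minimal sketch of reading both tags with Python's standard library, added for illustration. The element layout in `SAMPLE_XML` (for example `metabolite/accession` and `reference/pubmed_id`) is an assumption about the proteins file rather than something confirmed in this thread, and the function names are hypothetical. The sample is also written without an XML namespace; if the real file declares one, the tag paths would need it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down protein record. The nesting shown here is an
# assumption about the HMDB proteins file, not a copy of it.
SAMPLE_XML = """
<hmdb>
  <protein>
    <accession>HMDBP00001</accession>
    <metabolite_associations>
      <metabolite>
        <accession>HMDB0014944</accession>
        <name>Pentoxifylline</name>
      </metabolite>
    </metabolite_associations>
    <metabolite_references>
      <metabolite_reference>
        <metabolite>
          <name>Pentoxifylline</name>
          <accession>HMDB0014944</accession>
        </metabolite>
        <reference>
          <pubmed_id>11752352</pubmed_id>
        </reference>
      </metabolite_reference>
    </metabolite_references>
  </protein>
</hmdb>
""".strip()


def extract_references(protein):
    """Return (accession, name, pmid) tuples from metabolite_reference elements."""
    rows = []
    for ref in protein.iter("metabolite_reference"):
        rows.append((
            ref.findtext("metabolite/accession"),
            ref.findtext("metabolite/name"),
            ref.findtext("reference/pubmed_id"),
        ))
    return rows


def extract_associations(protein):
    """Return (accession, name) tuples from metabolite_associations (no pmid there)."""
    rows = []
    for met in protein.findall("metabolite_associations/metabolite"):
        rows.append((met.findtext("accession"), met.findtext("name")))
    return rows


root = ET.fromstring(SAMPLE_XML)
protein = root.find("protein")
references = extract_references(protein)
associations = extract_associations(protein)
```

The two functions are kept separate because only the metabolite_reference entries carry a pmid.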

Currently I am adding the metabolite_associations data.

Below is a clean working version of the association structure of the metabolite_reference data.

I hope this is clear. @colleenXu and @newgene, could you look over the structure below and let me know if there are any details to modify or add? For all data from the file, preview the data file example.

Also, @andrewsu mentioned that the HMDB ID for metabolites might not be used within Translator, and that since HMDB has already mapped its IDs to other database identifiers (e.g., https://hmdb.ca/metabolites/HMDB0015122#links), I should include those in the object dict. These links are not in the proteins file, so I am looking for a file to extract them from.

[
    {
        "_id": "HMDBP00001_1",
        "predicate": "biolink:related_to",
        "pmid": "11752352",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944"
        }
    },
    {
        "_id": "HMDBP00001_2",
        "predicate": "biolink:related_to",
        "pmid": "16426349",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944"
        }
    },
    {
        "_id": "HMDBP00001_3",
        "predicate": "biolink:related_to",
        "pmid": "19842938",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Cytarabine",
            "accession": "HMDB0015122"
        }
    }
]
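For illustration, here is a rough sketch of how documents with this shape could be assembled, one per protein/metabolite/pmid combination. The function name `build_documents` and its arguments are hypothetical; the only things taken from the example above are the field names and the `_id` pattern (protein accession plus a running counter).

```python
def build_documents(protein_accession, subject, references):
    """Build one association document per (accession, name, pmid) reference.

    `references` is an iterable of (metabolite_accession, name, pmid) tuples.
    The "_id" is the protein accession plus a 1-based counter, as in the
    example structure above.
    """
    docs = []
    for n, (accession, name, pmid) in enumerate(references, start=1):
        docs.append({
            "_id": f"{protein_accession}_{n}",
            "predicate": "biolink:related_to",
            "pmid": pmid,
            "subject": dict(subject),  # copy so documents do not share one dict
            "object": {"name": name, "accession": accession},
        })
    return docs


subject = {
    "uniprot_id": "P21589",
    "uniprot_name": "5NTD_HUMAN",
    "genbank_protein_id": "23897",
    "hgnc_id": "HGNC:8021",
    "genbank_gene_id": "X55740",
    "gene_name": "NT5E",
}
references = [
    ("HMDB0014944", "Pentoxifylline", "11752352"),
    ("HMDB0014944", "Pentoxifylline", "16426349"),
    ("HMDB0015122", "Cytarabine", "19842938"),
]
docs = build_documents("HMDBP00001", subject, references)
```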
colleenXu commented 3 years ago

@andrewsu @NikkiBytes A few questions after looking over the XML example vs the website:


considerations

Notes on modeling for translator / biolink:

andrewsu commented 3 years ago

A few followups to @colleenXu's reply, hitting the bullet points in order:

@NikkiBytes for now, go ahead and move forward after making the changes described in the first two bullet points above.

NikkiBytes commented 2 years ago

Here is an example of the newly edited structure. I was able to pull the protein_type and added the alternative IDs. I am integrating this into an API now.

[
    {
        "_id": "HMDBP00001_1",
        "pmid": "11752352",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C01092",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
    {
        "_id": "HMDBP00001_2",
        "pmid": "16426349",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C01092",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
.
.
.
.
]
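A sketch of how the alternative IDs could be attached to the object dict, for illustration only. It assumes a lookup keyed by HMDB metabolite accession has already been built from some other HMDB file; `add_xrefs`, `XREF_FIELDS`, and `xref_lookup` are hypothetical names. In this sample the lookup deliberately leaves out kegg_id to show that a missing identifier is simply skipped.

```python
# Identifier fields copied onto the object dict when available.
XREF_FIELDS = ("kegg_id", "chemspider_id", "chebi_id", "pubchem_compound_id")


def add_xrefs(doc, xref_lookup):
    """Return a copy of `doc` whose object carries any alternative IDs found."""
    enriched = dict(doc)
    obj = dict(doc["object"])
    xrefs = xref_lookup.get(obj["accession"], {})
    for field in XREF_FIELDS:
        value = xrefs.get(field)
        if value:  # skip missing or empty identifiers
            obj[field] = value
    enriched["object"] = obj
    return enriched


# Hypothetical lookup; kegg_id is omitted on purpose.
xref_lookup = {
    "HMDB0014944": {
        "chemspider_id": "4578",
        "chebi_id": "127029",
        "pubchem_compound_id": "4740",
    }
}
doc = {
    "_id": "HMDBP00001_1",
    "pmid": "11752352",
    "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"},
}
enriched = add_xrefs(doc, xref_lookup)
```

Returning a copy leaves the input document untouched, which makes the step easy to rerun.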
NikkiBytes commented 2 years ago

Here is an example output of a single record generated with the parser:

{
    "_id": "HMDBP00001_1",
    "pmid": "11752352",
    "subject": {
        "protein_type": "Unknown",
        "uniprot_id": "P21589",
        "uniprot_name": "5NTD_HUMAN",
        "genbank_protein_id": "23897",
        "hgnc_id": "HGNC:8021",
        "genbank_gene_id": "X55740",
        "gene_name": "NT5E"
    },
    "object": {
        "name": "Pentoxifylline",
        "accession": "HMDB0014944",
        "kegg_id": "C07424",
        "chemspider_id": "4578",
        "chebi_id": "127029",
        "pubchem_compound_id": "4740"
    }
}

I think it addresses all the details mentioned.

Note: some records differ only in pmid. See here how the records are identical except for their pmid values. The parser currently makes a separate record for each pmid. If preferred, we can combine the pmid values into a list and have a single record. This is just a small detail to consider.
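If the single-record option were chosen, the merge could look roughly like this. It is a sketch, not what the parser does today: `merge_by_pmid` is a hypothetical helper, and keeping the first record's `_id` for the merged document is an arbitrary choice made here.

```python
import json


def merge_by_pmid(docs):
    """Collapse documents that differ only in pmid into one with a pmid list."""
    merged = {}
    for doc in docs:
        # Two documents belong together when subject and object are identical.
        key = json.dumps(
            {"subject": doc["subject"], "object": doc["object"]}, sort_keys=True
        )
        if key not in merged:
            combined = {k: v for k, v in doc.items() if k != "pmid"}
            combined["pmid"] = []
            merged[key] = combined  # the first record's _id is kept
        if doc["pmid"] not in merged[key]["pmid"]:
            merged[key]["pmid"].append(doc["pmid"])
    return list(merged.values())


subject = {"uniprot_id": "P21589", "gene_name": "NT5E"}
docs = [
    {"_id": "HMDBP00001_1", "pmid": "11752352", "subject": subject,
     "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"}},
    {"_id": "HMDBP00001_2", "pmid": "16426349", "subject": subject,
     "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"}},
    {"_id": "HMDBP00001_3", "pmid": "19842938", "subject": subject,
     "object": {"name": "Cytarabine", "accession": "HMDB0015122"}},
]
merged_docs = merge_by_pmid(docs)
```

With the three sample records, the two Pentoxifylline documents collapse into one and the Cytarabine document stays on its own.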

When running my parser on BioThings Hub the dumper is successful, but the uploader is running into this problem:

upload_error_hmdbdata

Links to reference files: repo, parser file, manifest file

@colleenXu have you seen this error before? Or is there something obviously wrong with the files? I have been able to solve all errors up to this point, and I have a few ideas of what this could be, but any feedback is appreciated. Thank you! Once this is solved, it's ready for the next steps.

zcqian commented 2 years ago

Can you paste the logs and stack trace here?

colleenXu commented 2 years ago

@NikkiBytes please follow up with @zcqian. I am not involved in the process of actually uploading / creating APIs...

NikkiBytes commented 2 years ago

Thank you @zcqian, here are the logs:

root | OPTIONS args: ('prot_meta_assc_hmdb.prot_meta_assc_hmdb',), kwargs: {} | 2021-10-13T20:39:00
-- | -- | --
tornado.access | 200 OPTIONS /source/prot_meta_assc_hmdb.prot_meta_assc_hmdb/upload (172.17.0.1) 1.48ms | 2021-10-13T20:39:00
hub | Building task: functools.partial(<bound method UploaderManager.create_and_load of <UploaderManager [1 registered]: ['prot_meta_assc_hmdb']>>, <class 'biothings.hub.dataplugin.assistant.AssistedUploader_prot_meta_assc_hmdb'>, job_manager=<biothings.utils.manager.JobManager object at 0x7f9819dcaf98>) | 2021-10-13T20:39:00
upload_prot_meta_assc_hmdb | Uploading 'prot_meta_assc_hmdb' (collection: prot_meta_assc_hmdb) | 2021-10-13T20:39:00
tornado.access | 200 PUT /source/prot_meta_assc_hmdb.prot_meta_assc_hmdb/upload (172.17.0.1) 31.61ms | 2021-10-13T20:39:00
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:00
tornado.access | 200 GET /commands?running=1 (172.17.0.1) 21.04ms | 2021-10-13T20:39:00
tornado.access | 200 GET /source/prot_meta_assc_hmdb (172.17.0.1) 20.53ms | 2021-10-13T20:39:00
upload_prot_meta_assc_hmdb | Load data from directory: '/data/biothings_studio/datasources/prot_meta_assc_hmdb/2020-09-08' | 2021-10-13T20:39:00
root | Uploading to the DB... | 2021-10-13T20:39:00
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:00
tornado.access | 304 GET /source/prot_meta_assc_hmdb (172.17.0.1) 33.08ms | 2021-10-13T20:39:00
tornado.access | 200 GET /job_manager (172.17.0.1) 9.13ms | 2021-10-13T20:39:00
tornado.access | 200 GET /job_manager (172.17.0.1) 127.38ms | 2021-10-13T20:39:03
tornado.access | 200 GET /job_manager (172.17.0.1) 5.70ms | 2021-10-13T20:39:06
asyncio | Exception in callback JobManager.defer_to_process.<locals>.run.<locals>.ran(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/utils/manager.py:685 handle: <Handle JobManager.defer_to_process.<locals>.run.<locals>.ran(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/utils/manager.py:685> Traceback (most recent call last):   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
asyncio | Exception in callback BaseSourceUploader.update_data.<locals>.uploaded(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/hub/dataload/uploader.py:354 handle: <Handle BaseSourceUploader.update_data.<locals>.uploaded(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/hub/dataload/uploader.py:354> Traceback (most recent call last):   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
upload_prot_meta_assc_hmdb | failed [steps=data,post,master,clean]: A process in the process pool was terminated abruptly while the future was running or pending. Traceback (most recent call last):   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 487, in load     **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 362, in update_data     yield from job   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
hub | failed: A process in the process pool was terminated abruptly while the future was running or pending. Traceback (most recent call last):   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 798, in done     f.result()   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 816, in create_and_load     yield from inst.load(*args, **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 487, in load     **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 362, in update_data     yield from job   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:08
tornado.access | 200 GET /source/prot_meta_assc_hmdb (172.17.0.1) 29.37ms | 2021-10-13T20:39:08
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:08
tornado.access | 304 GET /source/prot_meta_assc_hmdb (172.17.0.1) 18.32ms | 2021-10-13T20:39:08
tornado.access | 200 GET /commands?running=1 (172.17.0.1) 1.14ms | 2021-10-13T20:39:08
tornado.access | 200 GET /job_manager (172.17.0.1) 6.20ms
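The traceback alone does not say why the worker died. `BrokenProcessPool` only reports that a process in the pool was terminated abruptly. One possibility, which is an assumption and not something established in this thread, is that the worker ran out of memory while loading the whole XML file. If that turned out to be the case, a streaming parse along these lines would keep memory use roughly flat. `iter_proteins` is a hypothetical helper, and the `protein` tag name is assumed.

```python
import io
import xml.etree.ElementTree as ET


def iter_proteins(source):
    """Yield one <protein> element at a time, clearing each after it is used."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix so this works with or without one.
        if elem.tag.rsplit("}", 1)[-1] == "protein":
            yield elem
            elem.clear()  # free the element's children once the caller is done


# Tiny in-memory stand-in for the real file (assumed layout).
source = io.BytesIO(
    b"<hmdb>"
    b"<protein><accession>HMDBP00001</accession></protein>"
    b"<protein><accession>HMDBP00842</accession></protein>"
    b"</hmdb>"
)
accessions = [p.findtext("accession") for p in iter_proteins(source)]
```

Each element must be read before the generator advances, because it is cleared as soon as the next one is requested.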
NikkiBytes commented 2 years ago

A few notes/updates on the parser:

Document Structure Example

[
    {
        "_id": "HMDBP00001_1",
        "pmid": "11752352",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C07424",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
    .
    .
    .
]
andrewsu commented 2 years ago

Looking at the thread above, it looks like this data plugin is ready for deployment as a pending API. Assigning to @erikyao to evaluate.

erikyao commented 1 year ago

API published, https://biothings.ncats.io/hmdb
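A small sketch of building a query against the published API. It assumes this API follows the usual BioThings conventions (a `/query` endpoint taking `q` and `size` parameters) and that the indexed field names match the document structure discussed above; neither is confirmed in this thread. `build_query_url` is a hypothetical helper that only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://biothings.ncats.io/hmdb"


def build_query_url(field, value, size=10):
    """Build a fielded query URL, e.g. for subject.uniprot_id:P21589."""
    params = {"q": f"{field}:{value}", "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


url = build_query_url("subject.uniprot_id", "P21589")
```

The resulting URL can be opened in a browser or passed to any HTTP client.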

colleenXu commented 10 months ago

Related infores stuff is ready:

colleenXu commented 8 months ago

Going to close this issue and open another one for writing the SmartAPI yaml with x-bte annotations.