Load HMDB data for protein-associated metabolites

andrewsu commented 3 years ago

"The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body" (from https://hmdb.ca/). HMDB contains links between proteins and the metabolites they are associated with. For example, the HMDB record for Homogentisate 1,2-dioxygenase (UniProtKB:Q93099) is HMDBP00842, and https://hmdb.ca/proteins/HMDBP00842/metabolite_protein_links shows the metabolites associated with this protein. These relationships can also be downloaded from the HMDB downloads page, and specifically the "All proteins" file.

This issue tracks the loading of these protein-associated metabolites to mygene.info.

andrewsu commented 3 years ago

Let's create this as a standalone "pending" API for now. Also, let's create this as an "association"-style API, where each document describes a triple (subject/object/predicate). This aligns with how we structured the semmeddb API described in this comment https://github.com/biothings/pending.api/issues/30#issuecomment-904319224.

NikkiBytes commented 3 years ago

Working example of the association structure for metabolite HMDB data below...

For an example of the data file and one protein, see here. This is the file I am currently extracting from to get the structure below.

The protein data is in a nested .xml file, and the association IDs are found under the metabolite_reference tag and the metabolite_associations tag. The former contains a reference to the pmid, while metabolite_associations contains only id and name.

Currently I am adding the metabolite_associations data.

Below is a clean working version of the association structure of the metabolite_reference data.

I hope this is clear. @colleenXu and @newgene, could you review the structure below and let me know if there are any details to modify or add? For all data from the file, preview the data file example.

Also, @andrewsu mentioned that the HMDB ID for metabolites might not be used within Translator and that, since HMDB has already done mappings to other database identifiers (e.g., https://hmdb.ca/metabolites/HMDB0015122#links), I should include those in the object dict. These links are not in the proteins file, so I am looking for a file to extract them from.

[
    {
        "_id": "HMDBP00001_1",
        "predicate": "biolink:related_to",
        "pmid": "11752352",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944"
        }
    },
    {
        "_id": "HMDBP00001_2",
        "predicate": "biolink:related_to",
        "pmid": "16426349",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944"
        }
    },
    {
        "_id": "HMDBP00001_3",
        "predicate": "biolink:related_to",
        "pmid": "19842938",
        "subject": {
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Cytarabine",
            "accession": "HMDB0015122"
        }
    }
]
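As a rough illustration of how documents like the ones above could be generated, here is a minimal sketch. The XML snippet is a hypothetical, simplified stand-in for one protein entry: the nesting and the pubmed_id tag are assumptions, and the real file may also use an XML namespace that would need handling. The function name iter_associations is made up for this example; it is not the actual parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one protein entry. Tag nesting is an
# assumption; only metabolite_reference is named in the thread.
SAMPLE_XML = """
<hmdb>
  <protein>
    <accession>HMDBP00001</accession>
    <uniprot_id>P21589</uniprot_id>
    <uniprot_name>5NTD_HUMAN</uniprot_name>
    <gene_name>NT5E</gene_name>
    <metabolite_references>
      <metabolite_reference>
        <metabolite>
          <name>Pentoxifylline</name>
          <accession>HMDB0014944</accession>
        </metabolite>
        <reference><pubmed_id>11752352</pubmed_id></reference>
      </metabolite_reference>
      <metabolite_reference>
        <metabolite>
          <name>Pentoxifylline</name>
          <accession>HMDB0014944</accession>
        </metabolite>
        <reference><pubmed_id>16426349</pubmed_id></reference>
      </metabolite_reference>
    </metabolite_references>
  </protein>
</hmdb>
"""

# Subset of the subject fields shown in the example documents.
SUBJECT_FIELDS = ("uniprot_id", "uniprot_name", "gene_name")


def iter_associations(xml_text):
    """Yield one association document per metabolite_reference element."""
    root = ET.fromstring(xml_text)
    for protein in root.iter("protein"):
        protein_acc = protein.findtext("accession")
        subject = {field: protein.findtext(field) for field in SUBJECT_FIELDS}
        refs = protein.iter("metabolite_reference")
        for n, ref in enumerate(refs, start=1):
            yield {
                "_id": f"{protein_acc}_{n}",
                "predicate": "biolink:related_to",
                "pmid": ref.findtext("reference/pubmed_id"),
                "subject": dict(subject),  # copy so documents stay independent
                "object": {
                    "name": ref.findtext("metabolite/name"),
                    "accession": ref.findtext("metabolite/accession"),
                },
            }


docs = list(iter_associations(SAMPLE_XML))
```

Each reference becomes its own document, so two references to the same metabolite give two documents that differ only in _id and pmid.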

colleenXu commented 3 years ago

@andrewsu @NikkiBytes A few questions after looking over the XML example vs the website:

I think the type of protein (it's under ) could be included. Perhaps it would be helpful for deciding what kind of association this is. I think possible values are "enzyme", "transporter", "cofactor", and "unknown", but I'm not sure.
Do we want to use biolink predicates here? It feels like that's inserting info that's not in the original data.
The website under "biological properties" seems to have info on reactions that the protein catalyzes, which shows metabolites as inputs or outputs. I don't see it in the XML example for some reason. Having that info would be nice because it might allow for a more specific metabolite-protein relationship for each pair.

considerations

I can find the reactions here. If only we could pull the data for all of these reactions...
At the bottom of the metabolite webpage, there seems to be even more info. The page says these are metabolite-enzyme relationships and specifies the reaction the enzyme performs (so you can see when the chemical is an input or an output of the reaction). Again, strangely, I don't see any of that info in the XML for the page.

Notes on modeling for translator / biolink:

It looks like the biolink-model does not expect a direct link between the gene/protein and the metabolites. Instead, it shows Gene/Protein <-> Reaction/MolecularActivity, and the reaction then has inputs or outputs that are the chemical entities (metabolites).

andrewsu commented 3 years ago

A few followups to @colleenXu's reply, hitting the bullet points in order:

sure, if protein type is easily added under subject.protein_type, sounds good. I don't see an immediate application of this so not the highest priority if not.
right, great point. We will add that in the smartAPI record, so let's remove predicate
hmm, agreed, links to specific reactions would be great. I'm not seeing a place where they would live in any of the other downloadable files, so I'm guessing they are excluding this purposely. Let's forge ahead with the data we have through the XML files.
same as the last point
same as the last point
I'm not sure how we'd model this all under a molecular activity in a way that would be accessible to BTE/Translator. And again, this is also dependent on having reaction info in a structured, downloadable file? So this extension of the parser is also a moot point unless we solve that?

@NikkiBytes for now, go ahead and move forward after making the changes described in the first two bullet points above.

NikkiBytes commented 2 years ago

Here is an example of the newly edited structure. I was able to pull the protein_type and added the alternative IDs. Integrating into an API now.

[
    {
        "_id": "HMDBP00001_1",
        "pmid": "11752352",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C01092",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
    {
        "_id": "HMDBP00001_2",
        "pmid": "16426349",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C01092",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
.
.
.
.
]
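The revision shown above (predicate removed, protein_type added to the subject, alternative IDs added to the object) can be sketched as a small transformation. This is only an illustration: revise_document and xrefs_by_accession are hypothetical names, and the lookup is assumed to have been built separately from whichever HMDB file carries the cross-references.

```python
import copy


def revise_document(doc, protein_type, xrefs_by_accession):
    """Return a revised copy of doc; the input document is left untouched."""
    revised = copy.deepcopy(doc)
    # The predicate will live in the smartAPI record, so drop it here.
    revised.pop("predicate", None)
    # Put protein_type first, matching the field order in the examples.
    revised["subject"] = {"protein_type": protein_type, **revised["subject"]}
    # Add alternative identifiers, skipping empty values.
    xrefs = xrefs_by_accession.get(revised["object"]["accession"], {})
    revised["object"].update({k: v for k, v in xrefs.items() if v})
    return revised


# Hypothetical lookup keyed by HMDB metabolite accession.
xrefs_by_accession = {
    "HMDB0014944": {"chemspider_id": "4578", "pubchem_compound_id": "4740"},
}

old_doc = {
    "_id": "HMDBP00001_1",
    "predicate": "biolink:related_to",
    "pmid": "11752352",
    "subject": {"uniprot_id": "P21589", "gene_name": "NT5E"},
    "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"},
}

new_doc = revise_document(old_doc, "Unknown", xrefs_by_accession)
```

Metabolites missing from the lookup simply keep their name and accession.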

NikkiBytes commented 2 years ago

Here is an example output of a single record generated with the parser:

{
    "_id": "HMDBP00001_1",
    "pmid": "11752352",
    "subject": {
        "protein_type": "Unknown",
        "uniprot_id": "P21589",
        "uniprot_name": "5NTD_HUMAN",
        "genbank_protein_id": "23897",
        "hgnc_id": "HGNC:8021",
        "genbank_gene_id": "X55740",
        "gene_name": "NT5E"
    },
    "object": {
        "name": "Pentoxifylline",
        "accession": "HMDB0014944",
        "kegg_id": "C07424",
        "chemspider_id": "4578",
        "chebi_id": "127029",
        "pubchem_compound_id": "4740"
    }
}

I think it addresses all the details mentioned.

Note: some records differ only in pmid. See here how the records are identical except for different pmid values. Making separate records is the current production method. If wanted, we can combine the pmid values into a list and have a single record. This is just a simple detail to consider.
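For reference, the alternative of combining pmid values into a list could look roughly like the sketch below. It assumes that a protein-metabolite pair is identified by subject.uniprot_id plus object.accession, and it keeps the _id of the first record seen for each pair; both choices are assumptions for illustration, and merge_pmids is a hypothetical helper.

```python
def merge_pmids(docs):
    """Collapse documents that differ only in pmid into one per pair."""
    merged = {}
    for doc in docs:
        key = (doc["subject"]["uniprot_id"], doc["object"]["accession"])
        if key not in merged:
            # Start from the first record seen for this pair, minus its pmid.
            combined = {k: v for k, v in doc.items() if k != "pmid"}
            combined["pmid"] = []
            merged[key] = combined
        pmid = doc.get("pmid")
        if pmid and pmid not in merged[key]["pmid"]:
            merged[key]["pmid"].append(pmid)
    return list(merged.values())


records = [
    {"_id": "HMDBP00001_1", "pmid": "11752352",
     "subject": {"uniprot_id": "P21589"},
     "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"}},
    {"_id": "HMDBP00001_2", "pmid": "16426349",
     "subject": {"uniprot_id": "P21589"},
     "object": {"name": "Pentoxifylline", "accession": "HMDB0014944"}},
    {"_id": "HMDBP00001_3", "pmid": "19842938",
     "subject": {"uniprot_id": "P21589"},
     "object": {"name": "Cytarabine", "accession": "HMDB0015122"}},
]

combined_records = merge_pmids(records)
```

With the three records above, the two Pentoxifylline records collapse into one whose pmid is a two-item list, and the Cytarabine record keeps a one-item list.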

When running my parser on BioThings Hub the dumper is successful, but the uploader is running into this problem:

Links to reference files: repo, parser file, manifest file

@colleenXu have you seen this error before? Or is there something obviously wrong with the files, etc.? I have been able to solve all errors up to this point, and I have a few ideas of what this could be, but any feedback is appreciated. Thank you! When this is solved, it's ready for the next steps.

zcqian commented 2 years ago

Can you paste the logs and stack trace here?

colleenXu commented 2 years ago

@NikkiBytes please follow up with @zcqian . I am not involved in the process of actually uploading / creating APIs...

NikkiBytes commented 2 years ago

Thank you @zcqian, here are the logs:

root | OPTIONS args: ('prot_meta_assc_hmdb.prot_meta_assc_hmdb',), kwargs: {} | 2021-10-13T20:39:00
-- | -- | --
tornado.access | 200 OPTIONS /source/prot_meta_assc_hmdb.prot_meta_assc_hmdb/upload (172.17.0.1) 1.48ms | 2021-10-13T20:39:00
hub | Building task: functools.partial(<bound method UploaderManager.create_and_load of <UploaderManager [1 registered]: ['prot_meta_assc_hmdb']>>, <class 'biothings.hub.dataplugin.assistant.AssistedUploader_prot_meta_assc_hmdb'>, job_manager=<biothings.utils.manager.JobManager object at 0x7f9819dcaf98>) | 2021-10-13T20:39:00
upload_prot_meta_assc_hmdb | Uploading 'prot_meta_assc_hmdb' (collection: prot_meta_assc_hmdb) | 2021-10-13T20:39:00
tornado.access | 200 PUT /source/prot_meta_assc_hmdb.prot_meta_assc_hmdb/upload (172.17.0.1) 31.61ms | 2021-10-13T20:39:00
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:00
tornado.access | 200 GET /commands?running=1 (172.17.0.1) 21.04ms | 2021-10-13T20:39:00
tornado.access | 200 GET /source/prot_meta_assc_hmdb (172.17.0.1) 20.53ms | 2021-10-13T20:39:00
upload_prot_meta_assc_hmdb | Load data from directory: '/data/biothings_studio/datasources/prot_meta_assc_hmdb/2020-09-08' | 2021-10-13T20:39:00
root | Uploading to the DB... | 2021-10-13T20:39:00
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:00
tornado.access | 304 GET /source/prot_meta_assc_hmdb (172.17.0.1) 33.08ms | 2021-10-13T20:39:00
tornado.access | 200 GET /job_manager (172.17.0.1) 9.13ms | 2021-10-13T20:39:00
tornado.access | 200 GET /job_manager (172.17.0.1) 127.38ms | 2021-10-13T20:39:03
tornado.access | 200 GET /job_manager (172.17.0.1) 5.70ms | 2021-10-13T20:39:06
asyncio | Exception in callback JobManager.defer_to_process.<locals>.run.<locals>.ran(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/utils/manager.py:685 handle: <Handle JobManager.defer_to_process.<locals>.run.<locals>.ran(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/utils/manager.py:685> Traceback (most recent call last):   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
asyncio | Exception in callback BaseSourceUploader.update_data.<locals>.uploaded(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/hub/dataload/uploader.py:354 handle: <Handle BaseSourceUploader.update_data.<locals>.uploaded(<Future finis...r pending.',)>) at /home/biothings/biothings_studio/biothings/hub/dataload/uploader.py:354> Traceback (most recent call last):   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
upload_prot_meta_assc_hmdb | failed [steps=data,post,master,clean]: A process in the process pool was terminated abruptly while the future was running or pending. Traceback (most recent call last):   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 487, in load     **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 362, in update_data     yield from job   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
hub | failed: A process in the process pool was terminated abruptly while the future was running or pending. Traceback (most recent call last):   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 798, in done     f.result()   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 816, in create_and_load     yield from inst.load(*args, **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 487, in load     **kwargs)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 362, in update_data     yield from job   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/hub/dataload/uploader.py", line 356, in uploaded     if type(f.result()) != int:   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 694, in run     res = yield from res   File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run     self._callback(*self._args)   File "/home/biothings/biothings_studio/biothings/utils/manager.py", line 688, in ran     r = f.result() concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. | 2021-10-13T20:39:08
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:08
tornado.access | 200 GET /source/prot_meta_assc_hmdb (172.17.0.1) 29.37ms | 2021-10-13T20:39:08
root | Can't find hard-coded mapping, now searching src_master: Not hard-coded mapping | 2021-10-13T20:39:08
tornado.access | 304 GET /source/prot_meta_assc_hmdb (172.17.0.1) 18.32ms | 2021-10-13T20:39:08
tornado.access | 200 GET /commands?running=1 (172.17.0.1) 1.14ms | 2021-10-13T20:39:08
tornado.access | 200 GET /job_manager (172.17.0.1) 6.20ms

NikkiBytes commented 2 years ago

A few notes/updates on the parser:

Fixed ERROR BAD CRC-32: the current version (5.0) of the input metabolite file, hmdb_metabolites.xml from HMDB downloads, is corrupt and produces the error. Version 4.0 works and is being used for now; we will want to revisit this later.
The structure is still the same: an association-centric style document.
Adding the mapping file, testing on BioThings Studio, and forking to the pending API.
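If the metabolite download is handled as a zip archive (an assumption; the notes above only name hmdb_metabolites.xml), a quick integrity check before parsing might look like the sketch below. It relies on the standard library's ZipFile.testzip(), which reads every member and reports the first one whose CRC or header is bad. The helper name and the archive built for the demonstration are hypothetical.

```python
import os
import tempfile
import zipfile


def first_corrupt_member(zip_path):
    """Return the name of the first member that fails its check, or None."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.testzip()


# Demonstration on a tiny archive built on the fly; the file names are
# hypothetical stand-ins for the real HMDB download.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hmdb_metabolites.zip")
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("hmdb_metabolites.xml", "<hmdb></hmdb>")
    result = first_corrupt_member(path)

print(result)  # None for an intact archive
```

Running a check like this in the dumper step would let a corrupt release be flagged before the uploader starts.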

Document Structure Example

[
    {
        "_id": "HMDBP00001_1",
        "pmid": "11752352",
        "subject": {
            "protein_type": "Unknown",
            "uniprot_id": "P21589",
            "uniprot_name": "5NTD_HUMAN",
            "genbank_protein_id": "23897",
            "hgnc_id": "HGNC:8021",
            "genbank_gene_id": "X55740",
            "gene_name": "NT5E"
        },
        "object": {
            "name": "Pentoxifylline",
            "accession": "HMDB0014944",
            "kegg_id": "C07424",
            "chemspider_id": "4578",
            "chebi_id": "127029",
            "pubchem_compound_id": "4740"
        }
    },
    .
    .
    .
]

andrewsu commented 2 years ago

Looking at the thread above, looks like this data plugin is ready for deployment as a pending API... Assigning to @erikyao to evaluate...

erikyao commented 1 year ago

API published, https://biothings.ncats.io/hmdb
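A query against the published API could be built along these lines. This is a hedged sketch: it assumes the API follows the usual BioThings /query?q=field:value convention and that subject.uniprot_id is indexed as a queryable field, neither of which is confirmed in this thread. The helper only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base URL taken from the comment above.
BASE_URL = "https://biothings.ncats.io/hmdb"


def build_query_url(field, value):
    """Build a query URL for one field:value term (hypothetical helper)."""
    # Assumes the standard BioThings /query endpoint.
    return f"{BASE_URL}/query?{urlencode({'q': f'{field}:{value}'})}"


url = build_query_url("subject.uniprot_id", "P21589")
print(url)
```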

colleenXu commented 10 months ago

biothings / mygene.info

Load HMDB data for protein-associated metabolites #110