Should we do an ETL of PubChem?

RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)

MIT License


Should we do an ETL of PubChem? #22

Open saramsey opened 3 years ago

saramsey commented 3 years ago

PubChem has a download page: https://pubchemdocs.ncbi.nlm.nih.gov/downloads

and PubChem can provide names for some compounds (like CHEMBL.COMPOUND:CHEMBL1197434) that don't have names in ChEMBL. See issue RTXteam/RTX#1296.

ecwood commented 3 years ago

I see the benefit of this. PubChem identifiers show up often in other sources (most recently Reactome, but also DrugBank, SMPDB/PathWhiz, and HMDB).

ecwood commented 3 years ago

If we do an ETL of PubChem, we will need to be very selective about what we ingest. Each XML file holds a lot of data (for example, Substance_297000001_297500000.xml.gz is 5.5 GB unzipped), and there are a lot of XML files: the "CURRENT-Full" substance directory alone contains 808 of those gzipped files. We would also want more than just the substance directory, especially the compound directory.
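Given those file sizes, loading a whole document into memory is probably not an option. Below is a minimal sketch of a streaming parse using the standard library's `xml.etree.ElementTree.iterparse` over a gzipped stream. The tag names `PC-Substance` and `PC-ID_id`, the helper names, and the tiny in-memory sample are all assumptions for illustration; the real tag names would need to be checked against an actual PubChem file before relying on this.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Made-up miniature document; the tag names are assumed, not verified.
SAMPLE_XML = b"""<PC-Substances xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-Substance><PC-Substance_sid><PC-ID><PC-ID_id>101</PC-ID_id></PC-ID></PC-Substance_sid></PC-Substance>
  <PC-Substance><PC-Substance_sid><PC-ID><PC-ID_id>102</PC-ID_id></PC-ID></PC-Substance_sid></PC-Substance>
</PC-Substances>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_records(source, record_tag):
    """Yield each completed record element from a gzipped XML file or file object.

    The caller must read what it needs from a record before asking for the
    next one, because the record's contents are cleared afterwards.
    """
    with gzip.open(source, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) == record_tag:
                yield elem
                # Drop the record's children and text; an empty shell stays
                # attached to the root, which is far smaller than the record.
                elem.clear()


def extract_ids(source, record_tag="PC-Substance", id_tag="PC-ID_id"):
    """Collect the first integer found under id_tag in each record."""
    ids = []
    for record in iter_records(source, record_tag):
        for node in record.iter():
            if local_name(node.tag) == id_tag:
                ids.append(int(node.text))
                break
    return ids


if __name__ == "__main__":
    fake_file = io.BytesIO(gzip.compress(SAMPLE_XML))
    print(extract_ids(fake_file))
```

The same generator could be pointed at a path on disk instead of a `BytesIO` object, since `gzip.open` accepts either. Only the fields we decide to keep would need to be pulled out inside the loop.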

saramsey commented 3 years ago

> If we do an ETL of PubChem, we will need to be very selective about what we ingest. Each XML file holds a lot of data (for example, Substance_297000001_297500000.xml.gz is 5.5 GB unzipped), and there are a lot of XML files: the "CURRENT-Full" substance directory alone contains 808 of those gzipped files. We would also want more than just the substance directory, especially the compound directory.

What about using the RDF files? https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/
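To give a feel for what ingesting the RDF files might involve, here is a rough sketch that streams a gzipped Turtle file and yields simple triples. It assumes the files are gzipped Turtle with one complete `subject predicate object .` statement per line, which would need to be confirmed against the actual downloads. The `ex:hasLabel` predicate, the helper name, and the sample data are made up for illustration. A real ETL would more likely use a proper Turtle parser (for example, the third-party rdflib package) rather than line splitting.

```python
import gzip
import io

# Made-up sample; the predicate and literal values are placeholders.
SAMPLE_TTL = b"""@prefix compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix ex: <http://example.org/> .
compound:CID1 ex:hasLabel "example one" .
compound:CID2 ex:hasLabel "example two" .
"""


def iter_simple_triples(source):
    """Yield (subject, predicate, object) strings from one-line Turtle statements.

    Prefix declarations, comments, blank lines, and anything that does not
    look like a single-line 's p o .' statement are skipped.
    """
    with gzip.open(source, "rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if not line or line.startswith(("@prefix", "@base", "#")):
                continue
            if not line.endswith("."):
                continue  # part of a multi-line statement; out of scope here
            parts = line[:-1].strip().split(None, 2)
            if len(parts) == 3:
                yield tuple(parts)


if __name__ == "__main__":
    fake_file = io.BytesIO(gzip.compress(SAMPLE_TTL))
    for subject, predicate, obj in iter_simple_triples(fake_file):
        print(subject, predicate, obj)
```

Because the file is read line by line, memory use stays flat regardless of file size, and we could filter on the predicate inside the loop to keep only the properties we care about (such as compound names).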