Open saramsey opened 3 years ago
I see the benefit of this. PubChem identifiers show up often in other sources (recently Reactome, but also DrugBank, SMPDB/PathWhiz, and HMDB).
If we do an ETL of PubChem, we are going to want to be very particular about what we ingest. There is a lot of data in each XML file (for example, Substance_297000001_297500000.xml.gz is 5.5 GB unzipped), and there are a lot of XML files: the "CURRENT-Full" substance directory contains 808 of those gzipped files, and we would want more than just the substance directory, especially the compound directory.
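Given file sizes like that, one way to stay selective is to stream each gzipped XML file and keep only a handful of fields per record instead of loading the whole tree. Below is a minimal sketch using only the Python standard library. The function name `stream_records` and the `record_tag` / `wanted_tags` parameters are hypothetical, and the actual PubChem element names would have to be checked against the real files; nothing here is taken from an existing RTX ETL script.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def stream_records(fileobj, record_tag, wanted_tags):
    """Yield one dict per record, keeping only the text of the wanted tags.

    fileobj: a path or binary file-like object holding gzipped XML.
    record_tag / wanted_tags: element names without namespace; these are
    placeholders (assumptions) to be replaced with real PubChem tag names.
    """
    wanted = set(wanted_tags)
    with gzip.open(fileobj, "rb") as xml_stream:
        context = ET.iterparse(xml_stream, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event != "end" or local_name(elem.tag) != record_tag:
                continue
            record = {}
            for node in elem.iter():
                name = local_name(node.tag)
                if name in wanted and node.text and node.text.strip():
                    # keep the first occurrence of each wanted field
                    record.setdefault(name, node.text.strip())
            yield record
            # discard the processed record so memory use stays roughly flat
            elem.clear()
            root.clear()
```

Because the function is a generator, a caller can iterate over one file at a time and write the filtered records straight to the output, so the unzipped XML never has to sit fully in memory or on disk.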
What about using the RDF files? https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/
PubChem has a download page: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
and PubChem can provide names for some compounds (like CHEMBL.COMPOUND:CHEMBL1197434) that don't have names in ChEMBL. See issue RTXteam/RTX#1296.