akokai / commongroups-archived

(archived) Molecular structure-based classification of chemicals in known hazard groups
MIT License

Bulk synonym retrieval from PubChem #29

Closed: akokai closed this issue 7 years ago

akokai commented 8 years ago

The aim here is to achieve the goals laid out in this comment by @mdedeo.

For synonyms, rather than accessing the full Compound database via FTP (as in here), we only need one very large file that contains nothing but synonym-to-CID mappings: see this README under CID-Synonym-filtered.gz.

The new challenge here is to search for many string matches in an extremely large file, preferably without uncompressing it to disk. If that is solvable, it should be much faster than using the API. Looking into it... I'm sure someone has tackled this kind of thing elsewhere...
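A minimal sketch of one way to do this with only the Python standard library: stream the gzipped file line by line, so nothing is uncompressed to disk, and keep only rows whose CID is in a set of wanted CIDs. It assumes each line has the form `CID<TAB>synonym`; the function name `collect_synonyms` and its parameters are hypothetical, not taken from the project.

```python
import gzip
from collections import defaultdict


def collect_synonyms(gz_path, wanted_cids):
    """Stream a gzipped CID/synonym file and keep rows for the wanted CIDs.

    Assumes one tab-separated "CID<TAB>synonym" pair per line (an assumption
    about the file layout; check the README before relying on it).
    """
    # A set gives constant-time membership tests, which matters when the
    # file has a very large number of lines.
    wanted = {str(cid) for cid in wanted_cids}
    found = defaultdict(list)

    # Text mode decompresses on the fly; only one line is held at a time.
    with gzip.open(gz_path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            cid, sep, synonym = line.rstrip("\n").partition("\t")
            # Skip malformed lines that contain no tab separator.
            if sep and cid in wanted:
                found[cid].append(synonym)

    return dict(found)
```

Matching on CIDs rather than on synonym strings keeps the per-line work to a single set lookup; matching on the synonym text itself would need a different test inside the loop.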

akokai commented 8 years ago

Here's the notebook: https://github.com/akokai/camelid/blob/synonyms/notebooks/pubchem_synonyms.ipynb

akokai commented 7 years ago

Update: solved by importing the text file into SQLite3 and joining it with a table of relevant CIDs from Pharos, then exporting the result and running it through a filtering function from synutils.py.
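For illustration, the join step might look roughly like the sketch below, using Python's built-in `sqlite3` module and an in-memory database. The table names (`synonyms`, `relevant_cids`), the column names, and the function `join_synonyms` are all hypothetical; the actual schema and the filtering function in synutils.py are not shown here.

```python
import sqlite3


def join_synonyms(synonym_rows, relevant_cids):
    """Return (cid, synonym) rows whose CID appears in relevant_cids.

    synonym_rows: iterable of (cid, synonym) tuples.
    relevant_cids: iterable of integer CIDs.
    Table and column names are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE synonyms (cid INTEGER, synonym TEXT)")
        conn.execute("CREATE TABLE relevant_cids (cid INTEGER PRIMARY KEY)")

        conn.executemany("INSERT INTO synonyms VALUES (?, ?)", synonym_rows)
        # OR IGNORE silently drops duplicate CIDs in the input list.
        conn.executemany(
            "INSERT OR IGNORE INTO relevant_cids VALUES (?)",
            ((cid,) for cid in relevant_cids),
        )

        # An index on the join column avoids a full scan per relevant CID.
        conn.execute("CREATE INDEX idx_synonyms_cid ON synonyms (cid)")

        rows = conn.execute(
            "SELECT s.cid, s.synonym "
            "FROM synonyms AS s "
            "JOIN relevant_cids AS r ON r.cid = s.cid "
            "ORDER BY s.cid, s.synonym"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

For the full synonym file, a database on disk would probably be a better fit than `:memory:`, and the rows could be inserted in batches while streaming the source file.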

Reference: https://docs.google.com/document/d/1VPQUZGS5RA7QBDqom0OWk20oM5Zj4wrwEKT91B_BKBQ/edit

To do: make this an easily reproducible process that can be executed with just a few commands on one machine (e.g. a shell script, SQL script, or Python script).

akokai commented 7 years ago

Moved the synonym- and ID-related work into a separate project (here's the repo), since it started to feel like a divergence from the point of this project, which is chemical group definition and enumeration. Closing this issue and the related to-do items here.