akokai closed 7 years ago
Here's the notebook: https://github.com/akokai/camelid/blob/synonyms/notebooks/pubchem_synonyms.ipynb
Update: solved by importing the text file into SQLite3 and performing a join with a table of relevant CIDs from Pharos, then exporting the result and running it through a filtering function from synutils.py.
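For anyone following along, here is a rough sketch of what the SQLite join step could look like from Python. It is not the code from the notebook: the table names `synonyms` and `pharos_cids` and the function `join_synonyms` are made up for illustration, and the filtering function from synutils.py is not reproduced here.

```python
import sqlite3


def join_synonyms(synonym_rows, relevant_cids, db_path=":memory:"):
    """Return (cid, synonym) pairs whose CID appears in relevant_cids.

    Table and column names are hypothetical, chosen for this sketch only.
    The default in-memory database assumes a fresh, empty schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE synonyms (cid INTEGER, synonym TEXT)")
        conn.execute("CREATE TABLE pharos_cids (cid INTEGER PRIMARY KEY)")
        conn.executemany("INSERT INTO synonyms VALUES (?, ?)", synonym_rows)
        # OR IGNORE drops duplicate CIDs instead of raising an error.
        conn.executemany(
            "INSERT OR IGNORE INTO pharos_cids VALUES (?)",
            ((cid,) for cid in relevant_cids),
        )
        conn.commit()
        # Inner join keeps only synonyms for the CIDs of interest.
        return conn.execute(
            "SELECT s.cid, s.synonym "
            "FROM synonyms AS s "
            "JOIN pharos_cids AS p ON p.cid = s.cid "
            "ORDER BY s.cid, s.synonym"
        ).fetchall()
    finally:
        conn.close()
```

With the full synonym file, an index on `synonyms(cid)` would probably be worth adding before running the join.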
Reference: https://docs.google.com/document/d/1VPQUZGS5RA7QBDqom0OWk20oM5Zj4wrwEKT91B_BKBQ/edit
To do: make this an easily reproducible process capable of being executed with just a few commands on one machine (i.e. shell script, SQL script, Python script, etc.).
To achieve goals laid out in this comment by @mdedeo.
For synonyms, rather than accessing the full Compound database via FTP – as in here – we are just accessing one giant file that only contains synonym mappings to CIDs: see this README under CID-Synonym-filtered.gz. The new challenge here is to search for many string matches in an extremely large file, preferably without actually uncompressing it on disk. If solvable, this will be way faster than using the API. Looking into it... I'm sure someone has tackled this kind of thing elsewhere...
Branch: synonyms. Will add to the notebooks directory if I get anything working.
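One possible way to search the compressed file without uncompressing it on disk is to stream it line by line and test each CID against a set. This is only a sketch under assumptions: it assumes each line of CID-Synonym-filtered.gz is a CID and a synonym separated by a tab, and the function name `filter_synonyms` is made up for illustration.

```python
import gzip


def filter_synonyms(gz_path, wanted_cids):
    """Yield (cid, synonym) for each line whose CID is in wanted_cids.

    Assumes one tab-separated "CID<TAB>synonym" pair per line.
    The file is decompressed as a stream; nothing is written to disk.
    """
    # Compare as strings so each line needs no int() conversion to test.
    wanted = {str(cid) for cid in wanted_cids}
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            cid, sep, synonym = line.rstrip("\n").partition("\t")
            # Skip lines without a tab, then do an O(1) set lookup.
            if sep and cid in wanted:
                yield int(cid), synonym
```

Because the lookup is a set membership test, the cost per line does not grow with the number of CIDs being searched for, so a single pass over the file covers all of them.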