MetaCell / asu-olfactory

MIT License

Molecules that have no CID #7

Open rgerkin opened 2 years ago

rgerkin commented 2 years ago

There are valid molecules that have no CID (usually because they have never been synthesized, or at least never characterized in the public literature). In the PubChem data you've worked with, there are files (being normalized into tables) such as CIDSmiles, CIDInchi, and CIDInchiKey. Every CID has a SMILES, an InChI, and an InChIKey (the InChIKey is just a hash of the InChI). SMILES is nice because it is sort of readable, but InChI is truly unique: every unique molecule is guaranteed to have exactly one InChI. An InChI can be converted into SMILES, into images of the molecule, etc. The InChIKey is fixed length and is good for fast one-way lookups.
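Because the InChIKey is fixed length, incoming identifiers can be told apart by shape alone. A minimal sketch, assuming the standard 27-character layout (14 letters, hyphen, 10 letters, hyphen, 1 letter); the function names are hypothetical and not part of this repository:

```python
import re

# Standard InChIKey layout: 14 uppercase letters, a hyphen, 10 uppercase
# letters, a hyphen, and 1 uppercase letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Return True if text has the shape of an InChIKey (shape check only)."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


def classify_identifier(text: str) -> str:
    """Roughly classify an identifier as 'inchi', 'inchikey', or 'other'."""
    text = text.strip()
    if text.startswith("InChI="):
        return "inchi"
    if is_inchikey(text):
        return "inchikey"
    # Could be SMILES or something else; this sketch does not validate it.
    return "other"
```

This only checks the form of the string. It does not confirm that the key corresponds to a real molecule or to any row in the PubChem tables.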

Of the billions of molecules that can exist, only millions have CIDs. So for any prospective mapping/prediction (e.g. what would these previously unsynthesized molecules be predicted to smell like?), the molecules might arrive as a list of InChIs or SMILES. For some of those we can link to a known CID using the CIDInchi table. For others we cannot, so there may be a need for tables that are indexed by InChI. These do not need autocomplete. Even the InChIKey is probably sufficient here. Converting from InChI to anything else is one line in Python (with rdkit).
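To make the linking step concrete, here is a rough sketch of an exact-match lookup keyed by InChIKey, using an in-memory SQLite database. The table name `cid_inchikey`, its columns, and the function names are assumptions for illustration, not the project's actual schema. The two rows are a tiny sample (water and ethanol), and the third key in the usage note is a made-up placeholder standing in for a molecule with no CID.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical InChIKey -> CID table with two sample rows."""
    conn = sqlite3.connect(":memory:")
    # The primary key gives an index suited to whole-key (exact) lookups.
    conn.execute(
        "CREATE TABLE cid_inchikey (inchikey TEXT PRIMARY KEY, cid INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO cid_inchikey (inchikey, cid) VALUES (?, ?)",
        [
            ("XLYOFNOQVPJJNP-UHFFFAOYSA-N", 962),  # water
            ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", 702),  # ethanol
        ],
    )
    return conn


def link_to_cids(conn: sqlite3.Connection, inchikeys: list) -> dict:
    """Map each InChIKey to its CID, or to None when no CID is known."""
    links = {}
    for key in inchikeys:
        row = conn.execute(
            "SELECT cid FROM cid_inchikey WHERE inchikey = ?", (key,)
        ).fetchone()
        links[key] = row[0] if row else None
    return links
```

Usage: `link_to_cids(build_demo_db(), ["XLYOFNOQVPJJNP-UHFFFAOYSA-N", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"])` maps the first key to a CID and the placeholder key to `None`. Molecules that come back as `None` are the ones that would need a table indexed by InChI or InChIKey rather than by CID.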

enicolasgomez commented 2 years ago

CIDInchiKey - fixed length, one-way hash (close to md5, effectively random for the POC; no point in searching substrings, you could search the whole key).

CIDInchi - not fixed length; the string can be converted to a molecule structure (10M rows).

CIDSmiles - the string can be converted to a molecule structure (readable, not unique, not reliable). There are algorithms to produce it from a molecule structure.

None of these files are CSV, and none contain synonyms.

enicolasgomez commented 2 years ago

https://pubchem.ncbi.nlm.nih.gov/#query=6