Open jjacobson95 opened 1 week ago
I'm not sure what you mean by 'both ids'. We keep the PubChem ID and name so we don't have to pull data twice. For every dataset, we can search the drug data first (by name or PubChem ID), and only if we come up empty do we have to query PubChem.
PubChem IDs are NOT unique: there are many for a single structure.
I meant improve_drug_id and pubchem_id.
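To make the lookup order concrete, here is a minimal sketch of "search the drug data first, query PubChem only on a miss". It is not the actual code in the repo: `lookup_drug`, the `chem_name` field, and the `fetch_from_pubchem` callable are assumptions for illustration.

```python
def lookup_drug(query, known_drugs, fetch_from_pubchem):
    """Return a drug record, checking local drug data before PubChem.

    known_drugs: list of dicts with (assumed) keys "chem_name" and "pubchem_id".
    fetch_from_pubchem: hypothetical callable that takes the query string and
    returns a record dict, or None if nothing is found.
    """
    key = str(query).strip().lower()
    # First pass: match on name or PubChem ID in data we already have.
    for record in known_drugs:
        if record["chem_name"].lower() == key or str(record["pubchem_id"]) == key:
            return record
    # Only on a miss do we go out to PubChem, then remember the result.
    record = fetch_from_pubchem(query)
    if record is not None:
        known_drugs.append(record)
    return record
```

Because a fetched record is appended to `known_drugs`, a second dataset that mentions the same drug is served locally.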
We create a single improve_drug_id for a SMILES string.
Does each SMILES string correspond to just one pubchem_id?
No.
Okay thanks, that was the source of my confusion.
Actually just following up a bit more on this as I think I've identified a possible discrepancy between improve_drug_id assignments between datasets.
In pubchem_retrieval.py, we assign improve_drug_id by canonical SMILES.
In GetBeatAML.py, we assign improve_drug_id by isomeric SMILES.
In nci60Drugs.py line 105, both ISOSMILES and SMILES appear to be assigned the same SMILES value.
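A small sketch of why the choice of key matters. This is not the repo's code: `assign_ids` and the `SMI_` prefix are hypothetical, and the two alanine records are only illustrative strings. The point is that two stereoisomers share one ID when keyed by a SMILES without stereo information but get separate IDs when keyed by isomeric SMILES.

```python
def assign_ids(records, smiles_field):
    """Give each distinct string in `smiles_field` its own ID (hypothetical scheme)."""
    ids = {}
    for rec in records:
        # A new ID is created only the first time a given string is seen.
        ids.setdefault(rec[smiles_field], f"SMI_{len(ids) + 1}")
    return {rec["name"]: ids[rec[smiles_field]] for rec in records}


# Illustrative records: same connectivity, different stereochemistry.
records = [
    {"name": "L-alanine", "canonical": "CC(C(=O)O)N", "isomeric": "C[C@@H](C(=O)O)N"},
    {"name": "D-alanine", "canonical": "CC(C(=O)O)N", "isomeric": "C[C@H](C(=O)O)N"},
]

by_canonical = assign_ids(records, "canonical")  # both names share one ID
by_isomeric = assign_ids(records, "isomeric")    # each name gets its own ID
```

If one dataset script keys on the first field and another on the second, the same compound can end up with different improve_drug_id values across datasets.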
Also this is my current understanding - feel free to correct:
A SMILES string may correspond to multiple structures. An ISOSMILES string may correspond to multiple but fewer structures than SMILES. An InChIKey corresponds to a single structure.
A single pubchem ID may map to multiple SMILES or ISOSMILES strings. A single pubchem ID will only map to a single InChIKey.
I think all claims above are correct except for the single pubchem ID-to-InChIKey mapping; I'm pretty sure these are also many-to-one.
(Feel free to close once answered)
With every drug, we have a matching PubChem identifier, why do we have both IDs when one of them is publicly available and searchable already? Is this just an artifact from the IMPROVE project specifications?
Drug File:
Experiments File:
Experiments are linked to improve_sample_ids but this could easily be swapped to pubchem ID.