PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

Why do we create new drug IDs (improve_drug_id)? #248

Open jjacobson95 opened 1 week ago

jjacobson95 commented 1 week ago

(Feel free to close once answered)

Every drug already has a matching PubChem identifier. Why do we keep both IDs when one of them is already publicly available and searchable? Is this just an artifact of the IMPROVE project specifications?

Drug File:

[Screenshot of the drug file, 2024-11-13, 2:02:58 PM]

Experiments File:

[Screenshot of the experiments file, 2024-11-13, 2:03:57 PM]

Experiments are linked to improve_sample_ids but this could easily be swapped to pubchem ID.

sgosline commented 1 week ago

I'm not sure what you mean by 'both ids'. We keep the PubChem ID and name so we don't have to pull data twice. For every dataset, we can search the drug data first (by name or PubChem ID), and only if we come up empty do we have to poll PubChem.
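The lookup order described here could be sketched roughly as follows. This is only an illustration, not the code in the repository: the function name `find_drug`, the `poll_pubchem` callback, the in-memory `known_drugs` list, and the `SMI_1` identifier are all hypothetical, and the column names `chem_name` and `pubchem_id` are assumed.

```python
from typing import Callable, Optional

# Hypothetical stand-in for drug records that have already been collected.
# "SMI_1" is a made-up improve_drug_id used only for illustration.
known_drugs = [
    {"improve_drug_id": "SMI_1", "chem_name": "aspirin", "pubchem_id": "2244"},
]


def find_drug(
    query: str,
    drugs: list,
    poll_pubchem: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return a record for `query`, checking local data before polling PubChem."""
    q = query.strip().lower()
    for rec in drugs:
        # Match on either the drug name or the PubChem ID.
        if q in (rec["chem_name"].lower(), rec["pubchem_id"]):
            return rec
    # Only reached when the local search comes up empty.
    return poll_pubchem(query)
```

Passing the remote lookup in as a callback keeps the sketch free of network calls; the real retrieval code would take its place.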

sgosline commented 1 week ago

PubChem IDs are NOT unique: there can be many for a single structure.

jjacobson95 commented 1 week ago

I meant for improve_drug_id and pubchem_id

sgosline commented 1 week ago

We create a single improve_drug_id for a SMILES string.
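As a rough sketch of what "one improve_drug_id per SMILES string" means in practice, assuming IDs are a prefix plus a running integer (the `SMI_` prefix and the function name `assign_drug_ids` are assumptions for illustration, not taken from the repository):

```python
from typing import Dict, Iterable, Optional


def assign_drug_ids(
    smiles_list: Iterable[str],
    existing: Optional[Dict[str, str]] = None,
    prefix: str = "SMI_",
) -> Dict[str, str]:
    """Map each distinct SMILES string to one ID, reusing IDs already assigned."""
    ids = dict(existing or {})
    # Continue numbering after the largest ID seen so far.
    next_num = 1 + max((int(v[len(prefix):]) for v in ids.values()), default=0)
    for smi in smiles_list:
        if smi not in ids:
            ids[smi] = f"{prefix}{next_num}"
            next_num += 1
    return ids
```

Note that this compares strings exactly, so two differently written SMILES strings for the same molecule would receive different IDs unless they are normalized first.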

jjacobson95 commented 1 week ago

Does each SMILES string correspond to just one pubchem_id?

sgosline commented 1 week ago

No.

jjacobson95 commented 1 week ago

Okay thanks, that was the source of my confusion.

jjacobson95 commented 1 week ago

Actually, just following up a bit more on this, as I think I've identified a possible discrepancy in how improve_drug_id is assigned across datasets.

In pubchem_retrieval.py, we assign improve_drug_id by canonical SMILES.

In GetBeatAML.py, we assign improve_drug_id by isomeric SMILES.

In nci60Drugs.py line 105, both ISOSMILES and SMILES appear to be assigned the same SMILES value.
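To illustrate why the choice of key matters, here is a small sketch using the two enantiomers of alanine. It assumes the canonical SMILES omits stereochemistry (as PubChem's canonical SMILES property does, as far as I understand it) while the isomeric SMILES keeps it. The record names, the field names `canSMILES` and `isoSMILES`, and the helper `group_by` are placeholders for illustration.

```python
from collections import defaultdict

# Two stereoisomers that share one stereo-free SMILES string.
records = [
    {"name": "enantiomer_1", "canSMILES": "CC(N)C(=O)O", "isoSMILES": "C[C@H](N)C(=O)O"},
    {"name": "enantiomer_2", "canSMILES": "CC(N)C(=O)O", "isoSMILES": "C[C@@H](N)C(=O)O"},
]


def group_by(records, field):
    """Group record names by the value of `field` (the key an ID would be built on)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec["name"])
    return dict(groups)


# Keying on canSMILES merges the two records into one group (one ID);
# keying on isoSMILES keeps them apart (two IDs).
by_canonical = group_by(records, "canSMILES")
by_isomeric = group_by(records, "isoSMILES")
```

If one dataset keys on the first field and another on the second, the same pair of compounds would end up with one ID in one dataset and two IDs in the other.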

jjacobson95 commented 1 week ago

Also this is my current understanding - feel free to correct:

A SMILES string may correspond to multiple structures. An ISOSMILES string may correspond to multiple but fewer structures than SMILES. An InChIKey corresponds to a single structure.

A single PubChem ID may map to multiple SMILES or ISOSMILES strings. A single PubChem ID will only map to a single InChIKey.

sgosline commented 5 days ago

I think all the claims above are correct except for the single PubChem ID-to-InChIKey mapping; I'm pretty sure these are also many-to-one.
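One way to settle the mapping question empirically would be to count, in the drug table itself, how many distinct values of one identifier appear for each value of another. A minimal sketch, assuming the table is loaded as a list of dicts with `pubchem_id` and `InChIKey` columns (the function name `multi_mapped` and the example rows are placeholders, not real data):

```python
from collections import defaultdict


def multi_mapped(rows, src, dst):
    """Return {src value: sorted dst values} for src values mapped to more than one dst value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[src]].add(row[dst])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Placeholder rows: two PubChem IDs sharing one InChIKey.
example_rows = [
    {"pubchem_id": "cid_a", "InChIKey": "KEY_1"},
    {"pubchem_id": "cid_b", "InChIKey": "KEY_1"},
    {"pubchem_id": "cid_c", "InChIKey": "KEY_2"},
]

# Empty result: no PubChem ID in these rows has more than one InChIKey.
ids_with_many_keys = multi_mapped(example_rows, "pubchem_id", "InChIKey")
# Non-empty result: KEY_1 is shared by two PubChem IDs (many-to-one).
keys_with_many_ids = multi_mapped(example_rows, "InChIKey", "pubchem_id")
```

Running the same check in both directions on the real drug file would show which of the mappings are actually one-to-one.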