bpmunson / polygon

POLYGON VAE For de novo Polypharmacology
MIT License

Missing Files for ../data/MTOR_ligand_smiles.txt and ../data/MEK1_ligand_smiles.txt #4

Open Feriolet opened 5 days ago

Feriolet commented 5 days ago

Hi! I am trying to generate molecules based on MTOR and MEK1 as described in the GitHub repo, but I am missing the ligand_smiles.txt files for both proteins, which are referenced in scoring_definition.csv. Are both of them supposed to be written by utils/train_ligand_binding_model.py? I have edited the script to write out the SMILES used to generate the Morgan fingerprints and to train the random forest, as shown below:

import logging
import pickle

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def train_ligand_binding_model(target_unit_pro_id, binding_db_path, output_path):
    # original code (unchanged; defines the DataFrame d used below)

    # convert to fingerprint, remembering which SMILES parsed successfully
    fps = []
    values = []
    valid_smiles = []
    for x, y in d[['smiles', 'metric_value']].values:
        try:
            mol = Chem.MolFromSmiles(x)
            if mol is None:  # RDKit returns None for unparsable SMILES
                continue
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        except Exception:
            continue
        valid_smiles.append(x)
        fps.append(fp)
        values.append(y)

    X = np.array(fps)
    y = np.array(values)

    # modified to prevent ValueError: y value has inf
    regr = RandomForestRegressor(n_estimators=1000, random_state=0, n_jobs=-1)
    try:
        regr.fit(X, y)
    except ValueError:
        # nan_to_num maps NaN to 0 and inf to a very large finite value
        y = np.nan_to_num(y.astype(np.float32))
        regr.fit(X, y)

    logging.debug(regr.score(X, y))

    if output_path is None:
        output_path = f'{target_unit_pro_id}_rfr_ligand_model.pt'

    output_smiles_path = f'{target_unit_pro_id}_ligand_smiles.txt'

    with open(output_path, 'wb') as handle:
        pickle.dump(regr, handle)

    # write one SMILES per line, in the same order as the training data
    with open(output_smiles_path, 'w') as smile_handle:
        for smi in valid_smiles:
            smile_handle.write(f'{smi}\n')

    return 1
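
One caveat with the np.nan_to_num fallback above: it turns an infinite label into a very large finite number, which the regressor then treats as a real measurement. A minimal sketch of an alternative, assuming it is acceptable to discard such rows, drops every SMILES whose label is missing, NaN, or infinite before fitting. The helper name drop_non_finite is hypothetical and is not part of the POLYGON repository.

```python
import math


def drop_non_finite(smiles, values):
    """Return (smiles, values) keeping only pairs whose value is a finite number."""
    kept_smiles, kept_values = [], []
    for smi, val in zip(smiles, values):
        try:
            val = float(val)
        except (TypeError, ValueError):
            continue  # label is missing or not numeric
        if math.isfinite(val):
            kept_smiles.append(smi)
            kept_values.append(val)
    return kept_smiles, kept_values


# Toy labels: one usable value, one inf, one NaN.
smiles, values = drop_non_finite(
    ["CCO", "c1ccccc1", "CC(=O)O"],
    [6.2, float("inf"), float("nan")],
)
print(smiles, values)
```

The filtered lists could then be used in place of valid_smiles and values, so the SMILES file and the training labels stay aligned.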

Thanks!

munsonbp commented 4 days ago

Hello Feriolet,

The ligand_smiles.txt files contain SMILES strings of example compounds that the model uses to judge the generated structures against. For dual-targeting compounds against two proteins (e.g. MEK1 and mTOR), those example compounds can be structures that have a "sufficient" inhibitory effect on their respective target. "Sufficient" can be any threshold you like, but we used compounds with a measured IC50 of less than 1 µM.
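
The IC50 threshold described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the column names smiles and ic50_nM are hypothetical, the values are assumed to be in nM (so 1 µM is 1000 nM), and skipping censored entries such as ">10000" is a conservative choice made here, not something the thread specifies.

```python
import csv
import io

IC50_CUTOFF_NM = 1000.0  # 1 µM expressed in nM


def select_active_smiles(rows, cutoff_nm=IC50_CUTOFF_NM):
    """Return SMILES whose measured IC50 is strictly below the cutoff."""
    selected = []
    for row in rows:
        raw = row["ic50_nM"].strip()
        # Skip empty cells and censored values such as ">10000".
        if not raw or raw[0] in "<>":
            continue
        try:
            ic50 = float(raw)
        except ValueError:
            continue  # not a plain number
        if ic50 < cutoff_nm:
            selected.append(row["smiles"])
    return selected


# Toy table standing in for a downloaded ligand export.
example = io.StringIO(
    "smiles,ic50_nM\n"
    "CCO,12.5\n"
    "c1ccccc1,>10000\n"
    "CC(=O)O,2500\n"
    "CCN,999\n"
)
actives = select_active_smiles(csv.DictReader(example))
print(actives)
```

The selected SMILES could then be written one per line to a file such as MTOR_ligand_smiles.txt.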

You may get these example compounds from anywhere you'd like, but we used BindingDB (https://www.bindingdb.org/rwd/bind/index.jsp) and Pharos (https://pharos.nih.gov/).

Here is the relevant section from the methods that might better illustrate how to get the example structures.

Classifying compounds against protein kinase targets. Relevant to Fig. 2b, Supplementary Fig. 1d, 2 (https://www.nature.com/articles/s41467-024-47120-y). We queried the Pharos [27] GraphQL API and the BindingDB [25] for small molecule ligands against a list of 31 kinase proteins previously implicated in human cancer [24]. In concordance with the recommendations of the Pharos web interface, we selected ligands with an IC50 concentration of less than 1 µM against a given protein kinase target. We filtered the list of kinases to those with more than 300 ligands, resulting in the download of a total of 18,982 compounds each targeting one of 24 distinct protein kinases.
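
The "more than 300 ligands" step in the quoted methods can be sketched as a group-and-count filter. This is a rough illustration, not the code used for the paper: the function name is hypothetical, and counting unique SMILES strings per target is an assumption about how ligands were counted.

```python
from collections import defaultdict


def targets_with_enough_ligands(pairs, min_ligands=300):
    """Group (target, smiles) pairs; keep targets with more than min_ligands unique SMILES."""
    by_target = defaultdict(set)
    for target, smiles in pairs:
        by_target[target].add(smiles)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > min_ligands}


# Toy data with a lowered threshold so the effect is visible.
pairs = [("MTOR", "CCO"), ("MTOR", "CCN"), ("MTOR", "CCC"), ("MEK1", "CCO")]
kept = targets_with_enough_ligands(pairs, min_ligands=2)
print(kept)
```

With the toy threshold of 2, only the target with three distinct ligands survives; the default of 300 matches the cutoff in the quoted text.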

Hope this helps!

All the best, Brenton


Feriolet commented 3 days ago

For Pharos, do you mean this website? [image]

For BindingDB, is this the correct website and query? [image]

Then, should I download the SMILES from both sources and combine them to get the desired ../data/MTOR_ligand_smiles.txt and replicate the result in Supplementary Figure 6?
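
If the two downloads are combined, one way to do it is sketched below. This is an assumption about a reasonable workflow, not a confirmed recipe for reproducing the figure: the function name merge_smiles is hypothetical, and duplicates are detected by exact string match only.

```python
def merge_smiles(*sources):
    """Concatenate SMILES lists, dropping blanks and exact duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for line in source:
            smi = line.strip()
            if smi and smi not in seen:
                seen.add(smi)
                merged.append(smi)
    return merged


# Toy lists standing in for the lines of two downloaded files.
pharos = ["CCO", "c1ccccc1", ""]
bindingdb = ["c1ccccc1", "CC(=O)O\n"]
merged = merge_smiles(pharos, bindingdb)
print(merged)
```

Because the same molecule can be written as different SMILES strings, canonicalizing each entry first (for example with RDKit's Chem.MolToSmiles(Chem.MolFromSmiles(smi))) would catch duplicates that exact matching misses.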