bpmunson / polygon

POLYGON VAE For de novo Polypharmacology
MIT License

Error in chemical embedding to design polypharmacology compounds #7

Open amkrishnan28 opened 1 month ago

amkrishnan28 commented 1 month ago

Hi, I'm trying to run the command:

polygon generate \
  --model_path ../data/pretrained_vae_model.pt \
  --scoring_definition scoring_definition.csv \
  --max_len 100 \
  --n_epochs 200 \
  --mols_to_sample 8192 \
  --optimize_batch_size 512 \
  --optimize_n_epochs 2 \
  --keep_top 4096 \
  --opti gauss \
  --outF molecular_generation \
  --device cpu \
  --save_payloads \
  --n_jobs 4 \
  --debug

The first three commands work, but when I try to run this command, I get this error:

SMILES Parse Error: syntax error while parsing: O=C(c1cn(-c2ccccc2)nc1-c1cccc(F)c1)N1CCC(C2)c2cc(O)c(O)c(C)c2C1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'O=C(c1cn(-c2ccccc2)nc1-c1cccc(F)c1)N1CCC(C2)c2cc(O)c(O)c(C)c2C1' for input: 'O=C(c1cn(-c2ccccc2)nc1-c1cccc(F)c1)N1CCC(C2)c2cc(O)c(O)c(C)c2C1'
[11:46:29] SMILES Parse Error: syntax error while parsing: Cc1ccc(CNC23CC4CC(CC(C4)C2)C3)cc1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'Cc1ccc(CNC23CC4CC(CC(C4)C2)C3)cc1' for input: 'Cc1ccc(CNC23CC4CC(CC(C4)C2)C3)cc1'
[11:46:29] SMILES Parse Error: syntax error while parsing: Cc1ccc(CNCCc2ccccc2C)nc1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'Cc1ccc(CNCCc2ccccc2C)nc1' for input: 'Cc1ccc(CNCCc2ccccc2C)nc1'
[11:46:29] Can't kekulize mol. Unkekulized atoms: 17 18 19 20 25
[11:46:29] SMILES Parse Error: syntax error while parsing: COc1ccc2cc(C(=O)N3CCC(C(O)=NCc4cccc(C)c4)CC3)c(O)nc2c1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'COc1ccc2cc(C(=O)N3CCC(C(O)=NCc4cccc(C)c4)CC3)c(O)nc2c1' for input: 'COc1ccc2cc(C(=O)N3CCC(C(O)=NCc4cccc(C)c4)CC3)c(O)nc2c1'
[11:46:29] SMILES Parse Error: syntax error while parsing: CC(=O)Nc1nc(C2CC2)c(C)c(-c2cccnc2Oc2ccc(C)cc2C)[nH]1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'CC(=O)Nc1nc(C2CC2)c(C)c(-c2cccnc2Oc2ccc(C)cc2C)[nH]1' for input: 'CC(=O)Nc1nc(C2CC2)c(C)c(-c2cccnc2Oc2ccc(C)cc2C)[nH]1'
[11:46:29] SMILES Parse Error: syntax error while parsing: Cc1ccc([S]CCN2CCCCC2)cc1S(=O)(=O)N1CCCCCC1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'Cc1ccc([S]CCN2CCCCC2)cc1S(=O)(=O)N1CCCCCC1' for input: 'Cc1ccc([S]CCN2CCCCC2)cc1S(=O)(=O)N1CCCCCC1'
[11:46:29] SMILES Parse Error: syntax error while parsing: CC(=O)Nc1ccccc1N1CCN(c2nc(Nc3ccccc3C)nc(N3CCCCC3)n2)CC1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'CC(=O)Nc1ccccc1N1CCN(c2nc(Nc3ccccc3C)nc(N3CCCCC3)n2)CC1' for input: 'CC(=O)Nc1ccccc1N1CCN(c2nc(Nc3ccccc3C)nc(N3CCCCC3)n2)CC1'
[11:46:29] SMILES Parse Error: unclosed ring for input: 'COc1cc2c(cc1OC)C1=CC(O)=C3C(=O)CCC21'
[11:46:29] SMILES Parse Error: unclosed ring for input: 'Cc1ccc2nc(C)c3c(c2c1)C(=CCSc1nnc(C)s1)CC(C)(C)NC(=N)C2'
[11:46:29] SMILES Parse Error: syntax error while parsing: O=C(NCCc1cccc(C)c1)Nc1ccc2nc(C(F)(F)F)no2c1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'O=C(NCCc1cccc(C)c1)Nc1ccc2nc(C(F)(F)F)no2c1' for input: 'O=C(NCCc1cccc(C)c1)Nc1ccc2nc(C(F)(F)F)no2c1'
[11:46:29] SMILES Parse Error: unclosed ring for input: 'COc1ccc(C2=C(C#N)C(c3c(-c4ccccc5)c[nH]c4c3C(C)CN3C(=O)C3CCC4C3C)CC2(C)C)cc1OC'
[11:46:29] SMILES Parse Error: syntax error while parsing: O=C1c2cc3c(cc2C(=O)N1C(=O)c1cccc(C)c1C)NC1CC3N1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'O=C1c2cc3c(cc2C(=O)N1C(=O)c1cccc(C)c1C)NC1CC3N1' for input: 'O=C1c2cc3c(cc2C(=O)N1C(=O)c1cccc(C)c1C)NC1CC3N1'
[11:46:29] SMILES Parse Error: syntax error while parsing: Bc1cnc2ccc(N3CCNCC3)nn12
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'Bc1cnc2ccc(N3CCNCC3)nn12' for input: 'Bc1cnc2ccc(N3CCNCC3)nn12'
[11:46:29] SMILES Parse Error: syntax error while parsing: O=C(Nc1ccc(C)cc1)OCC1CN(c2ccncc2O)CCO1
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'O=C(Nc1ccc(C)cc1)OCC1CN(c2ccncc2O)CCO1' for input: 'O=C(Nc1ccc(C)cc1)OCC1CN(c2ccncc2O)CCO1'
[11:46:29] SMILES Parse Error: syntax error while parsing: CCOc1ccc(CNC(=O)C2CC2)cc1NC(=O)CC(=O)Nc1cc(C)c(OCC)cc1F
[11:46:29] SMILES Parse Error: Failed parsing SMILES 'CCOc1ccc(CNC(=O)C2CC2)cc1NC(=O)CC(=O)Nc1cc(C)c(OCC)cc1F' for input: 'CCOc1ccc(CNC(=O)C2CC2)cc1NC(=O)CC(=O)Nc1cc(C)c(OCC)cc1F'

Here is my scoring_definition.csv:

category,name,minimize,mu,sigma,file,model,n_top,agg
qed,qed,False,0.67,0.1,,,,
sa,sa,True,3,0.5,,,,
latent_distance,MTOR_vae_dist,True,1.5,0.5,../data/P42345_ligand_smiles_filtered.txt,../data/pretrained_vae_model.pt,20.0,mean
latent_distance,MEK1_vae_dist,True,1.5,0.5,../data/Q02750_ligand_smiles_filtered.txt,../data/pretrained_vae_model.pt,20.0,mean
ligand_efficiency,MTOR_le,False,0.8,0.3,../data/P42345_ligand_binding.pkl,,,
ligand_efficiency,MTEK1_le,False,0.8,0.3,../data/Q02750_ligand_binding.pkl,,,

Could it be something to do with the ../data/pretrained_vae_model.pt file?
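One quick way to rule out a path problem is to check that every file named in the scoring definition actually exists. The sketch below is not part of POLYGON; it only assumes the column names shown in the CSV above (`name`, `file`, `model`) and that relative paths are resolved against the directory the command is run from. The helper name `missing_paths` and its `base_dir` parameter are hypothetical.

```python
import csv
from pathlib import Path


def missing_paths(csv_path, base_dir="."):
    """Return (row name, column, path) for each referenced file that is absent.

    Assumption: relative paths in the CSV are resolved against base_dir,
    i.e. the directory from which the command is launched.
    """
    base = Path(base_dir)
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Only the "file" and "model" columns hold paths in the CSV above.
            for column in ("file", "model"):
                value = (row.get(column) or "").strip()
                if value and not (base / value).exists():
                    missing.append((row["name"], column, value))
    return missing


if __name__ == "__main__":
    for name, column, path in missing_paths("scoring_definition.csv"):
        print(f"{name}: {column} not found: {path}")
```

If this prints nothing, every referenced file was found, so the SMILES messages are unlikely to come from a wrong path.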

yanbosmu commented 1 month ago

pretrained_vae_model.pt should be the model.pt file.

amkrishnan28 commented 1 month ago

I get the same error after replacing pretrained_vae_model.pt with model.pt.

yanbosmu commented 1 month ago

I believe something is wrong with the SMILES recognition module. In my case, I also got a lot of error messages about SMILES; it seems that it doesn't recognize some special atoms. I did eventually get the output, but it was not exactly the same as Supplementary Fig. 6.

amkrishnan28 commented 1 month ago

How did you do this?

bpmunson commented 1 month ago

Hello,

This is the standard output sent to the terminal by the 'polygon generate' command with the "--debug" flag set. This is a very verbose output setting, detailing the process of generation.

The errors you are seeing are produced by the "rdkit.Chem.MolFromSmiles" function, which parses the SMILES strings generated by sampling the latent space of the chemical embedding. This latent space is not perfectly "continuous"; rather, there are some positions in the chemical embedding that cannot be translated into valid SMILES strings. RDKit is reporting every generated structure that could not be parsed into a valid molecule from the decoded SMILES string.
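To get a sense of how many decoded strings were rejected, and for which reason, the debug output can be tallied. This is only a rough sketch: it assumes the terminal output has been saved to a text file (the name `generate.log` is hypothetical), and it matches only the three message types that appear in the log quoted in this thread. The function name `tally_rdkit_messages` is also made up for illustration.

```python
import re
from collections import Counter

# Message fragments copied from the log quoted earlier in this thread.
PATTERNS = {
    "syntax error": re.compile(r"SMILES Parse Error: syntax error"),
    "unclosed ring": re.compile(r"SMILES Parse Error: unclosed ring"),
    "kekulization": re.compile(r"Can't kekulize mol"),
}


def tally_rdkit_messages(log_text):
    """Count how often each kind of RDKit message occurs in the log text."""
    counts = Counter()
    for label, pattern in PATTERNS.items():
        counts[label] = len(pattern.findall(log_text))
    return counts


if __name__ == "__main__":
    # Assumption: the debug output was redirected to this file beforehand.
    with open("generate.log") as handle:
        for label, count in tally_rdkit_messages(handle.read()).items():
            print(f"{label}: {count}")
```

The "Failed parsing SMILES" lines are deliberately not counted, because in the quoted log each one repeats the syntax error reported on the line before it.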

However, I would not expect this to terminate the generation run. Did this command produce output files? Did you inspect them?

If you would rather not see these SMILES parsing errors, try changing "--debug" to "--verbose".

Best, Brenton