chao1224 / MoleculeSTM

Multi-modal Molecule Structure-text Model for Text-based Editing and Retrieval, Nat Mach Intell 2023 (https://www.nature.com/articles/s42256-023-00759-6)
https://chao1224.github.io/MoleculeSTM

How can I reproduce the results from the article? #16

Open yxliu0907 opened 9 months ago

yxliu0907 commented 9 months ago

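Hello! I am using the code and checkpoint you provided for molecule editing, but with the default parameters I cannot seem to reproduce the results given in the paper. Are some of my parameter settings wrong? :)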

yxliu0907 commented 9 months ago

For example, I want to reproduce the result in p1, but running the code gives me p2, and the SMILES I get do not seem to match those shown in p1. The checkpoint I used is 'MoleculeSTM/pretrained_MoleculeSTM/SciBERT-Graph-3e-5-1-1e-4-1-InfoNCE-0.1-32-32', the input text is 'This molecule issoluble in water.', and the input SMILES is FC(F)(F)OC(C=C1)=CC=C1C(C=N2)=CC=C2OC

chao1224 commented 9 months ago

Hi @yxliu0907,

Thank you for raising this question. The results are reproducible if you follow :

yxliu0907 commented 9 months ago

Many thanks for your advice! I followed your lead: I used the checkpoint you mentioned, pretrained_MoleculeSTM/SciBERT-Graph-3e-5-1-1e-4-1-EBM_NCE-0.1-32-32, the text prompt you mentioned, This molecule insoluble in water, and the canonical SMILES from the list, COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1, but I still can't reproduce it. The SMILES I get are either identical to the input or invalid.

chao1224 commented 9 months ago

Hi @yxliu0907,

We just checked the log files, and here are more details.

The result w.r.t. this subfigure is:

l2 lambda: 0.1
Use random noise for init
clip loss: -0.96586 L2 loss: 0.07372
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.05080

If you use our script (with all 200 SMILES as inputs), more complete results for this molecule are:

===== for SMILES COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 =====
Use random noise for init
l2 lambda: 10.0
Use random noise for init
clip loss: -0.13243 L2 loss: 0.09747
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 1.0
Use random noise for init
clip loss: -0.94003 L2 loss: 0.18903
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1 & 4.05630

l2 lambda: 0.1
Use random noise for init
clip loss: -0.96586 L2 loss: 0.07372
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.05080

l2 lambda: 0.01
Use random noise for init
clip loss: -0.94089 L2 loss: 0.02474
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(OC(F)(F)F)(=O)N[C@H](C)CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)CCCCCCC)COCCC)CCCCOCCCCCOCOCOCOCOCOCOCCCCCCCCOCOCOCCCCCCCCCCCCCCCCCCCCC(=COCCCCCCCCCCCC(=COCOC(=OC(=COCCCCCCOCOCCCCCCCCCCCCCCCCCCC(=OC(=OCCCCCCCCCCCCC(=OC(=OC(=OCC(=C(=OCOCOC(=OC(=OC(=OC(=OC(=OCC(=OCCCCCCC(=OC(=OC(=OC(=OC(=OC(=OCC(=OC(=OC)(=OC(=OC)C)C)(=OC(=OC)C)C(=OC)C)C)C)C)COC(=OC)(=OC)(=OC)C(=C(=C)C)CCOC(=OC(=OCC(=OC(=OC)C(=OC(=OC)C(=OC(=OCOCOCOC(=OC)(=C)(=OC)C(=OC(F)(=OC(=OC)(=OC(=OC(=OCOC)(=C)(']
valid mol list: 2

l2 lambda: 0.001
Use random noise for init
clip loss: -0.93510 L2 loss: 0.00295
WARNING:foundation.models.mega_molbart.mega_mol_bart:WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(OC(F)(F)C)[C@H]1C[C@H]1CC[C@H]1CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCOCOCOCOC)CCCCOCOCOCCCOCOCOCOCOCOCCCCCCCCOCOCOCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCCCCCOCOCCCCCCCCOCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCOCOCCCCCCCCCCOCOCOCOCOCOCCCCCCCCC(=COCOC(=CCCCCCCCCCCCCCCCCCCCCCOCCCCCCCCCCCCCCCCOCOCOCOCOCOCCCCCCCCCOCOCCCCCCCCCC(=CCCCCCC(=COC(=COCOCOCOCOCC(=C(=COCCCCCOCOCOCOCCCCCOCOC(=COCOC(=C(=CCCCC(=C(=C(=COCOC(=C(=C(=CCCCCC(=C(=C(FC(=C(F)(F)C(=C(=C(=C(=C(=C(=C']
valid mol list: 2
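
To compare two runs of this kind of log, it can help to pull out only the rows where the output SMILES differs from the input. The following is a minimal sketch, not part of the MoleculeSTM code: the helper names `parse_editing_log` and `changed_rows` are hypothetical, the parser only assumes the `l2 lambda:` and `SMILES & score` line shapes shown above, and reading the number after `&` as a LogP-style score is an assumption.

```python
import re


def parse_editing_log(text):
    """Group 'SMILES & score' rows under the preceding 'l2 lambda:' header."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"l2 lambda:\s*([0-9.eE+-]+)$", line)
        if header:
            current = float(header.group(1))
            results[current] = []
            continue
        row = re.match(r"(\S+)\s*&\s*(-?\d+(?:\.\d+)?)$", line)
        if row and current is not None:
            results[current].append((row.group(1), float(row.group(2))))
    return results


def changed_rows(results, input_smiles):
    """Keep only the rows whose SMILES differs from the input molecule."""
    return {
        lam: [(s, v) for s, v in rows if s != input_smiles]
        for lam, rows in results.items()
    }


# Excerpt copied from the log above (two of the five lambda values).
sample_log = """\
l2 lambda: 1.0
clip loss: -0.94003 L2 loss: 0.18903
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ncc(-c2ccc(OC(F)(F)F)cc2)cc1-c1cnn(C)c1 & 4.05630

l2 lambda: 0.1
clip loss: -0.96586 L2 loss: 0.07372
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1cnc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.05080
"""

input_smiles = "COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1"
input_score = 3.65580  # score logged for the unedited input

parsed = parse_editing_log(sample_log)
for lam, rows in changed_rows(parsed, input_smiles).items():
    for smiles, score in rows:
        direction = "higher" if score > input_score else "lower"
        print(f"l2 lambda {lam}: {smiles} scores {score:.4f} ({direction} than input)")
```

On this excerpt, the edited molecule at l2 lambda 1.0 scores higher than the input and the one at l2 lambda 0.1 scores lower, so which lambda counts as a successful edit depends on the direction the text prompt asks for.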

yxliu0907 commented 8 months ago

I'm really sorry, but I still can't reproduce the same results.😭 Am I using other incorrect parameters? Here are my parameter settings:

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--device", type=int, default=0)
parser.add_argument("--verbose", type=int, default=1)

########## for editing ##########
parser.add_argument("--input_description", type=str, default='This molecule is insoluble in water')
parser.add_argument("--input_description_id", type=int, default=None)
parser.add_argument("--input_SMILES", type=str, default='COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1')
parser.add_argument("--input_SMILES_file", type=str, default=None)
parser.add_argument("--output_model_dir", type=str, default=None)
parser.add_argument("--use_noise_for_init", dest="use_noise_for_init", action="store_true")
parser.add_argument("--no_noise_for_init", dest="use_noise_for_init", action="store_false")
parser.set_defaults(use_noise_for_init=True)
parser.add_argument('--normalize', dest='normalize', action='store_true')
parser.add_argument('--no_normalize', dest='normalize', action='store_false')
parser.set_defaults(normalize=True)

parser.add_argument("--dataspace_path", type=str, default="../data")
parser.add_argument("--SSL_emb_dim", type=int, default=256)
parser.add_argument("--max_seq_len", type=int, default=512)

########## for MoleculeSTM ##########
parser.add_argument("--MoleculeSTM_model_dir", type=str, default="../model_save")
parser.add_argument("--MoleculeSTM_molecule_type", type=str, default="SMILES", choices=["SMILES", "Graph"])

########## for MegaMolBART ##########
parser.add_argument("--MegaMolBART_generation_model_dir", type=str, default="../data/pretrained_MegaMolBART/checkpoints")
parser.add_argument("--vocab_path", type=str, default="../MoleculeSTM/bart_vocab.txt")

########## for MoleculeSTM and generation projection ##########
parser.add_argument("--language_edit_model_dir", type=str, default="../model_save")   

########## for editing ##########
parser.add_argument("--lr_rampup", type=float, default=0.05)
parser.add_argument("--lr", type=float, default=0.1)
parser.add_argument("--epochs", type=int, default=50)
args = parser.parse_args()

and here is my result:

description_list ['This molecule is insoluble in water']
===== for description This molecule is insoluble in water =====
===== for SMILES COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 =====
Use random noise for init
l2 lambda: 10.0
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 10%|█         | 5/50 [00:00<00:01, 41.95it/s]
 22%|██▏       | 11/50 [00:00<00:00, 49.55it/s]
 34%|███▍      | 17/50 [00:00<00:00, 53.23it/s]
 48%|████▊     | 24/50 [00:00<00:00, 56.40it/s]
 62%|██████▏   | 31/50 [00:00<00:00, 58.02it/s]
 76%|███████▌  | 38/50 [00:00<00:00, 58.97it/s]
 90%|█████████ | 45/50 [00:00<00:00, 59.56it/s]
100%|██████████| 50/50 [00:00<00:00, 57.24it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: 0.07312  L2 loss: 0.13768
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 1.0
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.37it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.68it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.80it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.88it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.83it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.78it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.82it/s]
100%|██████████| 50/50 [00:00<00:00, 60.77it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.43997 L2 loss: 0.17291
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2cc(-c3ccc(OC(F)(F)F)cc3)cnc2OC)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2cc(-c3ccc(OC(F)(F)F)cc3)cnc2OC)cn1 & 4.72640

l2 lambda: 0.1
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.72it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.82it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.92it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.91it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.96it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.98it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.99it/s]
100%|██████████| 50/50 [00:00<00:00, 60.93it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.38958 L2 loss: 0.09160
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1']
valid mol list: 3
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580
COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1 & 3.65580

l2 lambda: 0.01
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.73it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.81it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.88it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.97it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.96it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.97it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.86it/s]
100%|██████████| 50/50 [00:00<00:00, 60.87it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.35857 L2 loss: 0.02138
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(F)(F)(F)(F)OCCN[C@H](C)CCOCC)c1ccc(-c2cnn(C)c2)n[nH]1']
valid mol list: 2

l2 lambda: 0.001
Use random noise for init

  0%|          | 0/50 [00:00<?, ?it/s]
 14%|█▍        | 7/50 [00:00<00:00, 60.65it/s]
 28%|██▊       | 14/50 [00:00<00:00, 60.78it/s]
 42%|████▏     | 21/50 [00:00<00:00, 60.89it/s]
 56%|█████▌    | 28/50 [00:00<00:00, 60.91it/s]
 70%|███████   | 35/50 [00:00<00:00, 60.91it/s]
 84%|████████▍ | 42/50 [00:00<00:00, 60.92it/s]
 98%|█████████▊| 49/50 [00:00<00:00, 60.95it/s]
100%|██████████| 50/50 [00:00<00:00, 60.89it/s]
WARNING: MOLECULE VALIDATION AND SANITIZATION CURRENTLY DISABLED
clip loss: -0.35523 L2 loss: 0.00235
SMILES_list: ['COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'COc1ccc(-c2ccc(OC(F)(F)F)cc2)cn1', 'C(F)(F)(F)(F)F']
valid mol list: 2

result_eval_list_one_pair
 [[ True]]

chao1224 commented 8 months ago

Hi @yxliu0907,

It seems that you are using "This molecule is insoluble in water", not "soluble", which might be the issue.

For insoluble, the result with l2-lambda=1 gives the right answer.

yxliu0907 commented 8 months ago

Hello @chao1224! I think the difference is caused by the random seed. Which random seed was used?😶‍🌫️😶‍🌫️😶‍🌫️

chao1224 commented 8 months ago

@yxliu0907

The random seed is 1.
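
The settings posted earlier in this thread define `--seed` with a default of 42, so the value has to be overridden to match. The sketch below shows one way to do that. It is an assumption-laden illustration, not the repository's script: `build_parser` and `set_seed` are hypothetical helpers, only the `--seed`, `--input_description` and `--input_SMILES` flags are taken from the posted settings, and whether the editing script seeds NumPy and PyTorch in this way is not shown in the thread.

```python
import argparse
import random


def build_parser():
    """Hypothetical parser with a few of the flags quoted in this thread."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--input_description", type=str, default=None)
    parser.add_argument("--input_SMILES", type=str, default=None)
    return parser


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Equivalent to passing `--seed 1` on the command line.
args = build_parser().parse_args(["--seed", "1"])

set_seed(args.seed)
first = random.random()
set_seed(args.seed)
second = random.random()
print(args.seed, first == second)
```

Even with the same seed, results can still differ across GPUs or library versions, so matching the seed is necessary but may not be sufficient.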

AmT42 commented 5 months ago

Hey Chao, do you have the right hyperparameters for the different editing tasks, please? I'm asking about this answer you gave: "For insoluble, the result with l2-lambda=1 gives the right answer." I imagine the same kind of tuning is also needed for the binding, multi-objective, and drug-like editing tasks?

chao1224 commented 5 months ago

Hi @AmT42

Yes, I have them in the log files. I will add them ASAP.