HelloJocelynLu / t5chem

Transformer-based model for chemical reactions
MIT License
58 stars 14 forks

Multi-task code #20

Open WangYitian123 opened 1 week ago

WangYitian123 commented 1 week ago

Hi,

I would like to ask whether it is still necessary to use prefixes to differentiate tasks when using a multi-task approach. I found a file named MultiTask.py, and it does not seem to use a prefix.

Thanks a lot.

HelloJocelynLu commented 1 week ago

Hi, short answer: no. I believe assigning task_type will automatically add the corresponding prefix to your input. Reference: https://github.com/HelloJocelynLu/t5chem/blob/8e97bcb7049fbb63206b1c586cb67cd4e23e20f8/t5chem/run_trainer.py#L173.
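
Roughly, the idea is something like this (an illustrative sketch only, not the actual t5chem code; the real logic is at the run_trainer.py line linked above, and the names and prefix strings here are placeholders):

# Sketch of how a --task_type could map to a prefix that gets prepended to each
# input line automatically. Placeholder names; see run_trainer.py for the real code.
TASK_PREFIXES = {
    "product":   "Product:",
    "reactants": "Reactants:",
    "reagents":  "Reagents:",
}

def add_task_prefix(line: str, task_type: str) -> str:
    # Prepend the task prefix so the data files themselves stay prefix-free.
    return TASK_PREFIXES[task_type] + line

# add_task_prefix("CC(=O)O.CCO", "product") -> "Product:CC(=O)O.CCO"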

However, I'm not certain whether MultiTask.py can run smoothly, as it has been archived and is not used in the manuscript. (I did use the script to explore some reviewers' questions though.)

WangYitian123 commented 1 week ago

Hi, thanks a lot. I also noticed that you mixed the training sets for "Product", "Reactants", and "Reagents" in the mixed folder. Does that also count as a multi-task approach?

WangYitian123 commented 1 week ago

And if I want to train my data this way, should I add the prefixes to my data in advance?

HelloJocelynLu commented 1 week ago

> Hi, thanks a lot. I also noticed that you mixed the training sets for "Product", "Reactants", and "Reagents" in the mixed folder. Does that also count as a multi-task approach?

I don't think so. The "mixed" version only combines all seq2seq tasks (forward prediction, reagent prediction, and retrosynthesis). However, the MultiTask approach combines regression and classification tasks, with adjustable weights assigned to each type of task. In my experience, training with seq2seq, classification, and regression together does not outperform seq2seq alone and has low GPU efficiency. Therefore, I suggest using "--task_type mixed" with t5chem/run_trainer.py instead of MultiTask.py.

> And if I want to train my data this way, should I add the prefixes to my data in advance?

For the multi-task one, if you use the MultiTask.py script, no. Reference: https://github.com/HelloJocelynLu/t5chem/blob/8e97bcb7049fbb63206b1c586cb67cd4e23e20f8/t5chem/archived/MultiTask.py#L231

For the mixed approach with my data using "--task_type mixed", no prefix is needed (you can see this in the data files used for mixed training).

However, for your own customized data and tasks, adding a prefix is necessary.
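
If your files are plain SMILES, a tiny helper along these lines can prepend the prefix for you (just a sketch; the file names and the "Product:" string are examples following the USPTO_500_MT convention, not anything t5chem requires):

from pathlib import Path

def prepend_prefix(in_path: str, out_path: str, prefix: str) -> None:
    # Write a copy of in_path with `prefix` prepended to every line.
    lines = Path(in_path).read_text().splitlines()
    Path(out_path).write_text("\n".join(prefix + line for line in lines) + "\n")

# Hypothetical paths: raw, prefix-free sources -> prefixed sources for mixed training.
prepend_prefix("raw/train.source", "mixed/train.source", "Product:")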

WangYitian123 commented 1 week ago

Hi, thanks for your reply. But all my tasks are seq2seq; if I want to train in a multi-task way, what should I do?

HelloJocelynLu commented 1 week ago

> Hi, thanks for your reply. But all my tasks are seq2seq; if I want to train in a multi-task way, what should I do?

Then mixed training is the way to go. Prepare your dataset as shown in USPTO_500_MT:

data/USPTO_500_MT/mixed/
data/USPTO_500_MT/mixed/train.source
data/USPTO_500_MT/mixed/train.target
data/USPTO_500_MT/mixed/val.source
data/USPTO_500_MT/mixed/val.target

Be sure to include the necessary prefixes.

[image]
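
Roughly, each line of train.source is the task prefix plus the input, and the matching line of train.target is the expected output, something like this (illustrative only; copy the exact prefixes and reaction-SMILES layout from the provided mixed files):

train.source:
Product:CC(=O)O.CCO
Reactants:CCOC(=O)C

train.target:
CCOC(=O)C
CC(=O)O.CCO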

Then run the command below (adjust the path and parameters as needed); see also the tutorial: https://yzhang.hpc.nyu.edu/T5Chem/tutorial.html

t5chem train --data_dir path/to/your/data/folder/ --output_dir model/ --task_type mixed --pretrain models/pretrain/simple/ --num_epoch 30

WangYitian123 commented 1 week ago

Hi, [image]

I want to know whether I should add the prefixes to the vocabulary. If I set the task type to mixed, the data indeed contains several different prefixes.

I also checked the contents of t5chem/vocab/simple.pt; it does not contain the special tokens for the prefixes.

The file mol_tokenizers.py defines TASK_PREFIX. Can I just modify that list, since my data are also molecular SMILES?

[image]

WangYitian123 commented 1 week ago

Hi,

And if I want to train a pretrained model myself, can I skip adding a prefix, even though that differs from the fine-tuning data, which has several prefixes for different tasks?

Thank you very much for your valuable suggestions.

HelloJocelynLu commented 1 week ago

> Hi, [image]

> I want to know whether I should add the prefixes to the vocabulary.

Yes and no. Yes, if you want to save some effort preparing the dataset (when the prefix is set there, the code automatically prepends it to your input for you). Otherwise, since you set the task type to mixed, simply including whatever prefixes you need directly in your input dataset is sufficient.

> I also checked the contents of t5chem/vocab/simple.pt; it does not contain the special tokens for the prefixes.

> The file mol_tokenizers.py defines TASK_PREFIX. Can I just modify that list, since my data are also molecular SMILES?

> [image]

Yes, if you want to introduce new prefixes.
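
For example, something along these lines (a sketch; keep whatever entries are already in mol_tokenizers.py and only append your own, and note that "MyNewTask:" is a made-up name):

# In t5chem/mol_tokenizers.py -- keep the existing TASK_PREFIX entries as they are
# (the ones shown here are examples) and append your custom prefixes at the end.
TASK_PREFIX = [
    "Product:",
    "Reactants:",
    "Reagents:",
    "MyNewTask:",   # hypothetical custom prefix
]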

HelloJocelynLu commented 1 week ago

Hi,

> And if I want to train a pretrained model myself, can I skip adding a prefix, even though that differs from the fine-tuning data, which has several prefixes for different tasks?

> Thank you very much for your valuable suggestions.

Yes, sure. I pretrained the model using PubChem molecule SMILES without any prefixes.
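
So the pretraining files just contain molecule SMILES, one per line and without any prefix, e.g. (illustrative lines; check the provided pretraining data for the exact file layout):

CC(=O)Oc1ccccc1C(=O)O
CCO
c1ccncc1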

WangYitian123 commented 1 week ago

> Yes, sure. I pretrained the model using PubChem molecule SMILES without any prefixes.

Yes. I found that your pretrained model was trained as a masked language model (MLM). But my pretraining data is reaction data, in the same format as your fine-tuning data, so I think I should also split it into source and target files.

WangYitian123 commented 1 week ago

Sorry to bother you. When I ran run_trainer.py on my own data with the command "python t5chem/run_trainer.py --data_dir data/sample/fusion/ --output_dir model_fusion_1times/ --task_type mixed --num_epoch 3000", I got an error like the following: [image] I ran it twice, and both times it went wrong right when it started to print {'eval_loss': 0.19883787631988525, 'eval_accuracy': 0.0010408921933085502, 'epoch': 1.9}. I hope you can give me some suggestions. Thanks very much, best wishes.

HelloJocelynLu commented 1 week ago

> Yes, sure. I pretrained the model using PubChem molecule SMILES without any prefixes.

> Yes. I found that your pretrained model was trained as a masked language model (MLM). But my pretraining data is reaction data, in the same format as your fine-tuning data, so I think I should also split it into source and target files.

Yes, I use MLM for pre-training, which is consistent with the original T5 paper.

HelloJocelynLu commented 1 week ago

> Sorry to bother you. When I ran run_trainer.py on my own data with the command "python t5chem/run_trainer.py --data_dir data/sample/fusion/ --output_dir model_fusion_1times/ --task_type mixed --num_epoch 3000", I got an error like the following: [image] I ran it twice, and both times it went wrong right when it started to print {'eval_loss': 0.19883787631988525, 'eval_accuracy': 0.0010408921933085502, 'epoch': 1.9}. I hope you can give me some suggestions. Thanks very much, best wishes.

Have you ever tried --task_type mixed on the data we provided? I've never encountered this error. The accuracy seems incorrect as well (unusually low). My recommendations would be:

  1. Test this task on our datasets (even a subset would work).
  2. Verify your data thoroughly. It looks like the model stops at the same place both times. You can insert a try/except to capture the problematic entry (it appears there may be empty entries causing this issue); see the sketch after this list.
  3. I recommend training from the pretrained model with --pretrain (the PubChem pretrained model, downloadable here). This approach significantly accelerates your training.
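
As a quick pre-check for point 2, something like this (a rough sketch; adjust the paths to your own data) will flag empty or mismatched entries before training:

from pathlib import Path

def check_pair(source_path: str, target_path: str) -> None:
    # Flag empty lines and mismatched line counts in a .source/.target pair.
    src = Path(source_path).read_text().splitlines()
    tgt = Path(target_path).read_text().splitlines()
    if len(src) != len(tgt):
        print(f"Line count mismatch: {len(src)} sources vs {len(tgt)} targets")
    for i, (s, t) in enumerate(zip(src, tgt), start=1):
        if not s.strip() or not t.strip():
            print(f"Empty entry at line {i}: source={s!r}, target={t!r}")

check_pair("data/sample/fusion/train.source", "data/sample/fusion/train.target")
check_pair("data/sample/fusion/val.source", "data/sample/fusion/val.target")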