Closed WangYitian123 closed 3 months ago
Hi, I believe assigning task_type automatically adds the corresponding prefixes to your input, so the short answer is no. Reference: https://github.com/HelloJocelynLu/t5chem/blob/8e97bcb7049fbb63206b1c586cb67cd4e23e20f8/t5chem/run_trainer.py#L173.
However, I'm not certain whether MultiTask.py still runs smoothly, as it has been archived and is not used in the manuscript. (I did use the script to explore some reviewers' questions, though.)
Hi, thanks a lot. I also noticed that you mixed the training sets of "Product", "Reactants", and "Reagents" in the mixed folder. Does that also count as a multitask approach?
And if I want to train my data this way, should I add the prefixes to my data in advance?
> Hi, thanks a lot. I also noticed that you mixed the training sets of "Product", "Reactants", and "Reagents" in the mixed folder. Does that also count as a multitask approach?
I don't think so. The "mixed" version only combines all seq2seq tasks (forward prediction, reagent prediction, and retrosynthesis). The multitask approach, by contrast, also combines regression and classification tasks, with adjustable weights assigned to each type of task. In my experience, training with seq2seq, classification, and regression together does not outperform seq2seq alone and has low GPU efficiency, so I suggest using "--task_type mixed" with t5chem/run_trainer.py instead of MultiTask.py.
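For intuition, the adjustable-weights idea can be sketched as a weighted sum of per-task losses. This is a generic illustration only, not code from MultiTask.py; the task names and numbers are made up:

```python
# Generic sketch of multitask loss weighting: each task type contributes
# its loss scaled by an adjustable weight. Illustrative only; this is
# not taken from MultiTask.py.

def combined_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (e.g. seq2seq/classification/regression)."""
    assert set(task_losses) == set(task_weights)
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

loss = combined_loss(
    {"seq2seq": 1.0, "classification": 0.5},
    {"seq2seq": 1.0, "classification": 2.0},
)
print(loss)  # 2.0
```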
> And if I want to train my data this way, should I add the prefixes to my data in advance?
For the multitask approach: if you use the MultiTask.py script, no. Reference: https://github.com/HelloJocelynLu/t5chem/blob/8e97bcb7049fbb63206b1c586cb67cd4e23e20f8/t5chem/archived/MultiTask.py#L231
For the mixed approach using "--task_type mixed", no prefix is needed with the provided data (you will see this in the data files used for mixed training).
However, for your own customized data and tasks, adding a prefix is necessary.
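To make that concrete, here is a minimal sketch of what a prefixed example pair could look like for a custom seq2seq task. The prefix string "Product:" is just an example, and only the source line carries the prefix:

```python
# Minimal sketch: for custom data, prepend the task prefix to each source
# line; target lines stay plain. "Product:" here is just an example prefix.

def make_example(prefix, source_smiles, target_smiles):
    """Return a (source, target) line pair; only the source is prefixed."""
    return f"{prefix}{source_smiles}", target_smiles

src, tgt = make_example("Product:", "CCO.CC(=O)O", "CC(=O)OCC")
print(src)  # Product:CCO.CC(=O)O
print(tgt)  # CC(=O)OCC
```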
Hi, thanks for your reply. But all my tasks are seq2seq; if I want to train in a multitask way, what should I do?
> Hi, thanks for your reply. But all my tasks are seq2seq; if I want to train in a multitask way, what should I do?
Then mixed training is the way to go. Prepare your dataset as shown in USPTO_500_MT:
```
x data/USPTO_500_MT/mixed/
x data/USPTO_500_MT/mixed/val.target
x data/USPTO_500_MT/mixed/train.source
x data/USPTO_500_MT/mixed/train.target
x data/USPTO_500_MT/mixed/val.source
```
Be sure to include the necessary prefixes.
Then run (adjust the path and parameters as needed; see https://yzhang.hpc.nyu.edu/T5Chem/tutorial.html):

```shell
t5chem train --data_dir path/to/your/data/folder/ --output_dir model/ --task_type mixed --pretrain models/pretrain/simple/ --num_epoch 30
```
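If your per-task data live in separate folders, one way to assemble such a mixed folder is to concatenate the per-task files while prepending a prefix to each source line. A sketch under assumed folder layout and prefix strings (none of this is taken from the repo):

```python
import os

# Sketch (not from the repo): assemble a mixed split from several per-task
# seq2seq datasets, prepending a task prefix to each source line. The
# folder layout and prefix strings here are assumptions for illustration.

def read_lines(path):
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def build_mixed(task_dirs, split, out_dir):
    """task_dirs maps a prefix like 'Product:' to a folder holding
    {split}.source / {split}.target files for that task."""
    os.makedirs(out_dir, exist_ok=True)
    sources, targets = [], []
    for prefix, task_dir in task_dirs.items():
        srcs = read_lines(os.path.join(task_dir, f"{split}.source"))
        tgts = read_lines(os.path.join(task_dir, f"{split}.target"))
        assert len(srcs) == len(tgts), f"line-count mismatch in {task_dir}"
        sources.extend(prefix + s for s in srcs)
        targets.extend(tgts)
    with open(os.path.join(out_dir, f"{split}.source"), "w") as f:
        f.writelines(s + "\n" for s in sources)
    with open(os.path.join(out_dir, f"{split}.target"), "w") as f:
        f.writelines(t + "\n" for t in targets)

# Hypothetical usage (paths are placeholders):
# build_mixed({"Product:": "data/forward", "Reactants:": "data/retro",
#              "Reagents:": "data/reagents"}, "train", "data/mixed")
```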
Hi,
I want to know if I should add the prefixes to the vocabulary. If I set task_type to mixed, the data indeed contain several different prefixes.
I also checked the content of t5chem/vocab/simple.pt; it does not contain the special tokens for the prefixes.
The file mol_tokenizers.py defines TASK_PREFIX. Can I just modify that list, because my data are also SMILES of molecules?
Hi,
And if I want to train a pretrained model myself, can I skip adding a prefix, even though that differs from the fine-tuning data, which has several prefixes for different tasks?
Thank you very much for your valuable suggestions.
> Hi,
> I want to know if I should add the prefixes to the vocabulary. If I set task_type to mixed, the data indeed contain several different prefixes.

Yes and no. Yes, if you want to save some effort on preparing the dataset (by setting it there, the code automatically prepends the prefix for you). Otherwise, simply including the appropriate prefixes in your input dataset is sufficient.
> I also checked the content of t5chem/vocab/simple.pt; it does not contain the special tokens for the prefixes.
> The file mol_tokenizers.py defines TASK_PREFIX. Can I just modify that list, because my data are also SMILES of molecules?
Yes, if you want to introduce new prefixes.
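For illustration, assuming TASK_PREFIX is a plain Python list of prefix strings (the entries shown below are assumptions, not a verbatim copy of mol_tokenizers.py), adding a custom prefix could look like:

```python
# Hypothetical sketch: extend the assumed TASK_PREFIX list with a custom
# prefix. The existing entries are illustrative, not copied from the repo.
TASK_PREFIX = ["Product:", "Reactants:", "Reagents:", "Yield:", "Fill-Mask:", "Classification:"]
TASK_PREFIX.append("MyTask:")  # hypothetical prefix for your own task
print(TASK_PREFIX[-1])  # MyTask:
```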
> Hi,
> And if I want to train a pretrained model myself, can I skip adding a prefix, even though that differs from the fine-tuning data, which has several prefixes for different tasks?
> Thank you very much for your valuable suggestions.
Yes, sure. I pretrained the model on PubChem molecule SMILES without any prefixes.
> Yes, sure. I pretrained the model on PubChem molecule SMILES without any prefixes.
Yes, I see that your pretrained model follows the masked language model (MLM) approach. But my pretraining data are also reaction data, in the same format as your fine-tuning data, so I think I should also split them into source and target files.
Sorry to bother you. When I ran run_trainer.py on my own data with the command "python t5chem/run_trainer.py --data_dir data/sample/fusion/ --output_dir model_fusion_1times/ --task_type mixed --num_epoch 3000", it raised an error. I ran it twice, and both times it went wrong right when it began to print {'eval_loss': 0.19883787631988525, 'eval_accuracy': 0.0010408921933085502, 'epoch': 1.9}. I hope you can give me some suggestions. Thanks very much. Best wishes.
> Yes, sure. I pretrained the model on PubChem molecule SMILES without any prefixes.
> Yes, I see that your pretrained model follows the masked language model (MLM) approach. But my pretraining data are also reaction data, in the same format as your fine-tuning data, so I think I should also split them into source and target files.
Yes, I use MLM for pre-training, which is consistent with the original T5 paper.
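If your pretraining corpus consists of reaction SMILES (assumed here to be in "reactants>>product" form), splitting each reaction into paired source/target files might look like:

```python
# Sketch, assuming reactions written as "reactants>>product": split each
# line into a source (reactants) and a target (product) to produce paired
# train.source / train.target files.

def split_reaction(rxn):
    """Split 'reactants>>product' into (reactants, product)."""
    reactants, _, product = rxn.partition(">>")
    return reactants, product

def write_pairs(rxns, source_path, target_path):
    with open(source_path, "w") as fs, open(target_path, "w") as ft:
        for rxn in rxns:
            src, tgt = split_reaction(rxn)
            fs.write(src + "\n")
            ft.write(tgt + "\n")

print(split_reaction("CCO.CC(=O)O>>CC(=O)OCC"))  # ('CCO.CC(=O)O', 'CC(=O)OCC')
```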
> Sorry to bother you. When I ran run_trainer.py on my own data with the command "python t5chem/run_trainer.py --data_dir data/sample/fusion/ --output_dir model_fusion_1times/ --task_type mixed --num_epoch 3000", it raised an error. I ran it twice, and both times it went wrong right when it began to print {'eval_loss': 0.19883787631988525, 'eval_accuracy': 0.0010408921933085502, 'epoch': 1.9}. I hope you can give me some suggestions. Thanks very much. Best wishes.
Have you ever tried --task_type mixed on the data we provided? I've never encountered this error, and the accuracy seems incorrect as well (unusually low). My recommendations would be:
Hi any updates on this?
Closed due to inactivity. Please reopen if necessary.
Hi,
I would like to ask whether it is still necessary to use prefixes to differentiate tasks when using a multi-task approach. I found a file named MultiTask.py; it does not seem to use prefixes.
Thanks a lot.