Shark-NLP / DiffuSeq

[ICLR'23] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
MIT License
711 stars 87 forks source link
diffusion-models sequence-to-sequence text-generation

DiffuSeq

Official Codebase for __DiffuSeq__: Sequence to Sequence Text Generation With Diffusion Models and __DiffuSeq-v2__: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models.

The diffusion process of our conditional diffusion language model DiffuSeq.

The diffusion process of accelerated DiffuSeq.

Highlights

Our study addresses promising achievements by such a new sequence-to-sequence learning paradigm.

Update: Our enhanced version effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster, rendering it significantly closer to practical application.

Setup:

The code is based on PyTorch and HuggingFace transformers.

pip install -r requirements.txt 

Datasets

Prepare datasets and put them under the datasets folder. Take datasets/CommonsenseConversation/train.jsonl as an example. We use four datasets in our paper.

Task Datasets Training Samples Source Used in DiffuSeq
Open-domain Dialogue Commonsense Conversation 3382k CCM download
Question Generation Quasar-T 117k OpenQA download
Text Simplification Wiki-alignment 677k Wiki-auto download
Paraphrase QQP 144k Kaggle download

DiffuSeq Training

cd scripts
bash train.sh

Arguments explanation:

It will take 2 more days to train a DiffuSeq model on 4 NVIDIA A100 80G GPUs for QG and QQP, and the training steps should be increased accordingly along with the size of the training set. To reproduce the results of Table 1 in our paper, we suggest the following configuration for each dataset when training.

Update:

Additional argument:

It only take around 11 hours to train a model on 2 NVIDIA A100 80G GPUs for QQP.

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 50000 --save_interval 10000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --dataset qqp --data_dir {datasets/QQP} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qqp

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 40000 --save_interval 2000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset qg --data_dir {datasets/QG} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qg

python -m torch.distributed.launch --nproc_per_node=7 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 140000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir {datasets/Conversation} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes dialogue

python -m torch.distributed.launch --nproc_per_node=8 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 80000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir {datasets/TS} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes ts

Empirically, larger batchsize (larger microbatch here) can achieve higher BLEU score (without MBR). If you want to sync training loss to wandb, please customize your wandb setting in train.py (add your own API KEY).

DiffuSeq Decoding

You need to modify the path to model_dir, which is obtained in the training stage.

cd scripts
bash run_decode.sh

To reproduce the results of Table 1 in our paper, we suggest the size of MBR candidate set to be 10 (run 10 times using different seeds). Empirically, larger size can achieve higher BLEU score. For diversity metrics, the size of MBR candidate set is 3 when computing.

Speed-up Decoding

We customize the implementation of DPM-Solver++ to DiffuSeq to accelerate its sampling speed.

cd scripts
bash run_decode_solver.sh

Evaluation & MBR

You need to specify the folder of decoded texts. This folder should contain the decoded files from the same model but sampling with different random seeds. If mbr is not attached, we will compute the diversity score from the files in the folder, otherwise we will do MBR decoding:

cd scripts
python eval_seq2seq.py --folder ../{your-path-to-outputs} --mbr

Note: if you want to use this evaluation script for output files from other models, please make sure the same line from these output files refers to the same piece of data. Otherwise the diversity score could be incorrect.

Update

Welcome to discuss if you have any questions.

Citation

Please add the citation if our paper or code helps you.

@inproceedings{gong2022diffuseq,
  author = {Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
  booktitle = {International Conference on Learning Representations, ICLR},
  title = {{DiffuSeq}: Sequence to Sequence Text Generation with Diffusion Models},
  year = 2023
}

@article{gong2023diffuseqv2,
  title={DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models},
  author={Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2310.05793},
  year={2023}
}

DiffuSeq poster for ICLR 2023.