Paper accepted at NAACL-HLT 2021:
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization, by Tiezheng Yu*, Zihan Liu*, Pascale Fung.
State-of-the-art abstractive summarization models generally rely on extensive labeled data, which lowers their generalization ability on domains where such data are not available. In this paper, we present a study of domain adaptation for the abstractive summarization task across six diverse target domains in a low-resource setting. Specifically, we investigate the second phase of pre-training on large-scale generative models under three different settings: 1) source domain pre-training; 2) domain-adaptive pre-training; and 3) task-adaptive pre-training. Experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. Moreover, we find that continuing pre-training could lead to the pre-trained model's catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. Furthermore, results illustrate that a huge gap still exists between the low-resource and high-resource settings, which highlights the need for more advanced domain adaptation methods for the abstractive summarization task.
We release the AdaptSum dataset, which contains the summarization datasets across six target domains as well as the corpora for SDPT, DAPT and TAPT. You can download AdaptSum from Here.
Create a dataset folder at the root of this project and put the downloaded data into the dataset folder.

Set up the environment:
conda create -n adaptsum python=3.6
conda activate adaptsum
conda install pytorch cudatoolkit=11.0 -c pytorch
pip install -r requirements.txt
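A quick sanity check (run from a Python shell) confirms that the environment was set up correctly; this is optional and not part of the repo:

import torch

print(torch.__version__)           # PyTorch version installed by conda
print(torch.cuda.is_available())   # should be True if the CUDA toolkit is usable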
Create a logs folder at the root of this project.

Second phase of pre-training

SDPT (source domain pre-training): we take cnn_dm as an example.
Create a SDPT_save folder at the root of this project.
Prepare dataloader:
python ./src/preprocessing.py -data_path=dataset/ \
-data_name=SDPT-cnn_dm \
-mode=train \
-batch_size=4
Run ./scripts/sdpt_pretraining.sh. You can add -recadam and -logging_Euclid_dist to use RecAdam (a sketch of the idea is given right below).
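RecAdam is the "learning method with less forgetting" referred to in the abstract: during second-phase pre-training it anneals between the target objective and a quadratic penalty that pulls the weights back toward the original pre-trained weights (the Euclidean distance that -logging_Euclid_dist presumably tracks). The snippet below is only a sketch of that idea written as an explicit loss term, with illustrative names and hyper-parameters; the actual RecAdam method folds the penalty into the Adam update, and this is not the repo's implementation.

# Sketch of the RecAdam "recall and learn" idea, NOT the repo's implementation:
# anneal between the task loss and a quadratic penalty toward the pre-trained weights.
import math

def recadam_style_loss(model, pretrained_params, task_loss, step,
                       gamma=0.01, k=0.1, t0=1000):
    # squared Euclidean distance between current and pre-trained parameters
    dist = sum(((p - p0) ** 2).sum()
               for p, p0 in zip(model.parameters(), pretrained_params))
    lam = 1.0 / (1.0 + math.exp(-k * (step - t0)))  # annealing coefficient in (0, 1)
    return lam * task_loss + (1.0 - lam) * 0.5 * gamma * dist

# Usage sketch:
#   pretrained_params = [p.detach().clone() for p in model.parameters()]
#   loss = recadam_style_loss(model, pretrained_params, task_loss, global_step)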
DAPT (domain-adaptive pre-training): we take the debate domain as an example.
Create a DAPT_save folder at the root of this project.
Run ./scripts/dapt_pretraining.sh. You can add -recadam and -logging_Euclid_dist to use RecAdam.
TAPT (task-adaptive pre-training): we take the debate domain as an example.
Create a TAPT_save folder at the root of this project.
Run ./scripts/tapt_pretraining.sh. You can add -recadam and -logging_Euclid_dist to use RecAdam.
Fine-tuning: we take the debate domain as an example.
Make a folder named debate at logs.
Prepare dataloader:
python ./src/preprocessing.py -data_path=dataset/ \
-data_name=debate \
-mode=train \
-batch_size=4
python ./src/preprocessing.py -data_path=dataset/ \
-data_name=debate \
-mode=valid \
-batch_size=4
python ./src/preprocessing.py -data_path=dataset/ \
-data_name=debate \
-mode=test \
-batch_size=4
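The three commands above differ only in -mode. If convenient, a small wrapper along the following lines (illustrative, not part of the repo) builds all three splits in one go:

import subprocess

for mode in ("train", "valid", "test"):
    # same preprocessing call as above, once per data split
    subprocess.run(
        ["python", "./src/preprocessing.py",
         "-data_path=dataset/",
         "-data_name=debate",
         f"-mode={mode}",
         "-batch_size=4"],
        check=True,  # stop immediately if any split fails to preprocess
    )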
Install the pyrouge package (you can skip this if you have already installed pyrouge):
git clone https://github.com/bheinzerling/pyrouge
cd pyrouge
pip install -e .
git clone https://github.com/andersjo/pyrouge.git rouge
pyrouge_set_rouge_path ~/pyrouge/rouge/tools/ROUGE-1.5.5/
sudo apt-get install libxml-parser-perl
cd rouge/tools/ROUGE-1.5.5/data
rm WordNet-2.0.exc.db
./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db
python -m pyrouge.test
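Once python -m pyrouge.test passes, pyrouge can also be used directly to sanity-check ROUGE scoring on a folder of summaries. The directory layout and file-name patterns below are assumptions for illustration only; the actual evaluation in this repo is done with ./src/cal_roug.py (see the evaluation step below):

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "logs/debate/inference/candidates"  # generated summaries, one file each (assumed layout)
r.model_dir = "logs/debate/inference/references"   # gold summaries, one file each (assumed layout)
r.system_filename_pattern = r"cand.(\d+).txt"
r.model_filename_pattern = "ref.#ID#.txt"

output = r.convert_and_evaluate()  # runs ROUGE-1.5.5 under the hood
print(r.output_to_dict(output))    # rouge_1_f_score, rouge_2_f_score, rouge_l_f_score, ...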
Run fine-tuning:
Without second-phase pre-training:
python ./src/run.py -visible_gpu=0 \
-data_name=debate \
-save_interval=100 \
-start_to_save_iter=3000
With an SDPT checkpoint:
python ./src/run.py -visible_gpu=0 \
-data_name=debate \
-save_interval=100 \
-start_to_save_iter=3000 \
-pre_trained_src \
-train_from=YOUR_SAVED_CHECKPOINTS
With a DAPT or TAPT checkpoint:
python ./src/run.py -visible_gpu=0 \
-data_name=debate \
-save_interval=100 \
-start_to_save_iter=3000 \
-pre_trained_lm=YOUR_SAVED_CHECKPOINTS
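Here YOUR_SAVED_CHECKPOINTS points to a checkpoint produced by SDPT, DAPT or TAPT. Conceptually, fine-tuning warm-starts the summarizer from that checkpoint, roughly as in the sketch below (illustrative only, assuming a BART-style model as in the paper and a plain PyTorch checkpoint; the path and checkpoint layout are assumptions, not the repo's code):

import torch
from transformers import BartForConditionalGeneration

# Start from the generic pre-trained model, then overwrite its weights with
# the second-phase checkpoint.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
state = torch.load("SDPT_save/checkpoint.pt", map_location="cpu")
state_dict = state["model"] if isinstance(state, dict) and "model" in state else state
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates naming differences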
Evaluate the performance
1) Make a folder named inference at logs.
2) You can do inference by
python ./src/inference.py -visible_gpu=0 -train_from=YOUR_SAVED_CHECKPOINT
3) You can calculate rouge scores by
python ./src/cal_roug.py -c=CANDIDATE_FILE -r=REFERENCE_FILE -p=NUMBER_OF_PROCESS
If you use our benchmark or the code in this repo, please cite our paper.
@article{Yu2021AdaptSum,
  title={AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization},
  author={Tiezheng Yu and Zihan Liu and Pascale Fung},
  journal={arXiv preprint arXiv:2103.11332},
  year={2021}
}
Also, please consider citing all the individual datasets in your paper.
Dialog domain:
@inproceedings{gliwa2019samsum,
  title={SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization},
  author={Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander},
  booktitle={Proceedings of the 2nd Workshop on New Frontiers in Summarization},
  pages={70--79},
  year={2019}
}
Email domain:
@inproceedings{zhang2019email,
  title={This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation},
  author={Zhang, Rui and Tetreault, Joel},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  pages={446--456},
  year={2019}
}
Movie and debate domains:
@inproceedings{wang2016neural,
  title={Neural Network-Based Abstract Generation for Opinions and Arguments},
  author={Wang, Lu and Ling, Wang},
  booktitle={Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={47--57},
  year={2016}
}
Social media domain:
@inproceedings{kim2019abstractive,
  title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
  author={Kim, Byeongchang and Kim, Hyunwoo and Kim, Gunhee},
  booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  pages={2519--2531},
  year={2019}
}
Science domain:
@inproceedings{yasunaga2019scisummnet,
  title={ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks},
  author={Yasunaga, Michihiro and Kasai, Jungo and Zhang, Rui and Fabbri, Alexander R and Li, Irene and Friedman, Dan and Radev, Dragomir R},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={33},
  pages={7386--7393},
  year={2019}
}