TysonYu / AdaptSum

The code repository for the NAACL 2021 paper "AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization".
Creative Commons Attribution 4.0 International

Number of Finetuning Steps for TAPT/DAPT/SDPT #1

Open yashgupta-7 opened 3 years ago

yashgupta-7 commented 3 years ago

Hey! Great work, and congratulations on the NAACL acceptance!

Can you specify how you determined the number of intermediate pretraining steps for TAPT/DAPT/SDPT?

Thanks in advance.

TysonYu commented 3 years ago

Hi, for TAPT and DAPT, we pre-trained the model for 10 epochs. For SDPT, we pre-trained the model on the CNN dataset for 780,000 steps.

yashgupta-7 commented 3 years ago

Thank you for the response! I tried DAPT training on the DAPT dataset for debate, and a single epoch takes around 8 hrs (batch size 4, around 400,000 lines). The email dataset seems to be much larger (about 50x). Are my numbers/estimates right, or am I missing something here? Thanks.

TysonYu commented 3 years ago

Yes, DAPT takes a long time to train, but not as long as you describe. Have you tried gradient accumulation? We accumulate gradients over every 10 steps, so it's faster.
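For readers unfamiliar with gradient accumulation, here is a minimal sketch of the idea. It is a toy pure-Python example, not the AdaptSum training loop: the one-parameter model, the data, and the names `grad_mse` and `accumulated_step` are all hypothetical. It only illustrates that accumulating scaled gradients over 10 equally sized micro-batches of 4 gives the same update as one batch of 40.

```python
# Toy illustration of gradient accumulation (NOT the AdaptSum training code).
# Model: y = w * x with a mean-squared-error loss. All names are hypothetical.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    """Apply ONE parameter update after accumulating over all micro-batches."""
    accum = 0.0
    for batch in micro_batches:
        # Divide by the number of micro-batches so the sum becomes an average.
        accum += grad_mse(w, batch) / len(micro_batches)
    return w - lr * accum

# 40 samples generated from a true weight of 3.
data = [(float(i), 3.0 * i) for i in range(1, 41)]
# 10 micro-batches of size 4, mirroring "batch size 4, accumulate 10 steps".
micro_batches = [data[i:i + 4] for i in range(0, len(data), 4)]

w_accum = accumulated_step(0.0, micro_batches, lr=1e-3)
w_large = 0.0 - 1e-3 * grad_mse(0.0, data)  # one update on a single batch of 40

print(abs(w_accum - w_large) < 1e-9)
```

Note that accumulation reduces the number of optimizer updates, not the number of forward and backward passes, so the per-batch cost stays roughly the same.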

yashgupta-7 commented 3 years ago

Yes, I am doing that, since the code does it by default (I can see the default value is 10). This is what it shows for the debate dataset when I start DAPT training: almost 9 hours per epoch for 96,000 batches. [image] Is this similar for you? I am using a GTX-1080 Ti.
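As a rough sanity check on these figures, the arithmetic can be sketched as below. The inputs are the approximate numbers quoted in this thread (about 9 hours per epoch, 96,000 batches, accumulation over 10 steps, 10 epochs); the variable names are only illustrative, and the results are estimates rather than measurements.

```python
# Back-of-the-envelope estimate from the approximate figures in this thread.
batches_per_epoch = 96_000   # batches shown for the debate dataset
hours_per_epoch = 9          # "almost 9 hours", so treat this as an upper bound
accumulation_steps = 10      # default value reported in the thread
epochs = 10                  # DAPT schedule mentioned by the maintainer

seconds_per_batch = hours_per_epoch * 3600 / batches_per_epoch
optimizer_steps_per_epoch = batches_per_epoch // accumulation_steps
total_hours = hours_per_epoch * epochs

print(f"{seconds_per_batch:.2f} s per batch")
print(f"{optimizer_steps_per_epoch} optimizer updates per epoch")
print(f"about {total_hours} hours for {epochs} epochs")
```

The 96,000 batches are also roughly consistent with the earlier estimate of around 400,000 lines at batch size 4 (about 100,000 batches).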

TysonYu commented 3 years ago

Yes, I am also using a GTX-1080 Ti, and I don't think there is a big gap between my training time and yours. You don't need to finish all 10 epochs before testing. Maybe you can test the checkpoints from the first several epochs.

Btw, using RecAdam in DAPT is not stable; it sometimes causes gradient explosion. So I suggest not using RecAdam and testing the checkpoints from the first several epochs.