Open yashgupta-7 opened 3 years ago
Hi, for TAPT and DAPT, we pre-trained the model for 10 epochs. For SDPT, we pre-trained the model on the CNN dataset for 780,000 steps.
Thank you for the response! I tried DAPT training with the DAPT dataset for debate, and it takes around 8 hours for just a single epoch (batch size 4, around 400,000 lines). The email dataset seems to be much larger (about 50x more). Are my numbers/estimates right, or am I missing something here? Thanks.
Yes, DAPT takes a long time to train, but not as long as you said. Did you try gradient accumulation? We accumulate gradients for 10 steps, so training is faster.
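For anyone reading along, the idea being discussed can be sketched roughly like this: instead of applying an optimizer update after every mini-batch, gradients are summed over N mini-batches and applied once, emulating a larger effective batch size. This is a minimal toy sketch in plain Python (a scalar "model" minimizing squared error); the function name and structure are illustrative assumptions, not the repository's actual training code.

```python
ACCUMULATION_STEPS = 10  # matches the default of 10 mentioned in the thread

def train(param, batches, lr=0.1, accumulation_steps=ACCUMULATION_STEPS):
    """Toy gradient-accumulation loop: minimize (param - target)^2.

    Gradients are summed over `accumulation_steps` mini-batches,
    then one averaged optimizer step is applied.
    """
    grad_sum = 0.0
    for step, batch in enumerate(batches, start=1):
        # gradient of the mean squared error over this mini-batch
        grad = sum(2 * (param - target) for target in batch) / len(batch)
        grad_sum += grad
        if step % accumulation_steps == 0:
            param -= lr * (grad_sum / accumulation_steps)  # one optimizer step
            grad_sum = 0.0
    return param
```

With 10 identical mini-batches targeting 1.0 and a starting parameter of 0.0, exactly one accumulated step is taken, moving the parameter by `lr * 2 = 0.2`.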
Yes, I am doing that, since the code does this by default (I can see the default value is 10). This is what it shows for the debate dataset when I start DAPT training: almost 9 hours per epoch for 96,000 batches. Is this similar for you? I am using a GTX 1080 Ti.
Yes, I am also using a GTX 1080 Ti, and I think there is no huge gap between my training time and yours. It's not necessary to finish all 10 epochs before testing. Maybe you can test the checkpoints from the first few epochs.
Btw, RecAdam in DAPT is not stable; sometimes it causes gradient explosion. So I suggest not using RecAdam, and testing the checkpoints from the first few epochs.
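As an aside for readers hitting the gradient-explosion issue mentioned above: one common mitigation (not what the thread itself recommends, which is simply dropping RecAdam) is clipping gradients by their global norm. A minimal plain-Python sketch of that idea, with an assumed function name:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient values so their L2 norm is at most max_norm.

    If the norm is already within the limit, gradients pass through unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```

For example, gradients `[3.0, 4.0]` have norm 5, so clipping to a max norm of 1 scales them to `[0.6, 0.8]`.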
Hey! great work and congratulations on NAACL acceptance!
Can you specify how you determine the number of intermediate pre-training steps for TAPT/DAPT/SDPT?
Thanks in advance.