Open · sshleifer opened this issue 4 years ago
Thank you for your interest in our work!
While we recommend the following, your task-optimal hyperparameters might vary:
Please let me know if you need anything else!
In your MT experiment, what do you use for weights? Here is my loss fn (modified from https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py#L151):
lm_logits = outputs[0]  # shape (bs, seq_len, vocab_size)
assert lm_logits.shape[-1] == self.model.config.vocab_size
# NB: NLLLoss expects log-probabilities, but lm_logits are raw logits here
loss_fct = torch.nn.NLLLoss(reduction='none', ignore_index=pad_token_id)
# previously: loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)
loss = loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), lm_labels.view(-1))  # (bs * seq_len,)
# NB: the flattening above is batch-major, so per-example aggregation should be
# loss.view(bs, -1).mean(dim=1); view(-1, bs).mean(dim=0) mixes examples
loss = loss.view(-1, bs)
loss = loss.mean(dim=0)    # intended to be one loss value per example
mask = self.dropper(loss)  # 0/1 mask from LossDropper: drops the highest-loss examples
loss *= mask
loss = loss.mean()
After going down for the first few steps, the training loss seems to get stuck (for the baseline, it keeps going down much longer). Maybe I need to pass weight=torch.ones(vocab_size)?
Were your losses for MT similarly on the order of -200?
Thanks in advance!
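For reference, torch.nn.NLLLoss expects log-probabilities as input, while CrossEntropyLoss applies log_softmax internally, so feeding raw logits to NLLLoss can easily produce large negative values like the -200 above. A minimal, illustrative check (toy tensors, not the finetuning setup):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 10           # raw, unnormalized logits
targets = torch.randint(0, 10, (4,))

nll = torch.nn.NLLLoss()
ce = torch.nn.CrossEntropyLoss()

print(nll(logits, targets))                          # can be negative: inputs are not log-probs
print(nll(F.log_softmax(logits, dim=-1), targets))   # correct NLLLoss usage
print(ce(logits, targets))                           # same value as the line above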
After 1 hr, the baseline has BLEU 21.57 (finetuning mBART on WMT en-ro); with the dropper (dropc=0.3) BLEU is 13.5, which seems to me like a bug.
Hi @sshleifer thank you for trying out our code!
Can you try two things:
I have now run four 12-hour experiments, starting with your step 1. The key results are:
nn.CrossEntropyLoss (my original loss fn) works better than NLLLoss, both with and without the dropper. I'm still not passing the weight parameter, but I think a more likely issue is that this method is not as useful for finetuning as for training from scratch. Did you guys ever try it just at the finetuning phase?
Thanks, Sam
@sshleifer apologies for the misunderstanding - I didn't realize you were using the cross entropy loss. You should be able to use loss dropping with cross entropy as well. I can try to run your code if it's available.
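For concreteness, a minimal sketch of the same dropping pattern on top of cross entropy, assuming batch-first lm_logits of shape (bs, seq_len, vocab_size) and a dropper that returns a 0/1 mask over examples, as in the snippet above:

# per-token cross entropy, then one loss value per example, then drop
loss_fct = torch.nn.CrossEntropyLoss(reduction='none', ignore_index=pad_token_id)
token_loss = loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), lm_labels.view(-1))
token_loss = token_loss.view(bs, -1)      # (bs, seq_len), matching the batch-first logits
example_loss = token_loss.mean(dim=1)     # one value per example
mask = self.dropper(example_loss)         # keep low-loss examples, zero out the rest
loss = (example_loss * mask).mean()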
Regarding the finetuning vs training from scratch, our full procedure is:
I did not understand it was a second stage, my bad! I tried it on an English-Romanian translator (MarianMT) trained in another repo on a superset of the WMT English-Romanian dataset. The val BLEU score starts out well below the model's score before training (my bad), but it then goes down further with the loss dropper (that is where you might be able to help). I understand it would be better if I had trained the model from scratch with my code, but I know the training code works from other experiments. I could also point you to a worse checkpoint trained with this exact code if that's helpful.
Here are some more diagnostics https://app.wandb.ai/sshleifer/dmar/runs/2jwhfius?workspace=user-sshleifer
I cut you a branch of my transformers fork to reproduce / see if anything sticks out; no rush or pressure at all.
git clone git@github.com:sshleifer/transformers_fork.git
git checkout dropper-marian
pip install -e .
pip install -r examples/requirements.txt
# get data
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
cat train_marian_loss_dropper.sh # contents are below
You may need to remove the fp16 flags or install the fp16 dependencies, and either install/sign up for wandb or remove that. Then run the script:
./train_marian_loss_dropper.sh --output_dir marian_loss_dropper
The actual integration is in finetune.py; your code is in loss_dropper.py.
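Roughly, the flag in the script below maps onto the dropper like this (a sketch only; the import path and constructor signature are assumptions based on the loss_dropper README, and the argparse setup is illustrative):

import argparse
from loss_dropper import LossDropper   # assumed import path

parser = argparse.ArgumentParser()
parser.add_argument("--loss_dropper_dropc", type=float, default=0.0,
                    help="fraction of highest-loss examples to drop (0 disables dropping)")
args = parser.parse_args()

dropper = None
if args.loss_dropper_dropc > 0:
    dropper = LossDropper(dropc=args.loss_dropper_dropc)  # constructor argument assumed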
For your perusal, here are the contents of train_marian_loss_dropper.sh:
#!/usr/bin/env bash
export PYTHONPATH="../":"${PYTHONPATH}"
export WANDB_PROJECT=dmar
m=Helsinki-NLP/opus-mt-en-ro
export BS=${BS:-32}  # batch size used below; assumed default, adjust to fit your GPU
export MAX_LEN=128
# Instructions to get the dataset:
# wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
# tar -xzvf wmt_en_ro.tar.gz
python finetune.py \
--learning_rate=3e-4 \
--loss_dropper_dropc 0.3 \
--do_train \
--fp16 --fp16_opt_level=O1 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--train_batch_size=$BS --eval_batch_size=$BS \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler \
--gpus 1 --task translation \
"$@"
I saw your paper at ACL and want to test it out in my MT/summarization training code: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py
What should I pass as weight to nn.NLLLoss, and what is the recommended dropc value? Thanks!