Open · sshleifer opened this issue 4 years ago
Thank you for your interest in our work!
While we recommend the following, your task-optimal hyperparameters might vary:
Please let me know if you need anything else!
In your MT experiment, what do you use for weights? Here is my loss fn (modified from https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py#L151):
lm_logits = outputs[0]  # shape (bs, seq_len, vocab_size)
assert lm_logits.shape[-1] == self.model.config.vocab_size
# NB: NLLLoss expects log-probabilities, but lm_logits are raw logits here
loss_fct = torch.nn.NLLLoss(reduction='none', ignore_index=pad_token_id)
# previously: loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)
loss = loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), lm_labels.view(-1))  # (bs * seq_len,)
# NB: the flattening above is batch-major, so per-example aggregation should be
# loss.view(bs, -1).mean(dim=1); view(-1, bs).mean(dim=0) mixes examples
loss = loss.view(-1, bs)
loss = loss.mean(dim=0)    # intended to be one loss value per example
mask = self.dropper(loss)  # 0/1 mask from LossDropper: drops the highest-loss examples
loss *= mask
loss = loss.mean()
After going down for the first few steps, the training loss seems to get stuck (for the baseline, it keeps going down much longer). Maybe I need to pass weight=torch.ones(vocab_size)?
Were your losses for MT similarly on the order of -200?
Thanks in advance!
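For reference, torch.nn.NLLLoss expects log-probabilities as input, while CrossEntropyLoss applies log_softmax internally, so feeding raw logits to NLLLoss can easily produce large negative values like the -200 above. A minimal, illustrative check (toy tensors, not the finetuning setup):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 10           # raw, unnormalized logits
targets = torch.randint(0, 10, (4,))

nll = torch.nn.NLLLoss()
ce = torch.nn.CrossEntropyLoss()

print(nll(logits, targets))                          # can be negative: inputs are not log-probs
print(nll(F.log_softmax(logits, dim=-1), targets))   # correct NLLLoss usage
print(ce(logits, targets))                           # same value as the line above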
After 1 hr, the baseline has BLEU 21.57 (finetuning mBART on WMT en-ro); with the dropper (dropc=0.3) BLEU is 13.5, which seems to me like a bug.
Hi @sshleifer thank you for trying out our code!
Can you try two things:
I have now run four 12-hour experiments, starting with your step 1. The key results are:
nn.CrossEntropyLoss (my original loss fn) works better than NLLLoss, both with and without the dropper. I'm still not passing the weight parameter, but I think a more likely issue is that this method is not as useful for finetuning as for training from scratch. Did you guys ever try it just at the finetuning phase?
Thanks, Sam
@sshleifer apologies for the misunderstanding - I didn't realize you were using the cross entropy loss. You should be able to use loss dropping with cross entropy as well. I can try to run your code if it's available.
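For concreteness, a minimal sketch of the same dropping pattern on top of cross entropy, assuming batch-first lm_logits of shape (bs, seq_len, vocab_size) and a dropper that returns a 0/1 mask over examples, as in the snippet above:

# per-token cross entropy, then one loss value per example, then drop
loss_fct = torch.nn.CrossEntropyLoss(reduction='none', ignore_index=pad_token_id)
token_loss = loss_fct(lm_logits.view(-1, lm_logits.shape[-1]), lm_labels.view(-1))
token_loss = token_loss.view(bs, -1)      # (bs, seq_len), matching the batch-first logits
example_loss = token_loss.mean(dim=1)     # one value per example
mask = self.dropper(example_loss)         # keep low-loss examples, zero out the rest
loss = (example_loss * mask).mean()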
Regarding the finetuning vs training from scratch, our full procedure is:
I did not understand it was a second stage, my bad! I tried it on an English-Romanian translator (MarianMT) trained in another repo on a superset of the WMT English-Romanian dataset. The val BLEU score starts out well below the model's score before training (my bad), but it then goes down further with the loss dropper (that is where you might be able to help). I understand it would be better if I had trained the model from scratch with my code, but I know the training code works from other experiments. I could also point you to a worse checkpoint trained with this exact code if that's helpful.
Here are some more diagnostics https://app.wandb.ai/sshleifer/dmar/runs/2jwhfius?workspace=user-sshleifer
I cut you a branch of my transformers fork to reproduce / see if anything sticks out; no rush or pressure at all.
git clone git@github.com:sshleifer/transformers_fork.git
git checkout dropper-marian
pip install -e .
pip install -r examples/requirements.txt
# get data
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
cat train_marian_loss_dropper.sh # contents are below
You may need to remove the fp16 flags or install the fp16 dependencies, and either install/sign up for wandb or remove that. Then run the script:
./train_marian_loss_dropper.sh --output_dir marian_loss_dropper
The actual integration is in finetune.py; your code is in loss_dropper.py.
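Roughly, the flag in the script below maps onto the dropper like this (a sketch only; the import path and constructor signature are assumptions based on the loss_dropper README, and the argparse setup is illustrative):

import argparse
from loss_dropper import LossDropper   # assumed import path

parser = argparse.ArgumentParser()
parser.add_argument("--loss_dropper_dropc", type=float, default=0.0,
                    help="fraction of highest-loss examples to drop (0 disables dropping)")
args = parser.parse_args()

dropper = None
if args.loss_dropper_dropc > 0:
    dropper = LossDropper(dropc=args.loss_dropper_dropc)  # constructor argument assumed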
For your perusal, here are the contents of train_marian_loss_dropper.sh:
#!/usr/bin/env bash
export PYTHONPATH="../":"${PYTHONPATH}"
export WANDB_PROJECT=dmar
m=Helsinki-NLP/opus-mt-en-ro
export BS=${BS:-32}  # batch size used below; assumed default, adjust to fit your GPU
export MAX_LEN=128
# Instructions to get the dataset:
# wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
# tar -xzvf wmt_en_ro.tar.gz
python finetune.py \
--learning_rate=3e-4 \
--loss_dropper_dropc 0.3 \
--do_train \
--fp16 --fp16_opt_level=O1 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--train_batch_size=$BS --eval_batch_size=$BS \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler \
--gpus 1 --task translation \
"$@"
I saw your paper at ACL and want to test it out in my MT/summarization training code: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py
What should I pass as weight to nn.NLLLoss, and what is the recommended dropc value? Thanks!