facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

problem with retriever training on xlm-roberta #179

Open ffaisal93 opened 3 years ago

ffaisal93 commented 3 years ago

Hello, I added some code similar to the existing BERT setup (an XLM-RoBERTa tokenizer and encoder description) to use xlm-roberta-base from Hugging Face instead of the BERT encoder model. The problem is that when I try to train on xlm-roberta, the training loss gets stuck in the range of 3-4, so it doesn't train at all. In all other cases, the training loss drops below 1 after a few epochs. Here is the training log: https://paste.ee/p/eNuXR
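For context, this is a minimal sketch of the kind of wrapper I mean, using the Hugging Face XLMRobertaModel/XLMRobertaTokenizer classes. The class name and pooling choice here are illustrative only, not DPR's actual encoder registration code:

    # Illustrative sketch: wrap xlm-roberta-base as a DPR-style encoder that
    # returns one fixed-size vector per input (the first-token representation).
    import torch
    from torch import Tensor
    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    class XLMREncoder(torch.nn.Module):
        def __init__(self, cfg_name: str = "xlm-roberta-base"):
            super().__init__()
            self.model = XLMRobertaModel.from_pretrained(cfg_name)

        def forward(self, input_ids: Tensor, attention_mask: Tensor) -> Tensor:
            out = self.model(input_ids=input_ids, attention_mask=attention_mask)
            # Use the <s> (first) token state as the question/passage embedding,
            # analogous to DPR's use of [CLS] for BERT.
            return out.last_hidden_state[:, 0, :]

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    encoder = XLMREncoder()
    batch = tokenizer(["who wrote hamlet?"], padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        q_vec = encoder(batch["input_ids"], batch["attention_mask"])  # shape (1, 768)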

In all other cases it works fine. Can anyone point out what I am missing here? Thanks. My command:

python -m torch.distributed.launch --nproc_per_node=4 train_dense_encoder.py \
    --max_grad_norm 2.0 --encoder_model_type hf_xlmr --pretrained_model_cfg xlm-roberta-base \
    --sequence_length 256 --warmup_steps 1237 --batch_size 12 --do_lower_case \
    --train_file "$tfile.*" \
    --dev_file $dfile \
    --output_dir $outdir --learning_rate 2e-05 --num_train_epochs 40 \
    --dev_batch_size 12 --val_av_rank_start_epoch 10

vlad-karpukhin commented 3 years ago

Hi @ffaisal93 ,

I can see the train loss does go down, but very slowly (see "Avg. loss per last 100 batches:"). You use 4x12 questions per effective batch, which is quite a low number compared to the "standard" 8x16 scheme, and that should affect the performance.

Also, your first-epoch validation (I can only see a single validation in your log): "NLL Validation: loss = 4.386976. correct prediction ratio 129/6528 ~ 0.019761" - this is really low performance. It means that in only ~2% of samples does the model produce the highest score for the correct passage out of 4x12=48 candidates. If I remember correctly, the base DPR reaches ~80% on this number after the first train epoch on the NQ train set.

Do you use the correct roberta tokenizer? If yes, I recommend playing with the learning rate and max_grad_norm (try increasing it to 10, for example) parameters.
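For reference, the "correct prediction ratio" above counts how often the correct passage gets the highest score among the in-batch candidates. A minimal sketch of this in-batch NLL setup (tensor names are illustrative, not DPR's exact implementation):

    # Sketch of DPR-style in-batch negatives: every other passage in the batch
    # acts as a negative for a given question.
    import torch
    import torch.nn.functional as F

    def in_batch_nll(q_vectors: torch.Tensor, ctx_vectors: torch.Tensor):
        # q_vectors: (B, d) question embeddings; ctx_vectors: (B, d) positive passages.
        scores = torch.matmul(q_vectors, ctx_vectors.t())         # (B, B) similarity matrix
        targets = torch.arange(scores.size(0), device=scores.device)
        loss = F.cross_entropy(scores, targets)                   # softmax NLL over candidates
        correct = (scores.argmax(dim=1) == targets).sum().item()  # numerator of the ratio above
        return loss, correct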

ffaisal93 commented 3 years ago

Hello @vlad-karpukhin , thanks a lot for your reply. I will try adjusting the learning rate and max_grad_norm.

Weirdly, I only see this type of problem with the base XLM-R. For example, here are 3 full training logs for context:

  1. mbert: https://paste.ee/p/h9TJr (trained fine; we can see the loss quickly falls below 1)
  2. xlmr: https://paste.ee/p/kceSz (for all 40 epochs the training loss hovers between 3 and 4, so it is not working; the validation score is very low because of it)
  3. a finetuned xlmr: https://paste.ee/p/EsJZK (I applied multilingual alignment-based finetuning (using parallel language data) to the same xlmr. When I set this finetuned xlmr as the checkpoint, it just works fine and gives a better validation score than mbert. So I think the problem is not the tokenizer, otherwise this wouldn't work either; I set the XLMRobertaTokenizer, and a quick sanity check of it is sketched below. Only the base xlmr doesn't train.) Thanks
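The tokenizer sanity check mentioned in point 3, for completeness (just the standard Hugging Face class; the sample sentence is arbitrary):

    # Quick check that the XLM-R SentencePiece tokenizer is actually in use:
    # it should produce "▁"-prefixed subword pieces rather than BERT-style "##" WordPieces.
    from transformers import XLMRobertaTokenizer

    tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    print(tok.tokenize("¿Quién escribió Hamlet?"))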

sumit-agrwl commented 2 years ago

@ffaisal93 were you able to solve the issue? I am facing the same issue of the loss fluctuating between 3 and 4.