bentrevett / pytorch-seq2seq

Tutorials on implementing a few sequence-to-sequence (seq2seq) models with PyTorch and TorchText.
MIT License

Tutorial 6: [Attention Is All You Need] Different output at different batch sizes during inference #189

Closed rajeevbaalwan closed 9 months ago

rajeevbaalwan commented 2 years ago

I have trained a transformer encoder-decoder model by replacing the encoder with a pre-trained model and putting the decoder code from Tutorial 6 (Attention Is All You Need) on top of it. The model converges properly as training proceeds. However, when I perform sequential greedy decoding after training with different batch sizes, I get different WER and CER on my validation data.

My validation set has 5437 samples. During inference I also tracked the number of samples in which EOS was predicted. Below are my observations:

| Batch size | WER   | CER   | EOS detected |
|-----------:|------:|------:|-------------:|
| 1          | 0.859 | 0.672 | 5427         |
| 2          | 0.526 | 0.399 | 3915         |
| 4          | 0.378 | 0.279 | 4866         |
| 8          | 0.33  | 0.239 | 5199         |
| 16         | 0.326 | 0.235 | 5301         |
| 32         | 0.325 | 0.235 | 5361         |
| 64         | 0.326 | 0.235 | 5394         |
| 128        | 0.326 | 0.235 | 5406         |

I don't know what is causing this issue. Any idea what might be causing this behavior in the transformer model?
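For context on why batch size can change transformer outputs at all: a common cause (not necessarily the one here) is the source padding mask not being applied, or the pre-trained encoder not honoring it. Padded positions then receive nonzero attention weight, so a sequence's encoding depends on how much padding its batch introduces. A minimal sketch with plain scaled dot-product self-attention (the function and variable names are hypothetical, not from the tutorial code) illustrates the effect:

```python
import torch

torch.manual_seed(0)

d = 4  # feature dimension (arbitrary for this sketch)

def attend(x, pad_mask=None):
    # scaled dot-product self-attention over the sequence dimension
    scores = x @ x.transpose(-2, -1) / d ** 0.5
    if pad_mask is not None:
        # mask out padded key positions so they get zero attention weight
        scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

# one real sequence of length 3
seq = torch.randn(1, 3, d)

# the same sequence zero-padded to length 5, as it would be in a larger batch
padded = torch.cat([seq, torch.zeros(1, 2, d)], dim=1)
pad_mask = torch.tensor([[False, False, False, True, True]])

out_single = attend(seq)                      # batch-size-1 reference
out_nomask = attend(padded)[:, :3]            # padded, mask forgotten
out_masked = attend(padded, pad_mask)[:, :3]  # padded, mask applied

print(torch.allclose(out_single, out_masked, atol=1e-6))  # True
print(torch.allclose(out_single, out_nomask, atol=1e-6))  # False
```

With the mask applied, the padded positions are excluded from the softmax and the output matches the batch-size-1 result exactly; without it, the zero pad vectors dilute the attention weights and the output drifts, which would show up exactly as batch-size-dependent WER/CER.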