I have trained a transformer encoder-decoder model by replacing the encoder with a pre-trained model and putting the decoder code (from Tutorial 6: Attention Is All You Need) on top of it, and the model converges properly as training proceeds. Still, when I perform sequential greedy decoding after training using different batch sizes, I get different WER and CER on my validation data.
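For context, a minimal sketch of the kind of batched greedy decoding loop I'm running (the `model.encode`/`model.decode` calls, mask handling, and special-token ids are placeholders, not my exact code):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, src_padding_mask, bos_id, eos_id, max_len=200):
    """Greedily decode a batch of encoder inputs, one token per step."""
    device = src.device
    batch_size = src.size(0)

    # Encode the (padded) source batch once.
    memory = model.encode(src, src_padding_mask)  # (B, S, d_model), placeholder call

    # Start every sequence with BOS and track which ones have emitted EOS.
    ys = torch.full((batch_size, 1), bos_id, dtype=torch.long, device=device)
    finished = torch.zeros(batch_size, dtype=torch.bool, device=device)

    for _ in range(max_len - 1):
        # Decoder attends to its own prefix and to the padded encoder memory.
        logits = model.decode(ys, memory, src_padding_mask)  # (B, T, vocab), placeholder call
        next_token = logits[:, -1].argmax(dim=-1)            # (B,)

        # Once a sequence has produced EOS, keep feeding EOS so it stays finished.
        next_token = torch.where(finished, torch.full_like(next_token, eos_id), next_token)
        ys = torch.cat([ys, next_token.unsqueeze(1)], dim=1)

        finished |= next_token.eq(eos_id)
        if finished.all():
            break

    return ys, finished  # "finished" is what I use to count samples where EOS was predicted
```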
My validation data has 5437 samples. During inference I also tracked the number of samples for which EOS is predicted (the metrics are computed roughly as in the sketch after the table). Below are the observations I'm getting:
| Batch Size | WER   | CER   | EOS detected |
|-----------:|------:|------:|-------------:|
| 1          | 0.859 | 0.672 | 5427         |
| 2          | 0.526 | 0.399 | 3915         |
| 4          | 0.378 | 0.279 | 4866         |
| 8          | 0.33  | 0.239 | 5199         |
| 16         | 0.326 | 0.235 | 5301         |
| 32         | 0.325 | 0.235 | 5361         |
| 64         | 0.326 | 0.235 | 5394         |
| 128        | 0.326 | 0.235 | 5406         |
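For reference, the WER, CER, and EOS counts above are computed along these lines (a minimal sketch; using `jiwer` here and the exact post-processing of the decoded strings are stand-ins for my actual evaluation code):

```python
import jiwer

def score(references, hypotheses, eos_flags):
    """references/hypotheses: lists of decoded strings (hypotheses truncated at the first EOS);
    eos_flags: one bool per sample, True if the decoding emitted EOS."""
    wer = jiwer.wer(references, hypotheses)  # word error rate over the whole validation set
    cer = jiwer.cer(references, hypotheses)  # character error rate over the whole validation set
    eos_detected = sum(eos_flags)            # number of samples where EOS was predicted
    return wer, cer, eos_detected
```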
I don't know what is causing this issue. Any idea what might be causing this batch-size-dependent behavior in the transformer model?