I have trained a transformer encoder-decoder model by replacing the encoder with a pre-trained model and putting the decoder code (from Tutorial 6: Attention Is All You Need) on top of it, and the model converges properly as training proceeds. Still, when I perform sequential greedy decoding after training using different batch sizes, I get different WER and CER on my validation data.
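For context, a minimal sketch of the kind of batched greedy decoding loop I'm running (the `model.encode`/`model.decode` calls, mask handling, and special-token ids are placeholders, not my exact code):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, src_padding_mask, bos_id, eos_id, max_len=200):
    """Greedily decode a batch of encoder inputs, one token per step."""
    device = src.device
    batch_size = src.size(0)

    # Encode the (padded) source batch once.
    memory = model.encode(src, src_padding_mask)  # (B, S, d_model), placeholder call

    # Start every sequence with BOS and track which ones have emitted EOS.
    ys = torch.full((batch_size, 1), bos_id, dtype=torch.long, device=device)
    finished = torch.zeros(batch_size, dtype=torch.bool, device=device)

    for _ in range(max_len - 1):
        # Decoder attends to its own prefix and to the padded encoder memory.
        logits = model.decode(ys, memory, src_padding_mask)  # (B, T, vocab), placeholder call
        next_token = logits[:, -1].argmax(dim=-1)            # (B,)

        # Once a sequence has produced EOS, keep feeding EOS so it stays finished.
        next_token = torch.where(finished, torch.full_like(next_token, eos_id), next_token)
        ys = torch.cat([ys, next_token.unsqueeze(1)], dim=1)

        finished |= next_token.eq(eos_id)
        if finished.all():
            break

    return ys, finished  # "finished" is what I use to count samples where EOS was predicted
```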
My validation data has 5437 samples. During inference I also tracked the number of samples for which EOS is predicted (the metrics are computed roughly as in the sketch after the table). Below are the observations I'm getting:
| Batch Size | WER   | CER   | EOS detected |
|-----------:|------:|------:|-------------:|
| 1          | 0.859 | 0.672 | 5427         |
| 2          | 0.526 | 0.399 | 3915         |
| 4          | 0.378 | 0.279 | 4866         |
| 8          | 0.33  | 0.239 | 5199         |
| 16         | 0.326 | 0.235 | 5301         |
| 32         | 0.325 | 0.235 | 5361         |
| 64         | 0.326 | 0.235 | 5394         |
| 128        | 0.326 | 0.235 | 5406         |
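For reference, the WER, CER, and EOS counts above are computed along these lines (a minimal sketch; using `jiwer` here and the exact post-processing of the decoded strings are stand-ins for my actual evaluation code):

```python
import jiwer

def score(references, hypotheses, eos_flags):
    """references/hypotheses: lists of decoded strings (hypotheses truncated at the first EOS);
    eos_flags: one bool per sample, True if the decoding emitted EOS."""
    wer = jiwer.wer(references, hypotheses)  # word error rate over the whole validation set
    cer = jiwer.cer(references, hypotheses)  # character error rate over the whole validation set
    eos_detected = sum(eos_flags)            # number of samples where EOS was predicted
    return wer, cer, eos_detected
```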
I don't know what is causing this issue. Any idea what might be causing this batch-size-dependent behavior in the transformer model?