G-XLT performance for MLQA

I am trying to reproduce the generalized cross-lingual transfer (G-XLT) results for M-BERT reported in Table 6 of the paper.

I did the following:

I fine-tuned M-BERT on the SQuAD 1.1 training set only and validated on the MLQA English dev set.
For long sequences, I used the sliding-window approach of the BERT authors (https://github.com/google-research/bert/issues/66), with a maximum sequence length of 384 and a doc stride of 128.
I set the maximum answer length to 30.
During fine-tuning, I used the following settings:

learning rate = 5e-5
warmup_steps = 0
epochs = 3
gradient_accumulation_steps = 1
grad_clipping = 1.0
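
A minimal sketch of the windowing and maximum-answer-length constraint described above, to make clear what I mean by them. This is an illustration, not my exact script and not necessarily what the paper does: the function names `make_windows` and `best_span` are hypothetical, the windowing assumes the `[CLS] query [SEP] context [SEP]` layout (3 special tokens), and the span search omits n-best lists and the masking of non-context positions.

```python
from typing import List, Sequence, Tuple

MAX_SEQ_LENGTH = 384     # maximum sequence length from the settings above
DOC_STRIDE = 128         # doc stride from the settings above
MAX_ANSWER_LENGTH = 30   # maximum answer length from the settings above


def make_windows(num_doc_tokens: int, num_query_tokens: int,
                 max_seq_length: int = MAX_SEQ_LENGTH,
                 doc_stride: int = DOC_STRIDE) -> List[Tuple[int, int]]:
    """Return (start, length) spans over the context tokens, one per window."""
    # Assumption: 3 special tokens, [CLS] query [SEP] context [SEP].
    max_doc_tokens = max_seq_length - num_query_tokens - 3
    if max_doc_tokens <= 0:
        raise ValueError("query leaves no room for context tokens")
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(max_doc_tokens, num_doc_tokens - start)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return windows


def best_span(start_logits: Sequence[float], end_logits: Sequence[float],
              max_answer_length: int = MAX_ANSWER_LENGTH
              ) -> Tuple[int, int, float]:
    """Return (start, end, score) maximising start_logit + end_logit,
    subject to start <= end and end - start + 1 <= max_answer_length."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_length)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


if __name__ == "__main__":
    # Toy example: a 1000-token context with a 10-token question.
    print(make_windows(num_doc_tokens=1000, num_query_tokens=10))
    print(best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.5, 3.0, 1.0]))
```

With several windows per example, the final answer would be the highest-scoring span across all windows of that example.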

I got the following result. As you can see, performance is very poor, particularly for Hindi and Vietnamese. I suspect that a different inference procedure was used in your work. Could you briefly explain what you did during inference?

[Image: bert]
