facebookresearch / MLQA


XLM Evaluation results #9

Open nooralahzadeh opened 4 years ago

nooralahzadeh commented 4 years ago

Hi, I ran some experiments using the Hugging Face implementation of XLM, training on SQuAD v1.1 and evaluating on the MLQA test set. The results (F1 / EM) are:

en 68.51 / 56.13
es 57.59 / 41.21
ar 47.88 / 31.41
de 51.99 / 38.16
zh 38.34 / 21.39
hi 46.13 / 31.72
vi 44.09 / 27.07

I am wondering why there is such a large difference from your results. Did you do anything special besides the early stopping on MLQA-en?

patrick-s-h-lewis commented 4 years ago

Hi Farad,

Which implementation of XLM are you using?

The HPs for XLM were: Adam: lr=3e-5, weight decay 0.005, clip_norm=5, epochs=3, batch size=32, triangular scheduler: warmup_steps=500, total_steps=10000
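For concreteness, here is roughly how those hyperparameters would look in a generic PyTorch / Hugging Face training loop. This is only a sketch, since we trained with pytext, not this code; the model class, checkpoint name, and `train_dataloader` are placeholders:

```python
import torch
from transformers import XLMForQuestionAnsweringSimple, get_linear_schedule_with_warmup

# Assumed model/checkpoint, for illustration only
model = XLMForQuestionAnsweringSimple.from_pretrained("xlm-mlm-tlm-xnli15-1024")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.005)
# Triangular schedule: linear warmup for 500 steps, then linear decay over 10000 total steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10000
)

model.train()
for epoch in range(3):                      # epochs=3
    for batch in train_dataloader:          # batch size 32, batches include start/end positions
        loss = model(**batch).loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip_norm=5
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```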

We used the pytext implementation of XLM. Correct tokenization and preprocessing are very important for good performance, and I'm not sure whether the HF version gets this right; a number of people have struggled to get good results with XLM on HF. We hope to open-source the code when the colleague who wrote it is back from leave.

RachelKer commented 4 years ago

Hello, any updates on the pytext code release? I know the COVID situation may have changed the plans. I'm struggling to replicate the one-shot learning results from your paper (that is, training on your MLQA-train Chinese data) with HF XLM-R (inference with a zero-shot model on Chinese works fine). Thank you!

patrick-s-h-lewis commented 4 years ago

Hi Rachel!

XLM-R wasn't included in our paper, so we can't directly help there. I'll check internally on the reproducibility code for the MLQA paper.

Patrick

nooralahzadeh commented 4 years ago

Hi @RachelKer, to get similar performance on the zh test set, you just need to add `final_text = tok_text` after line 497 in squad_metrics.py (only for zh). Chinese has no spaces or whitespace-delimited sub-words, so the get_final_text() step that maps the tokenized prediction back onto the original text isn't needed.
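For anyone else applying this, a rough sketch of where the change goes inside transformers' `compute_predictions_logits()` in squad_metrics.py; variable names follow that file, but exact line numbers differ across versions, and the `is_chinese` flag here is a made-up switch for illustration:

```python
# ... inside compute_predictions_logits(), after the tokenized prediction is built
tok_text = tok_text.strip()
tok_text = " ".join(tok_text.split())
orig_text = " ".join(orig_tokens)

final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)
if is_chinese:
    # Chinese has no whitespace word boundaries, so skip the
    # whitespace-based alignment and keep the tokenized span directly.
    final_text = tok_text
```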

RachelKer commented 4 years ago

@nooralahzadeh Thank you, I saw your issue on the HF repo a few days ago, and with this change I managed to get the correct results for BERT and XLM trained on Chinese, but not for XLM-R. Did you manage to train XLM-RoBERTa on Chinese?

@patrick-s-h-lewis Oh indeed, I confused XLM-R and XLM in your paper, I am sorry. I think the training problem I have occurs with XLM-R only. Thanks for checking on the code release anyway, and for your quick answer!

patrick-s-h-lewis commented 4 years ago

Hey @RachelKer and @nooralahzadeh,

I asked internally about XLM-R (since there is some overlap between the teams). The pytext model is released, but there are no instructions for running it on MLQA, so someone is going to write those instructions up :)

Patrick