adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml

Cannot reproduce cross-lingual transfer result #134

Closed cindyxinyiwang closed 3 years ago

cindyxinyiwang commented 3 years ago

I'm trying to replicate the results for fine-tuning XLM-R base on NER by training my own task adapter on the English NER data. However, in my results most languages score much lower than with full fine-tuning. For example, Arabic is about 5 points lower than simply fine-tuning the full model on English NER. The MAD-X paper, on the other hand, seems to indicate the results should be roughly comparable to full fine-tuning.

I'm training the adapter with the "pfeiffer" config and a reduction factor of 2, and fine-tuning the task adapter for 100 epochs with a learning rate of 1e-4. My code is modified from: https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/token-classification/run_ner.py#L244
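
In case it helps, here is a stripped-down sketch of my setup (simplified from my modified run_ner.py; the exact calls may differ slightly across adapter-transformers versions):

```python
from transformers import AutoModelForTokenClassification, AdapterConfig

# XLM-R base with a token classification head for NER
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=7,  # WikiAnn-style tag set: O, B/I-PER, B/I-ORG, B/I-LOC
)

# Pfeiffer-style task adapter, reduction factor 2 (bottleneck = 768 / 2 = 384)
task_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.add_adapter("ner", config=task_config)

# Freeze the pretrained weights and train only the task adapter
model.train_adapter("ner")
model.set_active_adapters("ner")

# Then I train with the Trainer for 100 epochs at learning rate 1e-4
```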

It would be great if you could upload the script/code to replicate the results in the paper. Thanks!

JoPfeiff commented 3 years ago

Hey @cindyxinyiwang, just training task adapters (without language adapters) tends to underperform full model fine-tuning. However, adapters (in contrast to full fine-tuning) are more robust to random seeds and other hyperparameter configurations. The results reported in our MAD-X paper are averaged over 5 random seeds, which might explain why your best fully fine-tuned model is better than your best task adapter.

Some other hyperparameters you might want to consider: in the MAD-X paper we used a reduction factor of 16 (so a bottleneck size of 48) for our task adapters. We also evaluated after every epoch and took the adapter with the best performance on the dev set.
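
Roughly, those two changes look like this (an illustrative sketch, not our exact experiment code; argument names depend on the underlying transformers version):

```python
from transformers import AdapterConfig, TrainingArguments

# Reduction factor 16 -> bottleneck size 768 / 16 = 48 for XLM-R base
task_config = AdapterConfig.load("pfeiffer", reduction_factor=16)

# Evaluate on the dev set after every epoch and keep the best checkpoint
training_args = TrainingArguments(
    output_dir="ner-task-adapter",
    evaluation_strategy="epoch",
    save_strategy="epoch",        # newer transformers versions require this to match evaluation_strategy
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # assumes compute_metrics reports an "f1" entry
)
```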

The script you linked is the one we used for our experiments.

JoPfeiff commented 3 years ago

If you also work with language adapters, you might want to check out our new paper, where we find that dropping the adapters in the last transformer layer (MAD-X 2.0) yields considerable zero-shot performance gains.
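
In code, the change is essentially one extra argument. A rough sketch (the leave_out argument and the hub identifier below are what I would expect to use; please double-check against the current docs):

```python
from transformers import AutoModelForTokenClassification, AdapterConfig

model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=7)

# Load the pretrained language adapter but drop it in the last transformer
# layer (index 11 for the 12-layer base model)
lang_adapter = model.load_adapter("en/wiki@ukp", leave_out=[11])

# The task adapter skips the last layer as well
task_config = AdapterConfig.load("pfeiffer", reduction_factor=16, leave_out=[11])
model.add_adapter("ner", config=task_config)
```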

cindyxinyiwang commented 3 years ago

Thank you so much for your quick response! I'm using the English language adapter while fine-tuning on English labeled data, and I also tried multiple random seeds. Generally, for mBERT/XLM-R my results on XNLI are on average about 1-2 points lower than full fine-tuning. The gap is larger for NER, especially for non-Latin-script languages. Is that normal? I'm wondering if there is some version mismatch with the pretrained language adapters on AdapterHub. I tried setting adapter_reduction_factor to 16, but it's not much different (actually slightly worse).
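
To make sure we're looking at the same setup, this is roughly how I stack the pretrained language adapter with my task adapter and then swap languages for zero-shot evaluation (a simplified sketch; the Stack composition is the v2-style API, and the hub identifiers are the ones I believe I'm loading):

```python
from transformers import AutoModelForTokenClassification, AdapterConfig
from transformers.adapters.composition import Stack

model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=7)

# Training: frozen English language adapter stacked under a trainable task adapter
lang_en = model.load_adapter("en/wiki@ukp")
model.add_adapter("ner", config=AdapterConfig.load("pfeiffer", reduction_factor=2))
model.train_adapter("ner")                       # freezes everything except "ner"
model.set_active_adapters(Stack(lang_en, "ner"))

# ... fine-tune on English NER data ...

# Zero-shot transfer: swap in the target language adapter, keep the task adapter
lang_ar = model.load_adapter("ar/wiki@ukp")      # e.g. Arabic
model.set_active_adapters(Stack(lang_ar, "ner"))
```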

Thanks for the great work, and thank you so much for your help!

cindyxinyiwang commented 3 years ago

For some context, I'm using the NER processing script from the XTREME benchmark. For the language my (Burmese), the paper reports an F1 of 51 when transferring from English, but I only get around 40.

JoPfeiff commented 3 years ago

Next week I will upload some of the task and language adapters we have trained for our most recent paper. You can then play around with them :)