UKPLab / gpl

Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Apache License 2.0
315 stars 39 forks

Multi-lingual GPL #17

Open Matthieu-Tinycoaching opened 2 years ago

Matthieu-Tinycoaching commented 2 years ago

Hi,

A general problem with multilingual models is that they perform unequally across languages when the proportion of docs in language A is much larger than the proportion of docs in language B.

Wouldn't it be beneficial to translate all docs into all languages before fine-tuning the multilingual model with multilingual GPL?
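To make the idea concrete, here is a minimal sketch of how one might measure the language imbalance and list the translations needed so that every doc exists in every language. It does not translate anything; the `(doc_id, language)` representation and the function names `language_shares` and `translation_plan` are assumptions made for illustration and are not part of GPL.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical corpus representation: (doc_id, language code).
Doc = Tuple[str, str]


def language_shares(docs: List[Doc]) -> Dict[str, float]:
    """Return the fraction of the corpus written in each language."""
    counts = Counter(lang for _, lang in docs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def translation_plan(docs: List[Doc]) -> List[Tuple[str, str, str]]:
    """List (doc_id, source_lang, target_lang) jobs so that every doc
    ends up available in every language present in the corpus."""
    languages = sorted({lang for _, lang in docs})
    return [
        (doc_id, src, tgt)
        for doc_id, src in docs
        for tgt in languages
        if tgt != src
    ]


# Toy corpus: three English docs and one French doc.
docs = [("d1", "en"), ("d2", "en"), ("d3", "en"), ("d4", "fr")]
shares = language_shares(docs)
plan = translation_plan(docs)
```

With a plan like this, each language would cover the same set of docs after translation, so the proportions become equal. Whether that actually helps retrieval quality is exactly the open question here.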

Thanks!

Matthieu-Tinycoaching commented 2 years ago

Hi @nreimers @kwang2049 ,

Do you have any advice regarding my previous question?

Thanks!

nreimers commented 1 year ago

Maybe. You would need to test it.

kwang2049 commented 1 year ago

@Matthieu-Tinycoaching Sorry, I have not studied multilingual scenarios myself, so this is beyond my knowledge. As @nreimers said, you could test it and compare the different cases. If cost (e.g. translation) is an issue, you could scale the experiments from small to large and see what the trend is.

Good luck, and you are welcome to share your results and conclusions :)
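One possible way to set up such a small-to-large comparison is to draw nested subsets of the corpus with a fixed seed, so that each larger run contains all docs of the smaller ones. This is only a sketch; `nested_subsets` and the doc ids are hypothetical and not part of GPL.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(
    doc_ids: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Shuffle once with a fixed seed, then take prefixes of the shuffle.

    Every smaller subset is contained in every larger one, so differences
    between runs come from the added docs rather than from resampling.
    Sizes larger than the corpus are skipped.
    """
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return {
        size: shuffled[:size]
        for size in sorted(sizes)
        if size <= len(shuffled)
    }


# Toy corpus of 1000 doc ids; translate and train on 10, then 100, then all.
corpus_ids = [f"doc{i}" for i in range(1000)]
subsets = nested_subsets(corpus_ids, sizes=[10, 100, 1000])
```

Translating and training on `subsets[10]` first, then `subsets[100]`, keeps the translation cost low until the trend is visible.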

nickchomey commented 1 year ago

I wonder if these new multilingual query generator and cross encoder models could be used?

https://huggingface.co/doc2query/msmarco-14langs-mt5-base-v1
https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
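As a rough, untested sketch of how the two models might slot into the GPL steps (query generation, then cross-encoder margin labels), something like the following could be a starting point. It assumes `transformers`, `torch` and `sentence-transformers` are installed and that the models load through `AutoModelForSeq2SeqLM` and `CrossEncoder`. The function names and generation settings are my own choices, and the expected input format should be checked against each model card. The margin is the cross-encoder score of the positive passage minus that of the negative, as in the GPL paper.

```python
from typing import Callable, List, Sequence, Tuple

# Model names taken from the links above.
QGEN_NAME = "doc2query/msmarco-14langs-mt5-base-v1"
CE_NAME = "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"

# A scorer maps (query, passage) pairs to one relevance score per pair.
Scorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def generate_queries(passage: str, num_queries: int = 3) -> List[str]:
    """Sample synthetic queries for a passage (downloads the model)."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(QGEN_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(QGEN_NAME)
    inputs = tokenizer(
        passage, return_tensors="pt", truncation=True, max_length=384
    )
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def load_cross_encoder_scorer() -> Scorer:
    """Wrap the multilingual cross-encoder as a Scorer (downloads the model)."""
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(CE_NAME)
    return lambda pairs: [float(s) for s in model.predict(list(pairs))]


def margin_labels(
    query: str, positive: str, negatives: Sequence[str], score_pairs: Scorer
) -> List[float]:
    """GPL-style pseudo labels: score(query, positive) - score(query, negative)."""
    pairs = [(query, positive)] + [(query, neg) for neg in negatives]
    scores = score_pairs(pairs)
    return [scores[0] - s for s in scores[1:]]
```

Usage would be along the lines of `margin_labels(q, passage, mined_negatives, load_cross_encoder_scorer())` for each `q` in `generate_queries(passage)`. Negative mining with a multilingual retriever is left out here.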