Matthieu-Tinycoaching opened this issue 2 years ago:

Hi,
1/ Is it possible to fine-tune a multilingual bi-encoder on domain-specific data using unsupervised synthetic query generation? If yes, is the performance comparable to supervised fine-tuning?
2/ I didn't find a multilingual pre-trained T5 model. Could I use a translation algorithm instead, fine-tuning with the English one and then translating back to the native language?
3/ For a multilingual bi-encoder, does that mean I have to fine-tune with GenQ on all languages of interest at the same time?
Thanks!
Hi,
1) Yes, it is possible. @kwang2049 will soon release an updated version that is a lot better. But gold data will still be the best.
2) Sadly, generation models for other languages are limited, as there is not much good training data. Here you can find translated versions of MS MARCO: https://github.com/unicamp-dl/mMARCO. You could train an mT5 generation model on this. Note: the mT5 models provided there are for re-ranking, not for generation. Otherwise, you can also use machine translation.
3) Yes, this would be the best.
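Roughly, the GenQ pipeline looks like the sketch below: generate synthetic queries for your domain passages with a T5 query generator trained on MS MARCO, then fine-tune the bi-encoder on the resulting (query, passage) pairs. The model names and hyperparameters are only examples, not a tuned recipe; for non-English passages you would swap in a multilingual generator or machine-translate the passages first.

```python
# Sketch of GenQ: synthetic query generation + bi-encoder fine-tuning.
# BeIR/query-gen-msmarco-t5-base-v1 is an English MS MARCO query generator;
# the passage list and all hyperparameters below are placeholders.
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import InputExample, SentenceTransformer, losses

tokenizer = T5Tokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
generator = T5ForConditionalGeneration.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")

passages = ["Python is a popular programming language for machine learning."]

# 1) Generate a few synthetic queries per passage.
train_examples = []
for passage in passages:
    inputs = tokenizer(passage, truncation=True, return_tensors="pt")
    outputs = generator.generate(
        **inputs, max_length=64, do_sample=True, top_p=0.95, num_return_sequences=3
    )
    for query in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        train_examples.append(InputExample(texts=[query, passage]))

# 2) Fine-tune the bi-encoder on the synthetic (query, passage) pairs.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```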
Hi @nreimers, thanks for the feedback!
Great news! You mean that a new multilingual model better than paraphrase-multilingual-MiniLM-L12-v2 will be released?
How could I train an mT5 generation model based on https://github.com/unicamp-dl/mMARCO? If the mT5 models there are trained for re-ranking rather than generation, how would such a multilingual model help with generating questions from passages?
OK, good.
1) Yes, hopefully :)
2) They provide a translated version of MS MARCO, which you can use to train an mT5 generation model yourself, along the lines of the sketch below.
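A minimal sketch of that fine-tuning, assuming you treat query generation as a plain seq2seq task (passage in, query out) on the mMARCO (query, passage) pairs. The single training pair and the hyperparameters below are placeholders, not a recommended recipe:

```python
# Sketch: fine-tune mT5 for query generation on translated MS MARCO data.
# The one (passage, query) pair here stands in for the real mMARCO pairs.
import torch
from torch.utils.data import DataLoader
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

train_pairs = [
    ("Python est un langage de programmation populaire.",  # passage (input)
     "qu'est-ce que python"),                              # query (target)
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(train_pairs, batch_size=8, shuffle=True)

model.train()
for passages, queries in loader:
    enc = tokenizer(list(passages), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(queries), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```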
Yes.
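Concretely, that just means pooling the synthetic (query, passage) pairs from all target languages into one training set before fitting the bi-encoder. A sketch with dummy placeholder pairs:

```python
# Sketch: fine-tune one multilingual bi-encoder on GenQ pairs from all
# target languages at once. The pairs below are placeholders for the
# per-language synthetic data from the generation step.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

pairs_by_language = {
    "en": [InputExample(texts=["what is python", "Python is a programming language."])],
    "fr": [InputExample(texts=["qu'est-ce que python", "Python est un langage de programmation."])],
}
train_examples = [ex for pairs in pairs_by_language.values() for ex in pairs]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```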
@nreimers could you give me an estimate of when the new multilingual bi-encoder will come out?
Hi @Matthieu-Tinycoaching, it will be a focus for Q1 2022. The crawling of large multilingual datasets has made good progress, and I hope it will result in good models.
Hi @nreimers, good news!