Matthieu-Tinycoaching opened 3 years ago
We are currently working on it. It is a larger project, as we need to mine a large quantity of authentic multilingual training data. Further, how to train the model is unclear. The evaluation is also unclear, as there are only few resources available so far. So I expect this will take a bit longer.
You can check multilingual MS MARCO: https://github.com/unicamp-dl/mMARCO
It is a machine-translated version of MS MARCO which can be used for training models.
Hi @nreimers, thanks for helping.
Would it then be possible to use the multilingual MS MARCO model you proposed (https://github.com/unicamp-dl/mMARCO) for asymmetric semantic search or QA?
Yes
OK.
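For reference, a minimal sketch of what asymmetric semantic search with such a bi-encoder could look like in sentence-transformers. The model name below is a placeholder, not a real checkpoint: substitute whichever multilingual MS MARCO / mMARCO bi-encoder you end up using.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder name (assumption): swap in a multilingual bi-encoder
# trained on MS MARCO / mMARCO.
model = SentenceTransformer("a-multilingual-msmarco-bi-encoder")

# Short query vs. longer passages -> asymmetric semantic search
query = "Comment fonctionne la photosynthèse ?"
passages = [
    "La photosynthèse est le processus par lequel les plantes convertissent la lumière en énergie chimique.",
    "Photosynthesis is the process by which plants convert light into chemical energy.",
    "The Eiffel Tower was completed in 1889.",
]

query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)

# Retrieve the passages most similar to the query
hits = util.semantic_search(query_emb, passage_embs, top_k=2)[0]
for hit in hits:
    print(hit["score"], passages[hit["corpus_id"]])
```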
In the case of QA, is it better to apply retrieve & re-rank or just the multilingual MS MARCO model? In case retrieve & re-rank is better, which multilingual cross-encoder could I then use?
Hi @nreimers, could you give me any advice on my last question?
Retrieve & re-rank is better, but slower. As of now there are not that many multilingual cross-encoders. I only know of these: https://github.com/unicamp-dl/mMARCO
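For anyone looking for the overall shape of retrieve & re-rank, here is a rough sketch: fast candidate retrieval with a bi-encoder, then re-scoring of the candidates with a cross-encoder. Both model names are placeholders (assumptions), not real checkpoints; use whichever multilingual bi-encoder and mMARCO cross-encoder you have available.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder names (assumptions): substitute real multilingual checkpoints.
bi_encoder = SentenceTransformer("a-multilingual-msmarco-bi-encoder")
cross_encoder = CrossEncoder("a-multilingual-mmarco-cross-encoder")

corpus = ["...your passages, possibly in several languages..."]
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "Quels sont les bienfaits du sommeil ?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Step 1: fast retrieval of candidate passages with the bi-encoder
hits = util.semantic_search(query_emb, corpus_embs, top_k=100)[0]

# Step 2: slower but more accurate re-ranking with the cross-encoder
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
for hit, score in reranked[:5]:
    print(score, corpus[hit["corpus_id"]])
```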
@Matthieu-Tinycoaching I just made this comment in another similar issue (#1011)
Has anyone here tried the newest multilingual cross-encoder model? It uses a multilingual version of MiniLM trained on the multilingual MS MARCO (mMARCO) dataset. It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing the Hugging Face Hub. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf
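If it helps, a minimal usage sketch for that cross-encoder. The model name is taken from the comment above; I'm assuming it behaves like the other MS MARCO cross-encoders (higher score = more relevant passage for the query).

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of around 3.7 million registered inhabitants.",
    "Berlin est la capitale de l'Allemagne.",
]

# Score each (query, passage) pair; sort descending for a relevance ranking
scores = model.predict([(query, p) for p in passages])
print(scores)
```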
Hi,
Would there be any multilingual versions of the multi-qa models, or in particular an English-French model?
Thanks!