UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multilingual Information Retrieval (MS-Marco Bi-Encoders) #695

Open · janandreschweiger opened this issue 3 years ago

janandreschweiger commented 3 years ago

Hey everyone! First of all, congratulations on your new Information Retrieval models. They are absolutely amazing.

My Question / Kind Request
We currently use your new Bi-Encoder msmarco-distilroberta-base-v2 and desperately need a German-English model. There are great models like T-Systems-onsite/cross-en-de-roberta-sentence-transformer (EN-DE) and paraphrase-xlm-r-multilingual-v1 (50+ languages) that do a great job. I think a lot of people out there would like to have such a model fine-tuned on MS MARCO.

The Reason
Although T-Systems-onsite/cross-en-de-roberta-sentence-transformer and paraphrase-xlm-r-multilingual-v1 are outstanding models, they don't perform well in a real-world search system: they don't work well with short query inputs against longer paragraphs in the knowledge base. Your Bi-Encoders are a game changer. They make transformers really applicable.
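For context, this is roughly the retrieval setup in question: a minimal sketch using the standard sentence-transformers semantic-search utilities (the passages below are made-up examples, not from the original thread).

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder tuned on MS MARCO: short queries matched against longer passages
model = SentenceTransformer("msmarco-distilroberta-base-v2")

passages = [
    "Python is a popular programming language for machine learning.",
    "Berlin is the capital and largest city of Germany.",
]

# Encode the corpus once; encode each incoming query at search time
passage_embeddings = model.encode(passages, convert_to_tensor=True)
query_embedding = model.encode("german capital", convert_to_tensor=True)

# Retrieve the best-matching passages by cosine similarity
hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])
```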

Keep it up
You always surprise me with what transformers are capable of. Thanks for all of your effort and keep up the great work!

nreimers commented 3 years ago

Hi @peterchiuglg
Not sure when I will have the time. I have some other papers & projects that I need to finish first. Then I will need to translate the training data (which takes some time) and also improve multilingual knowledge distillation to make it more efficient.

janandreschweiger commented 3 years ago

Hi @nreimers, will there be a drop in performance compared to the already released temporary German-English-only model? If so, could you perhaps publish a model for just those two languages, provided there is enough demand?

geraldwinkler commented 3 years ago

Hey @janandreschweiger,

thanks for asking. My colleagues and I are also concerned about that, as we would like to use the new models once they are published.

nreimers commented 3 years ago

Hi @janandreschweiger @geraldwinkler I hope there will be no drop in performance.

But I plan to publish the translated training data, so that recreating specific bi-lingual models will be easy.

peterchiuglg commented 3 years ago

> Hi @peterchiuglg
> Not sure when I will have the time. I have some other papers & projects that I need to finish first. Then I will need to translate the training data (which takes some time) and also improve multilingual knowledge distillation to make it more efficient.

@nreimers How are you doing the translation? We can help with the Chinese part!

ace-kay-law-neo commented 3 years ago

Hey @nreimers I also hope that there won't be any drop in performance, as I don't have enough resources to retrain everything for German-English. I'm especially concerned about languages like Chinese, since they differ a lot from Spanish, Italian, German, or English. You mentioned that there will be a multilingual cross-encoder (for German-English). Do you have any idea when this model will be published? Thanks a lot for all the effort you put into the multilingual retrieval models; they are absolutely amazing!

PanicButtonPressed commented 3 years ago

Hi @nreimers, first of all, thanks for providing the EN-DE models and for your help. Regarding the NDCG results, did you use the entire translated MS MARCO dataset for training or just a subset?

nreimers commented 3 years ago

Hi @ace-kay-law-neo Sadly no idea on the timeline, as more and more projects are piling up.

Hi @PanicButtonPressed Yes, all passages were translated and used for multilingual knowledge distillation.
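As a rough illustration of the multilingual knowledge distillation setup described here: a minimal sketch following the library's make_multilingual example (v2-era fit API); the parallel-data file path and the student base model are placeholders, not the exact setup used for the released models.

```python
from torch.utils.data import DataLoader
from sentence_transformers import (SentenceTransformer, models, losses,
                                   ParallelSentencesDataset)

# Monolingual teacher: the MS MARCO bi-encoder whose vector space is transferred
teacher = SentenceTransformer("msmarco-distilroberta-base-v2")

# Multilingual student: learns to mimic the teacher's embeddings in all languages
word_emb = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# Parallel sentences, one "english<TAB>german" pair per line; path is a placeholder
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences-en-de.tsv.gz")

train_loader = DataLoader(train_data, shuffle=True, batch_size=32)
# MSE between student embeddings and teacher embeddings of the source sentence
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_loader, train_loss)],
            epochs=1, warmup_steps=1000)
```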

josemarcosrf commented 3 years ago

We are also looking to solve the same problem. We are especially interested in the cross-lingual setup (one model for several language pairs), but we are thinking of starting by training an FR-EN version and an ES-EN version.

Seeing that a number of people/teams are trying to solve the same or a similar problem, we would be happy to coordinate and divide up the work of training these models, so we can expand to as many language pairs as possible and avoid duplicating effort.

@nreimers, if there are specific areas where you'd like a hand, we're happy to discuss. For example, if there are still translations to be done, we could start helping there. Then we can move on to distillation and so forth.

fighttiger25 commented 3 years ago

Hi @nreimers, I played a bit with your new model "msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned". It works quite well on short queries. Could you share a rough timeline for the Italian and French versions?
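For anyone else trying that model, a minimal cross-lingual sketch, assuming the standard sentence-transformers API (the German query and English passages are made-up examples):

```python
from sentence_transformers import SentenceTransformer, util

# EN-DE language-aligned bi-encoder: queries and passages share one vector space
model = SentenceTransformer("msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned")

query = "Wie hoch ist der Eiffelturm?"  # German query
passages = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "The Brandenburg Gate is a landmark in Berlin.",
]

# Cosine similarity between the query and each English passage
scores = util.cos_sim(model.encode(query), model.encode(passages))
print(scores)  # the matching passage should score highest
```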

Matthieu-Tinycoaching commented 2 years ago

@nreimers any news on the timeline for the French version of the msmarco model?

Thanks!

nickchomey commented 1 year ago

I just made this comment in another similar issue - it should solve this problem.

Has anyone here tried the newest multilingual Cross-Encoder model? It uses a multilingual version of MiniLM (mMiniLMv2) trained on a multilingual version of the MS MARCO dataset (mMARCO). It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models are very competitive with monolingual ones. https://arxiv.org/pdf/2108.13897.pdf
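For anyone who wants to try it, a minimal reranking sketch with that model, assuming the standard CrossEncoder API (the query and passages are made-up examples):

```python
from sentence_transformers import CrossEncoder

# Multilingual cross-encoder: scores (query, passage) pairs directly
model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of around 3.7 million people.",
    "Berlin ist für sein Nachtleben bekannt.",  # non-English passages work too
]

# Higher score = more relevant; typically used to rerank bi-encoder candidates
scores = model.predict([(query, p) for p in passages])
for p, s in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(round(float(s), 3), p)
```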