facebookresearch / LASER

Language-Agnostic SEntence Representations

Will LASER 2.0 be released? #155

Closed bayanbatn closed 1 year ago

bayanbatn commented 4 years ago

Internal use of this was mentioned here: https://ai.facebook.com/blog/introducing-many-to-many-multilingual-machine-translation/. Will this be publicly released?

hoschwenk commented 4 years ago

Yes, we have a new model trained with SPM. This avoids all the issues with tokenization, in particular with mecab.

bayanbatn commented 4 years ago

nice, thanks! is there an ETA (even if rough) for when this will be available?

mzeidhassan commented 4 years ago

Hi @hoschwenk Same question here. Any rough date when the new version will be released? Thanks!

interlark commented 3 years ago

@hoschwenk Is it gonna happen in 2020? Thank you.

loretoparisi commented 3 years ago

So far, LASER has proved to be among the best approaches to STS, even compared to more recent language models like BERT or XLM. There is a good benchmark here, so it would be worth trying LASER 2.0.

nreimers commented 3 years ago

> So far, LASER has proved to be among the best approaches to STS, even compared to more recent language models like BERT or XLM. There is a good benchmark here, so it would be worth trying LASER 2.0.

I have to disagree, see: https://arxiv.org/pdf/2004.09813.pdf

[Image: STS performance comparison across models]

LASER does not work that well for estimating the semantic similarity of a given text pair, compared to other multilingual models.

However, LASER works quite well for finding translation pairs in large corpora. Here are the results for the BUCC bitext mining task:

[Image: BUCC bitext mining results]

There, LASER was recently outperformed by LaBSE, especially for low-resource languages.

[Image: Tatoeba retrieval results]

Nonetheless, releasing LASER 2.0 and seeing how it performs and how it differs from LASER 1.0 would still be interesting and valuable.
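For a concrete picture of the bitext-mining setting discussed above, here is a minimal sketch, assuming the sentence-transformers library and LaBSE as the encoder. It simply pairs each source sentence with its nearest target by cosine similarity; the actual BUCC evaluation uses margin-based scoring rather than a raw threshold, so treat the model name and threshold as illustrative.

```python
# Minimal bitext-mining sketch: nearest-neighbour matching by cosine similarity.
# Assumes sentence-transformers and the "sentence-transformers/LaBSE" model;
# the real BUCC setup uses margin-based scoring instead of a fixed threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["The cat sits on the mat.", "I like green tea."]          # e.g. English side
tgt = ["Ich mag grünen Tee.", "Die Katze sitzt auf der Matte."]  # e.g. German side

src_emb = model.encode(src, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(tgt, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(src_emb, tgt_emb)        # (len(src), len(tgt)) similarity matrix
for i in range(len(src)):
    j = int(scores[i].argmax())
    score = scores[i][j].item()
    if score > 0.7:                            # illustrative threshold, not tuned
        print(f"{src[i]}  <->  {tgt[j]}  ({score:.2f})")
```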

loretoparisi commented 3 years ago

@nreimers thanks a lot for the details. I agree: on paper the benchmarks are there, but in applied settings things may be a bit different, depending on the document context, language support, performance (e.g. inference time), etc. By the way, LaBSE seems very promising, so it is a good candidate right now, and of course a comparison with LASER 2.0 would be best. Thank you.

nreimers commented 3 years ago

Hi @loretoparisi Yes, it depends greatly on the task.

I have had good experiences with LASER / LaBSE when searching for translated text. However, if you want to find similar sentences that are not perfect translations (e.g. in semantic search or question answering), LASER / LaBSE did not perform that well on my tasks. They were trained to spot good translated sentences, but they have issues when two sentences have a similar meaning but differ to some degree (e.g. 'How can I cook Spaghetti' vs. 'How to cook pasta quickly').

There, I achieved better results with Universal Sentence Encoder and Sentence-BERT, as they better capture how similar two sentences are. But if these models are applied to bitext mining, they sometimes return semantically similar sentences that are not 1-to-1 translations of each other.
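A minimal scoring sketch for the example pair above, assuming the sentence-transformers library; the model name is chosen for illustration, and any paraphrase-trained bi-encoder behaves similarly.

```python
# Sketch: scoring semantic similarity with a Sentence-BERT bi-encoder.
# The model name is illustrative; any paraphrase-trained model works similarly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

a = model.encode("How can I cook Spaghetti", convert_to_tensor=True)
b = model.encode("How to cook pasta quickly", convert_to_tensor=True)

# Similar meaning, but not a 1-to-1 translation: a paraphrase model scores this high,
# while a pure translation-mining model tends to score it lower.
print(util.cos_sim(a, b).item())
```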

loretoparisi commented 3 years ago

@nreimers thanks a lot, this helps. When you mention Google's Universal Sentence Encoder (which, in the vanilla multilingual version, covers only 16 languages), do you mean it as an alternative for embedding calculation, or is it used in combination with your Sentence-BERT? I could not find a specific example here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications

Thanks a lot.

nreimers commented 3 years ago

Hi @loretoparisi as an alternative. Google's Universal Sentence Encoder was trained on quite a large and noisy dataset, so I find that it performs well if you have noisy data, for example, tweets or text from online communities.

Sentence-BERT was trained on cleaner data, so the models we publish there work well if you have clean data as well. But it hasn't seen that much noisy data and has some issues with it.

But ultimately, if you have training data, a custom embedding model always outperforms out-of-the-box models.

In our AugSBERT paper we compared USE against custom models for argument similarity scoring (BWS), duplicate question identification (Quora-QP) and news paraphrase identification (MRPC).

[Image: AugSBERT comparison results]

So if you have training data, you can profit quite a lot when you train your own models.

Yes, Google's Universal Sentence Encoder is originally only available for 16 languages, but I provide a version here that supports 50+ languages: distiluse-base-multilingual-cased-v2 https://www.sbert.net/docs/pretrained_models.html

Training code to extend Universal Sentence Encoder (or any other sentence embedding model) to 100+ languages is also available
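A minimal usage sketch for the multilingual model mentioned above; the example sentences are illustrative.

```python
# Sketch: distiluse-base-multilingual-cased-v2 maps 50+ languages into one space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

emb = model.encode([
    "How do I train a sentence embedding model?",    # English
    "Wie trainiere ich ein Satz-Embedding-Modell?",  # German, same meaning
], convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]).item())  # translations should land close together
```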

hoschwenk commented 3 years ago

Thanks a lot for the various comparative results. It seems to me that LASER still performs surprisingly well on "near similarity tasks", e.g. bitext mining, in particular if fine-tuning on some very task-specific data is not possible or wanted.

For other tasks like STS, which use several degrees of similarity, or for other notions of similarity like NLI, models based on transformers are usually better. But in general, I was always sceptical about how one can say that, for example, a score of 3.2 is good while 3.7 is wrong :-)

It is very important to measure performance on low-resource languages, but I wouldn't be too confident in the results on Tatoeba. Two years ago, that was the best eval set we could find, but the variety, quantity and quality of the sentences for low-resource languages is doubtful. Hopefully, we will have better data soon.

hoschwenk commented 3 years ago

We will release the LASER2 model and encoder by the end of the month. It will use SPM for all languages, which should resolve all the annoying issues with fastBPE or Moses/Jieba/Mecab tokenizers.

LASER2 should also perform better on low-resource languages.
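As a rough illustration of why SPM removes the language-specific preprocessing, here is a minimal sentencepiece sketch; the filename laser2.spm is an assumption to check against the release's download scripts.

```python
# Sketch: one SentencePiece model replaces Moses/Jieba/Mecab tokenizers.
# "laser2.spm" is an assumed filename; check the repo's download scripts for the actual path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="laser2.spm")

for text in ["Hello world!", "こんにちは世界", "Bonjour le monde"]:
    print(sp.encode(text, out_type=str))  # the same subword model handles every language
```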

simonefrancia commented 3 years ago

Can't wait to use all the models. Thank you very much for the info!

nreimers commented 3 years ago

Great, looking forward to it :)

Does LASER 2 use a different architecture than LASER 1, or is it based on the same architecture and idea (encoder-decoder with an LSTM network trained in a translation setting)?

loretoparisi commented 3 years ago

@hoschwenk that's amazing news, low-resource language support is something almost entirely missing right now, so thanks! In fact, my main issue right now is the integration of those three tokenizers, which I had to customize a lot (Mecab packaged instead of scripted, glample's fastBPE as well: https://github.com/glample/fastBPE/issues/12), so I cannot wait to try it out!

hoschwenk commented 3 years ago

> Great, looking forward to it :)
>
> Does LASER 2 use a different architecture than LASER 1, or is it based on the same architecture and idea (encoder-decoder with an LSTM network trained in a translation setting)?

LASER2 uses the same training recipe and network architecture, but with better up-sampling to handle low-resource languages. Therefore, it should run at the same speed as LASER1 (i.e. probably faster than deep transformers).
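For readers unfamiliar with the architecture, here is a toy PyTorch sketch of the encoder side being described (a BiLSTM over token embeddings, max-pooled over time into a fixed-size sentence vector); the layer sizes are illustrative, not the released configuration, and the translation decoder used during training is omitted.

```python
# Toy sketch of a LASER-style sentence encoder: BiLSTM + max-pooling over time.
# Dimensions are illustrative; the decoder used for training is not shown.
import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=320, hidden_dim=512, num_layers=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2 * hidden_dim)
        return x.max(dim=1).values                # max-pool over time -> sentence embedding

encoder = BiLSTMSentenceEncoder()
emb = encoder(torch.randint(1, 50000, (2, 12)))   # two dummy sentences of 12 tokens each
print(emb.shape)                                  # torch.Size([2, 1024])
```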

loretoparisi commented 3 years ago

@hoschwenk happy new year! Any news about LASER2? Thank you!

interlark commented 3 years ago

@hoschwenk happy Valentine's Day! Still no updates for LASER2?

Celebio commented 3 years ago

Hi @interlark @loretoparisi, the training code is available here.

Onur

virgulvirgul commented 3 years ago

> Hi @interlark @loretoparisi, the training code is available here.
>
> Onur

Hi Onur, could you please look at this error:

https://github.com/pytorch/fairseq/issues/3540

loretoparisi commented 3 years ago

@Celebio thank you so much for your support! Any plans to release a pretrained model?

avidale commented 1 year ago

LASER-2 and LASER-3 pretrained models were released last year, and they are now used by default in the embed task.
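For anyone landing on this thread now, a minimal sketch of embedding sentences with the released encoders via the laser_encoders package; the language code and exact API are assumptions to verify against the current README.

```python
# Sketch: embedding sentences with the released LASER2/LASER3 encoders.
# Assumes the laser_encoders pip package; the language code is an assumption,
# so check the repo README for the supported codes and current API.
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hello world", "How are you?"])
print(embeddings.shape)  # expected: (2, 1024) sentence embeddings
```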