facebookresearch / LASER

Language-Agnostic SEntence Representations

Source code for training #70

Open okgrammer opened 5 years ago

okgrammer commented 5 years ago

Is there any plan to release the code/scripts for training the encoder? I would like to train using my own data. Thanks!

guotong1988 commented 5 years ago

Same question.

guotong1988 commented 5 years ago

This project does not support training. Do you know of any other code for sentence embeddings?

PadLex commented 5 years ago

> This project does not support training. Do you know of any other code for sentence embeddings?

You could try using BERT.

guotong1988 commented 5 years ago

It seems BERT cannot output good sentence embeddings. https://arxiv.org/abs/1904.07531v4

nreimers commented 5 years ago

BERT out of the box does not yield good embeddings, that's true. But with some fine-tuning it can give you really nice embeddings.

See https://github.com/UKPLab/sentence-transformers for how to fine-tune BERT to give you good sentence embeddings.
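
For readers unfamiliar with the term, here is a minimal sketch of what "sentence embedding" means in this exchange, assuming mean pooling over token vectors (one common pooling choice) and cosine similarity for comparison. The token vectors are made-up toy numbers, not BERT outputs, and the helper names `mean_pool` and `cosine_similarity` are illustrative only.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors element-wise into a single sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "token embeddings" for two sentences (assumed values).
sentence_a = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0]]
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

emb_a = mean_pool(sentence_a)
emb_b = mean_pool(sentence_b)
similarity = cosine_similarity(emb_a, emb_b)
```

Fine-tuning, as discussed above, changes the token vectors so that pooled vectors of similar sentences end up close together; the pooling and comparison steps stay this simple.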

guotong1988 commented 5 years ago

Thank you very much!!

PadLex commented 5 years ago

Ironically, BERT is giving me significantly better results out of the box, and with a night of fine-tuning on a GTX 1060 it works even better.

hoschwenk commented 5 years ago

Hello, are the results you mention for English only, or do they involve some (zero-shot) transfer to other languages? If only English is needed, then there are indeed several other approaches that you may want to compare for your task. LASER focuses on multilingual sentence embeddings that work well for many languages without the need to fine-tune them.

Proyag commented 5 years ago

There's https://github.com/transducens/LASERtrain (approximation, models not inter-compatible)

hoschwenk commented 5 years ago

Hello, we are aware that there's a lot of interest in the training code. The original LASER training code was based on a version of fairseq that is now almost a year old. We are working on a substantially improved version of LASER training that will use the current fairseq and scale much better to many languages. Please be patient :-)

ever4244 commented 4 years ago

> There's https://github.com/transducens/LASERtrain (approximation, models not inter-compatible)

Have you been able to run their code? I run into an error when I run it: indices, ignored = _filter_by_size_dynamic() AttributeError: 'function' object has no attribute 'size'

https://github.com/pytorch/fairseq/issues/1555
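
The message itself only says that something read `.size` from a plain function. Below is a minimal sketch of how that message can arise in Python, using hypothetical names (`ToyDataset`, `filter_by_size`, `size_fn`); it is not the actual fairseq or LASERtrain code. The assumption is a helper that expects a dataset object but receives a size function instead, as can happen when caller and library disagree about what should be passed.

```python
class ToyDataset:
    """Hypothetical dataset that knows the length of each example."""

    def __init__(self, lengths):
        self.lengths = lengths

    def size(self, index):
        return self.lengths[index]


def filter_by_size(indices, dataset, max_len):
    """Hypothetical helper: expects the dataset object itself."""
    return [i for i in indices if dataset.size(i) <= max_len]


data = ToyDataset([3, 12, 7])

# Passing the dataset object works as intended.
kept = filter_by_size([0, 1, 2], data, max_len=10)


def size_fn(index):
    """A plain function, which is what a different calling convention might pass."""
    return data.lengths[index]


# Passing the function instead makes the helper look up `.size` on a function.
try:
    filter_by_size([0, 1, 2], size_fn, max_len=10)
except AttributeError as err:
    message = str(err)
```

If this pattern is the cause, comparing the installed fairseq version against the one the training code was written for would be a reasonable first check.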

hertz-pj commented 4 years ago

> Hello, we are aware that there's a lot of interest in the training code. The original LASER training code was based on a version of fairseq that is now almost a year old. We are working on a substantially improved version of LASER training that will use the current fairseq and scale much better to many languages. Please be patient :-)

Hello, how is this project going?

sebastian-nehrdich commented 4 years ago

Any update on this? I am very interested in training my own models too.

ever4244 commented 4 years ago

> Any update on this? I am very interested in training my own models too.

I have used this code for training. It achieves similar performance.

https://github.com/raymondhs/fairseq-laser

nreimers commented 4 years ago

Hi @sebastian-nehrdich

This is not the LASER training code, but if you are open to other multilingual sentence embedding training methods that work better than LASER on several tasks: https://github.com/UKPLab/sentence-transformers/blob/master/docs/training/multilingual-models.md

Details can be found in this paper: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation https://arxiv.org/abs/2004.09813
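
A minimal sketch of the training objective described in that paper, assuming plain Python lists as embeddings: a student model is trained so that its vectors for both a source sentence and its translation match a fixed teacher's vector for the source sentence, using mean squared error. The function names and the toy vectors are illustrative assumptions, not code from the repository.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Pull both student vectors toward the teacher's source-sentence vector."""
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)


# Toy 2-dimensional embeddings (assumed values, not model outputs).
teacher_src = [1.0, 0.0]   # teacher embedding of the source sentence
student_src = [1.0, 0.0]   # student embedding of the same source sentence
student_tgt = [0.0, 0.0]   # student embedding of the translated sentence

loss = distillation_loss(teacher_src, student_src, student_tgt)
```

In this toy case the student already matches the teacher on the source sentence, so the whole loss comes from the translation, which is the term that makes the student multilingual.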

Celebio commented 3 years ago

Hi @okgrammer, the training code is available here.

Onur