facebookresearch / LASER

Language-Agnostic SEntence Representations

About Sentence Encoders #3

Closed Fethbita closed 1 year ago

Fethbita commented 6 years ago

The install_models.sh file downloads three files: blstm.ep7.9langs-v1.bpej20k.model.py, ep7.9langs-v1.bpej20k.bin.9xx, and ep7.9langs-v1.bpej20k.codes.9xx. The mlenc.py file describes bpe_codes as a "File with BPE codes (created by learn_bpe.py)", and the research paper refers to it as a "20k joint vocabulary for all the nine languages". I created this with learn_bpe.py on my own data as described, but I don't quite understand how to create the other two: hash_table ("File with hash table for binarization.") and model ("File with trained model used for encoding."). Any idea how I can create the hash_table and model files? I couldn't find any documentation about them or sample code for training them. Thanks in advance.

hoschwenk commented 6 years ago

Hello, thanks for your interest in this work. In the current version of the code, there is no support for training new sentence embeddings on your own data or for other languages. Normally, this should not be necessary, since our experiments have shown that the embeddings seem to be very generic and perform well on several tasks. We plan to provide code to train new encoders in the future.

To calculate sentence embeddings for arbitrary texts, just use the existing pipeline, e.g. as it is used in tasks/similarity/sim.sh. It should be straightforward: you only need to call the bash functions "Tokenize" and "Embed" on your data. There is no need to calculate new BPE or binarization vocabularies.
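For illustration only, here is a minimal sketch of what such a call might look like, modeled on tasks/similarity/sim.sh. The path of the helper script that defines Tokenize and Embed, their argument order, and the file names below are assumptions, not the repo's documented interface, so check sim.sh in your checkout before running anything like this.

```bash
#!/bin/bash
# Hypothetical sketch: embed raw text with the pretrained 9-language model.
# Assumptions (verify against tasks/similarity/sim.sh in the repo):
#   - the bash functions Tokenize and Embed are defined in a helper script
#     that sim.sh sources (the path below is a placeholder)
#   - Tokenize takes: <raw text file> <tokenized output file> <language>
#   - Embed takes:    <tokenized file> <language> <embedding output file>

set -e

# placeholder path; source whatever script sim.sh actually sources
source ./tools/embed_functions.sh

lang="en"
input="my_sentences.txt"          # one sentence per line

Tokenize "${input}" "my_sentences.tok" "${lang}"
Embed "my_sentences.tok" "${lang}" "my_sentences.embed"

# my_sentences.embed now holds one fixed-size sentence vector per input line
```

Whatever the exact signatures are, the point is the same as in the comment above: the pretrained BPE codes, hash table, and model downloaded by install_models.sh are reused as-is, and no new vocabularies need to be learned.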

Don't hesitate to contact me again if you need further assistance.

SbstnErhrdt commented 5 years ago

> We plan to provide code to train new encoders in the future.

Is there anything new regarding this?

Greetings Seb

olga-gorun commented 4 years ago

I know that this issue is closed, but I'd like to continue the discussion. This year we had at least two big events that are expected to affect vocabulary and context: BLM and, especially, everything related to COVID-19. Since embeddings basically encode a word through the contexts it appears in, and real-world context has changed drastically, retraining on texts available today would be expected not only to add new words but also to change the encodings of existing words. Do you plan to retrain the embeddings, or to give users the possibility to do it themselves?

Fethbita commented 4 years ago

I'll open the issue again. If it's necessary, it can be closed and locked.

avidale commented 1 year ago

Hi @Fethbita! Last year, the embeddings were retrained for 200 languages (so-called LASER-2 and LASER-3 embeddings, see https://github.com/facebookresearch/LASER/tree/main/nllb).

Also, there is code for training new LASER models from scratch (https://github.com/facebookresearch/fairseq/tree/nllb/examples/laser) and for distilling them for new languages (https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/laser_distillation).

I hope this satisfies the request both for new embeddings and for the tools to update them yourself.