Closed. Fethbita closed this issue 1 year ago.
Hello, thanks for your interest in this work. In the current version of the code, there is no support for training new sentence embeddings on your own data or for other languages. Normally this should not be necessary, since our experiments have shown that the embeddings seem to be very generic and perform well on several tasks. We plan to provide code to train new encoders in the future.
To calculate sentence embeddings for arbitrary texts, just use the existing pipeline, e.g. as it is used in tasks/similarity/sim.sh. It should be straightforward: you only need to call the bash functions "Tokenize" and "Embed" on your data. There is no need to calculate new BPE or binarization vocabularies.
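The Embed step writes the embeddings as a raw binary matrix of 32-bit floats with 1024 dimensions per sentence, so reading them back in Python only needs numpy. A minimal sketch (the file name my_data.enc is just an example; use whatever output path you passed to the pipeline):

```python
import numpy as np

dim = 1024  # LASER sentence embeddings are 1024-dimensional
# read the raw float32 output produced by the Embed step
emb = np.fromfile("my_data.enc", dtype=np.float32, count=-1)
emb = emb.reshape(-1, dim)  # one row per input sentence
print(emb.shape)
```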
Don't hesitate to contact me again if you need further assistance.
"We plan to provide code to train new encoders in the future."
Is there anything new regarding this?
Greetings, Seb
I know that this issue is closed but I'd like to continue the discussion. This year we had at least two big events that can be expected to change vocabulary and context: BLM and especially everything related to COVID-19. Since embeddings essentially encode words based on the contexts in which they appear, and the real-world context has changed drastically, retraining on texts available today would not only add new words but also change the encodings of existing words. Do you plan to retrain the embeddings, or give users the possibility to do it themselves?
I'll open the issue again. If it's necessary, it can be closed and locked.
Hi @Fethbita! Last year, the embeddings were retrained for 200 languages (so-called LASER-2 and LASER-3 embeddings, see https://github.com/facebookresearch/LASER/tree/main/nllb).
Also, there is code for training new LASER models from scratch (https://github.com/facebookresearch/fairseq/tree/nllb/examples/laser) and for distilling them for new languages (https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/laser_distillation).
I hope this satisfies the request both for new embeddings and for the tools to update them yourself.
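Not an official snippet, but for anyone landing here: one way to get LASER-2/LASER-3 sentence embeddings programmatically is the laser_encoders package that the LASER repository now points to. The package name, class, and method below are taken from its README and may change, and the language code is only an example; treat this as a sketch rather than a stable API.

```python
# pip install laser_encoders   (assumed package name; check the LASER repo for current install instructions)
from laser_encoders import LaserEncoderPipeline

# LASER-3 uses language-specific encoders; the FLORES-200 style code below is an example
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["This is a test sentence.", "And a second one."])
print(embeddings.shape)  # expected: (2, 1024)
```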
The install_models.sh file downloads 3 files: one is the blstm.ep7.9langs-v1.bpej20k.model.py file, and the other two are ep7.9langs-v1.bpej20k.bin.9xx and ep7.9langs-v1.bpej20k.codes.9xx.
The mlenc.py file says that bpe_codes is a "File with BPE codes (created by learn_bpe.py)", and the research paper describes it as a "20k joint vocabulary for all the nine languages". I created this using learn_bpe.py on my own data as mentioned, but I don't quite understand how to create the other two: hash_table ("File with hash table for binarization.") and model ("File with trained model used for encoding."). Any idea how I can create the hash_table and model files? I couldn't find any documentation about them or sample code to train them. Thanks in advance.
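For reference, this is roughly how I created the BPE codes, in case it helps others. It assumes the learn_bpe.py referenced by mlenc.py behaves like the standard subword-nmt implementation; the subword_nmt package name, the file paths, and the 20,000 merge operations (mirroring the paper's "20k joint vocabulary") are my assumptions.

```python
from subword_nmt.learn_bpe import learn_bpe

# concatenate the tokenized training data of all languages into one file first,
# e.g. `cat train.tok.* > train.tok.all` (paths are examples)
with open("train.tok.all", encoding="utf-8") as infile, \
        open("my.bpej20k.codes", "w", encoding="utf-8") as outfile:
    # 20000 merge operations -> a joint BPE vocabulary shared across all languages
    learn_bpe(infile, outfile, num_symbols=20000, min_frequency=2)
```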