embed models - Githubissues

guangyuli-uoe commented 2 years ago

hi,

If I want to embed both Chinese and English text, which model should I download?

guangyuli-uoe commented 2 years ago

It says that 'LASER2 and all LASER3 encoders are downloaded by default'.

Where can I find them?

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:

1. Download a LASER2 model (this command also fetches the Wolof model, which you can ignore for now): ./LASER/nllb/download_models.sh wol_Latn
2. In LASER/tasks/embed/embed.sh, set the model directory (model_dir) to the location where the download script above saved the laser2 model.
3. Embed both the Chinese and English texts with: ./embed.sh [infile] [outfile]
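
For readers who prefer to skip the wrapper script, the steps above boil down to one call to embed.py. The following is only a sketch: build_embed_command is a hypothetical helper, not part of LASER, and the paths are placeholders. The flags (--input, --encoder, --spm-model, --output, --verbose) and the file names laser2.pt and laser2.spm are the ones that appear in the logs further down this thread.

```python
from pathlib import Path
from typing import List


def build_embed_command(laser_root: str, model_dir: str,
                        infile: str, outfile: str) -> List[str]:
    """Assemble the embed.py call (hypothetical helper, not part of LASER)."""
    model_dir_path = Path(model_dir)
    return [
        "python3",
        str(Path(laser_root) / "source" / "embed.py"),
        "--input", infile,
        "--encoder", str(model_dir_path / "laser2.pt"),
        "--spm-model", str(model_dir_path / "laser2.spm"),
        "--output", outfile,
        "--verbose",
    ]


# Placeholder paths: replace them with your own checkout and model directory.
cmd = build_embed_command("/path/to/LASER", "/path/to/LASER/nllb",
                          "doc.zh.txt", "doc.zh.emd")
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting problems with file names that contain spaces.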

guangyuli-uoe commented 2 years ago

hi,

@heffernankevin

Thank you very much for your kind replies and suggestions! ^^

However, I ran into a segmentation fault in both the bucc and embed tasks, which I think is the main error:

(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model $spm --output ${outfile} --verbose

Processing BUCC data in .

extract from tar bucc2018-fr-en.sample-gold.tar.bz2
extract from tar bucc2018-fr-en.test.tar.bz2
extract from tar bucc2018-fr-en.training-gold.tar.bz2
extract files ./embed/bucc2018.fr-en.dev in en
extract files ./embed/bucc2018.fr-en.dev in fr
extract files ./embed/bucc2018.fr-en.train in en
extract files ./embed/bucc2018.fr-en.train in fr
extract files ./embed/bucc2018.fr-en.test in en
extract files ./embed/bucc2018.fr-en.test in fr
2022-07-19 21:54:32,562 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81096 Broken pipe: 13 cat ${txt}
     81097 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
2022-07-19 21:54:34,664 | INFO | embed | loading encoder: /Users/liguangyu/LASER/models/bilstm.93langs.2018-12-26.pt
./bucc.sh: line 82: 81100 Broken pipe: 13 cat ${txt}
     81101 Segmentation fault: 11 | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${ll} --bpe-codes ${bpe_codes} --output ${enc} --verbose
LASER: tool to search, score or mine bitexts
knn will run on CPU (slow)
loading texts ./embed/bucc2018.fr-en.train.txt.fr: 271874 lines, 270775 unique
loading texts ./embed/bucc2018.fr-en.train.txt.en: 369810 lines, 368033 unique
Traceback (most recent call last):
  File "/Users/liguangyu/LASER/source/mine_bitexts.py", line 215, in <module>
    x = EmbedLoad(args.src_embeddings, args.dim, verbose=args.verbose)
  File "/Users/liguangyu/LASER/source/embed.py", line 451, in EmbedLoad
    x = np.fromfile(fname, dtype=np.float32, count=-1)
FileNotFoundError: [Errno 2] No such file or directory: './embed/bucc2018.fr-en.train.enc.fr'
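
The FileNotFoundError above looks like a follow-on failure. The embed step crashed, so the .enc file that mine_bitexts.py tries to read was never written. A quick pre-check before mining could look like the sketch below. count_embeddings is a hypothetical helper, not part of LASER. It assumes the file is raw float32 data, as the np.fromfile call in the traceback suggests, and that each vector has 1024 dimensions. The dimension is an assumption here, so pass whatever value you give to --dim.

```python
import os


def count_embeddings(path: str, dim: int = 1024) -> int:
    """Return how many float32 vectors of size `dim` a raw embedding file holds.

    dim=1024 is an assumed default; use the same value you pass to the miner.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"embedding file missing (did the embed step crash?): {path}"
        )
    size = os.path.getsize(path)
    bytes_per_vector = dim * 4  # one float32 takes 4 bytes
    if size == 0 or size % bytes_per_vector != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a positive multiple of {bytes_per_vector}"
        )
    return size // bytes_per_vector
```

Calling count_embeddings('./embed/bucc2018.fr-en.train.enc.fr') before the mining step would raise immediately with a message that points back at the embedding stage.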

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command? It might help pinpoint the cause.

python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
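
One more way to narrow down a crash like this is to run the same script with CPython's built-in faulthandler enabled (python -X faulthandler). When the process receives a fatal signal such as SIGSEGV, it prints the Python-level stack. The sketch below is a suggestion and not something taken from the LASER repository. run_with_faulthandler and describe_exit are hypothetical helper names, and the signal decoding relies on POSIX behaviour, where subprocess reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys
from typing import List


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


def run_with_faulthandler(script_args: List[str]) -> str:
    """Run the current interpreter with -X faulthandler on the given arguments."""
    proc = subprocess.run([sys.executable, "-X", "faulthandler", *script_args])
    return describe_exit(proc.returncode)


# Example call (placeholder paths, same flags as the command above):
# run_with_faulthandler(["/path/to/LASER/source/embed.py",
#                        "--input", "doc.zh.txt", "--output", "doc.zh.emd",
#                        "--encoder", "/path/to/laser2.pt",
#                        "--spm-model", "/path/to/laser2.spm", "--verbose"])
```

If the dumped stack ends inside a torch call, that would be consistent with a library-level problem and not a bug in the script itself.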

guangyuli-uoe commented 2 years ago

hi @heffernankevin

Thanks for your reply. Here are the details:

heffernankevin commented 2 years ago

@guangyuli-uoe thanks for checking! This could be related to pytorch. Can you try upgrading pytorch and re-running? Which version of pytorch are you currently running? (pip show torch). There were similar issues on other repos which seem to be related to specific pytorch versions.
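
As an alternative to pip show torch, the installed version can be read from Python itself. The sketch below uses the standard-library importlib.metadata module (Python 3.8+). installed_version and release_tuple are hypothetical helper names. The only version number it uses is 1.12.0, the one reported as crashing on this particular machine. The thread does not say which version resolved the problem, so the check only flags "not newer than the reported one".

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.12.0+cu113' into (1, 12, 0); local and pre-release tags are dropped."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Version reported as segfaulting in this thread (one machine, not a general rule).
REPORTED_CRASHING = "1.12.0"

current = installed_version("torch")
if current is None:
    print("torch is not installed in this environment")
elif release_tuple(current) <= release_tuple(REPORTED_CRASHING):
    print(f"torch {current} is not newer than {REPORTED_CRASHING}; consider upgrading")
else:
    print(f"torch {current} is newer than the version reported in this thread")
```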

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

this is the current version:

Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision

heffernankevin commented 2 years ago

Closing issue as user reported no more segmentation faults and was able to run embed script successfully after upgrading the pytorch version (see comment here).

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

Just to confirm: can the wol_Latn model handle both Chinese and English?

heffernankevin commented 2 years ago

The "laser2" model can handle both Chinese and English, and it is already downloaded in your model directory (/Users/liguangyu/LASER/nllb/laser2.pt). The embed.sh script uses it by default, e.g., embed.sh [infile] [outfile].
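
To confirm that the laser2 files are where embed.sh expects them, a small check like the following can help. It is only a sketch: missing_model_files is a hypothetical helper, and the three file names are taken from the log output earlier in this thread (laser2.pt, laser2.spm, laser2.cvocab). Other LASER versions may ship a different set of files.

```python
from pathlib import Path
from typing import List

# File names as they appear in the embed.sh log earlier in this thread.
EXPECTED_FILES = ("laser2.pt", "laser2.spm", "laser2.cvocab")


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected laser2 files that are not present in `model_dir`."""
    directory = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


# Example with a placeholder path:
# print(missing_model_files("/path/to/LASER/nllb"))
```

An empty list means all three files were found. Any names in the result point to a download that did not complete or a model_dir that points elsewhere.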

facebookresearch / LASER

embed models #211