Closed guangyuli-uoe closed 2 years ago
It says that 'LASER2 and all LASER3 encoders are downloaded by default'.
Where can I find them?
Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:
1. Run the download script: ./LASER/nllb/download_models.sh wol_Latn
2. In LASER/tasks/embed/embed.sh, set the model directory (model_dir) to the location where you ran the download script above (i.e. wherever the laser2 model was downloaded to).
3. Run: ./embed.sh [infile] [outfile]
Hi @heffernankevin,
Thanks a lot for your kind replies and suggestions! ^^
However, I ran into this problem (Segmentation fault) in both the bucc and embed tasks. I think this is the main error:
(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model $spm --output ${outfile} --verbose
Processing BUCC data in .
Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command? It might help pinpoint the cause:
python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
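If the plain command still only reports a segmentation fault, Python's built-in faulthandler can print the Python traceback at the moment of the crash, which may narrow down where it happens. A minimal sketch, assuming a POSIX system; the helper names run_with_faulthandler and describe_exit are hypothetical and not part of LASER:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with code %d" % returncode


def run_with_faulthandler(script_args):
    """Run a Python script with -X faulthandler so a crash dumps a traceback."""
    # -X faulthandler is a standard CPython option; it installs handlers for
    # fatal signals such as SIGSEGV and writes the traceback to stderr.
    cmd = [sys.executable, "-X", "faulthandler"] + list(script_args)
    completed = subprocess.run(cmd)
    return completed.returncode
```

To use it, pass the path to embed.py followed by the same arguments as in the command above as a list to run_with_faulthandler, then print describe_exit() of the result. A segmentation fault should show up as "killed by SIGSEGV", with the traceback printed just before it.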
Hi @heffernankevin,
Thanks for your reply. Here are the details:
2022-07-19 22:50:51,584 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-19 22:50:51,584 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:50:51,584 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
zsh: segmentation fault python /Users/liguangyu/LASER/source/embed.py --input './1/doc.zh.txt'
@guangyuli-uoe thanks for checking! This could be related to pytorch. Can you try upgrading pytorch and re-running? Which version of pytorch are you currently running (pip show torch)? There were similar issues on other repos which seem to be related to specific pytorch versions.
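As an alternative to pip show, the installed version can also be read from Python without importing torch itself, which is handy when comparing before and after an upgrade. A minimal sketch using the standard library (Python 3.8+); the helper names are hypothetical, and torchvision and torchaudio are listed only on the assumption that they are installed alongside torch:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse the leading numeric parts, e.g. '1.12.0' -> (1, 12, 0)."""
    parts = []
    # Drop a local suffix such as '+cu118' before splitting on dots.
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


# Report the versions of torch and the packages that usually track it.
for name in ("torch", "torchvision", "torchaudio"):
    print(name, installed_version(name))
```

Reading the metadata does not load the torch library, so this check is a cheap first step before re-running the embedding command.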
hi, @heffernankevin
this is the current version:
Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision
Closing issue as user reported no more segmentation faults and was able to run embed script successfully after upgrading the pytorch version (see comment here).
Hi @heffernankevin,
I just want to make sure: can the model wol_Latn handle both Chinese and English?
The model "laser2" can handle both Chinese and English (it is already downloaded in your model directory: /Users/liguangyu/LASER/nllb/laser2.pt). It will be used by default by the embed.sh script, e.g. embed.sh [infile] [outfile].
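Once embed.sh has written [outfile], the embeddings can be loaded back for inspection. A minimal sketch, assuming the output is a raw float32 matrix with 1024 values per sentence (the format described in the LASER README); read_embeddings is a hypothetical helper, written with the standard library only so it runs without NumPy:

```python
import array
import os

DIM = 1024  # assumed number of float32 values per sentence embedding


def read_embeddings(path, dim=DIM):
    """Read a raw float32 file into a list of per-sentence vectors."""
    floats = array.array("f")
    if floats.itemsize != 4:
        raise RuntimeError("expected 4-byte floats on this platform")
    n_bytes = os.path.getsize(path)
    if n_bytes % floats.itemsize != 0:
        raise ValueError("file size is not a multiple of 4 bytes")
    with open(path, "rb") as f:
        floats.fromfile(f, n_bytes // floats.itemsize)
    if len(floats) % dim != 0:
        raise ValueError("file does not hold a whole number of vectors")
    # Slice the flat buffer into one vector per input sentence.
    return [floats[i:i + dim] for i in range(0, len(floats), dim)]
```

The number of vectors returned should match the number of lines in the input file, which is a quick sanity check that the Chinese and English files were both embedded completely.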
Hi,
If I want to embed text in Chinese and text in English, which model should I download?