facebookresearch / LASER

Language-Agnostic SEntence Representations

embed models #211

Closed guangyuli-uoe closed 2 years ago

guangyuli-uoe commented 2 years ago

Hi,

If I want to embed text in both Chinese and English, which model should I download?

guangyuli-uoe commented 2 years ago

The documentation says that 'LASER2 and all LASER3 encoders are downloaded by default'.

Where can I find them?

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, in order to embed both Chinese and English texts, you could use the laser2.pt model. Can you try the following:

  1. Download the LASER2 model (this command also downloads the Wolof model, which you can disregard for now): ./LASER/nllb/download_models.sh wol_Latn
  2. Open LASER/tasks/embed/embed.sh and set the model directory (model_dir) to the location where you ran the download script above (i.e. wherever the laser2 model was downloaded to).
  3. Embed both Chinese and English texts using the following: ./embed.sh [infile] [outfile]
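
Once embed.sh has written the output file, it can be inspected from Python. The sketch below assumes the format described in the LASER README: a raw binary file of float32 values, 1024 per sentence. The helper names load_embeddings and cosine are illustrative and not part of LASER.

```python
import math
import os
from array import array

DIM = 1024  # assumed LASER embedding size (per the LASER README)


def load_embeddings(path, dim=DIM):
    """Read a raw float32 file and return one list of floats per sentence."""
    values = array("f")
    n_bytes = os.path.getsize(path)
    # Each row must hold exactly `dim` float32 values.
    if n_bytes % (values.itemsize * dim) != 0:
        raise ValueError(f"{path}: {n_bytes} bytes is not a whole number of {dim}-float rows")
    with open(path, "rb") as f:
        values.fromfile(f, n_bytes // values.itemsize)
    return [values[i:i + dim].tolist() for i in range(0, len(values), dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under these assumptions, load_embeddings on a Chinese output file and on an English output file embedded the same way gives vectors that can be compared row by row with cosine.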

guangyuli-uoe commented 2 years ago

hi,

@heffernankevin

Thank you very much for your kind replies and suggestions!

However, I ran into a segmentation fault in both the bucc and embed tasks, and I think this is the main error:

(laser2) liguangyu@liguangyudeMacBook-Pro embed % ./embed.sh './1/doc.zh.txt' './emd/doc.zh.emd'
2022-07-19 22:01:16,857 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb//laser2.spm
2022-07-19 22:01:16,857 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:01:16,857 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb//laser2.pt
./embed.sh: line 80: 81173 Segmentation fault: 11 python3 ${LASER}/source/embed.py --input ${infile} --encoder ${model_file} --spm-model $spm --output ${outfile} --verbose

Processing BUCC data in .

heffernankevin commented 2 years ago

Hi @guangyuli-uoe, there seems to be a memory-related issue. Can you try the following command? It might help pinpoint the cause.

python $LASER/source/embed.py --input './1/doc.zh.txt' --output './emd/doc.zh.emd' --encoder  /Users/liguangyu/LASER/nllb/laser2.pt --spm-model /Users/liguangyu/LASER/nllb/laser2.spm --verbose
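
If the crash persists, Python's built-in faulthandler can print the Python-level stack at the moment of the segmentation fault, which may narrow down where it happens. This is a minimal sketch, assuming a POSIX system; run_with_faulthandler is a hypothetical helper, and the arguments passed to it would be the embed.py path and options from the command above.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run a Python script with faulthandler enabled.

    Returns (segfaulted, stderr_text). On POSIX, a negative return code
    means the child process was killed by that signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *script_args],
        capture_output=True,
        text=True,
    )
    segfaulted = proc.returncode == -signal.SIGSEGV
    return segfaulted, proc.stderr
```

When a segmentation fault does occur, the returned stderr text should contain the traceback that faulthandler dumped before the process died.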

guangyuli-uoe commented 2 years ago

hi @heffernankevin

Thanks for your reply. Here are the details:

2022-07-19 22:50:51,584 | INFO | embed | spm_model: /Users/liguangyu/LASER/nllb/laser2.spm
2022-07-19 22:50:51,584 | INFO | embed | spm_cvocab: /Users/liguangyu/LASER/nllb/laser2.cvocab
2022-07-19 22:50:51,584 | INFO | embed | loading encoder: /Users/liguangyu/LASER/nllb/laser2.pt
zsh: segmentation fault python /Users/liguangyu/LASER/source/embed.py --input './1/doc.zh.txt'

heffernankevin commented 2 years ago

@guangyuli-uoe thanks for checking! This could be related to PyTorch. Which version of PyTorch are you currently running (pip show torch)? Can you try upgrading PyTorch and re-running? There were similar issues on other repos which seem to be related to specific PyTorch versions.
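
The installed version can also be read from Python without importing torch itself. This is a small sketch using only the standard library; installed_version is an illustrative name, not a LASER or PyTorch function.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version if torch is installed in this environment, otherwise None.
print("torch:", installed_version("torch"))
```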

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

this is the current version:

Name: torch
Version: 1.12.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/liguangyu/opt/anaconda3/envs/laser2/lib/python3.8/site-packages
Requires: typing-extensions
Required-by: fairseq, sentence-transformers, torchaudio, torchvision

heffernankevin commented 2 years ago

Closing issue as user reported no more segmentation faults and was able to run embed script successfully after upgrading the pytorch version (see comment here).

guangyuli-uoe commented 2 years ago

hi, @heffernankevin

I just want to confirm: can the wol_Latn model handle both Chinese and English?

heffernankevin commented 2 years ago

The "laser2" model can handle both Chinese and English, and it is already downloaded in your model directory: /Users/liguangyu/LASER/nllb/laser2.pt. The embed.sh script uses it by default, e.g., embed.sh [infile] [outfile].