facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.13k stars 6.36k forks

Segmentation fault while decoding with KenLM #2628

Open anvgarg opened 3 years ago

anvgarg commented 3 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd python examples/speech_recognition/infer.py ~/fairseq/benchmark --task audio_pretraining --nbest 1 --path ~/fairseq/model/checkpoint_last.pt --results-path ~/fairseq/benchmark --w2l-decoder kenlm --lm-model ~/friday/training/acoustic/language_model/en_corpus.arpa --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 159631 --post-process letter --lexicon model/lexicon.lexico
  2. See error

Code sample

Expected behavior

Environment

Additional context

I am trying to use wav2vec 2.0 on my own dataset. The inference script works fine with --w2l-decoder viterbi, but gives me a segmentation fault when I try it with --w2l-decoder kenlm. I have followed all the installation steps listed in the wav2letter repository. I tried to debug and found that the trie inserts are causing the segmentation fault.
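Since the crash happens in native code, one general way to capture at least the Python side of a stacktrace (a generic debugging sketch, not part of fairseq) is the standard-library faulthandler module:

```python
# Enable Python's faulthandler so a native segfault (e.g. inside the trie
# insert) dumps the Python-level traceback to stderr, showing which call
# crashed. Add this at the top of infer.py, or run `python -X faulthandler ...`.
import faulthandler

faulthandler.enable()  # installs dump-traceback handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS
# ... then run the decoder as usual; on a segfault the traceback goes to stderr
```

For the native C++ frames themselves, running the script under gdb (`gdb --args python examples/speech_recognition/infer.py ...`, then `bt` after the crash) gives a fuller stacktrace.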

wahyubram82 commented 3 years ago

Yes, I found this bug too, in the trie insert call.

Still no clue how to solve it so far. I will try asking in the KenLM issue tracker.

Or maybe cc @alexeib, who may have a solution for it?

alexeib commented 3 years ago

i think kenlm by default only supports up to 5gram, and then you need to recompile it with diff flags to support more grams than that. dunno if that is the issue here though. stacktrace might be helpful
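One way to check whether your model's order exceeds KenLM's compiled limit is to read the order declared in the ARPA file's \data\ section. A small hypothetical helper (not part of fairseq or KenLM):

```python
# Read the highest n-gram order declared in an ARPA LM file's \data\ section,
# e.g. to compare against the KENLM_MAX_ORDER your KenLM build supports
# (the default build supports up to 6).
import re

def arpa_order(path):
    """Return the highest n-gram order declared in an ARPA LM file."""
    order = 0
    in_data = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_data = True
                continue
            if in_data:
                m = re.match(r"ngram (\d+)=\d+", line)
                if m:
                    order = max(order, int(m.group(1)))
                elif line:
                    break  # left the \data\ section
    return order
```

If `arpa_order("en_corpus.arpa")` is larger than the order KenLM was compiled with, loading the model will fail or misbehave.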

lucasgris commented 3 years ago

@wahyubram82 @anvgarg I was facing the same issue here, but I was able to work around it with Docker.

  1. Configure a Docker container as explained in https://github.com/facebookresearch/wav2letter/wiki/Building-Python-bindings
     1. You can use either wav2letter/wav2letter:cuda-latest or wav2letter/wav2letter:cpu-latest.
     2. Install additional packages with pip, for example soundfile, packaging, and torch.
     3. Install fairseq.
  2. Proceed with the inference normally.
     1. It will show a UserWarning as if the Python bindings are not installed; ignore it.
     2. Both CPU and GPU containers worked for me, but for some reason (I didn't investigate further) the WER I got with the CPU container was too high compared to the GPU container.

If you have problems related to locale settings (unicode errors in Python), you can try to execute the following commands in your container:

apt-get update -y
apt-get install --reinstall -y locales
sed -i 's/# pl_PL.UTF-8 UTF-8/pl_PL.UTF-8 UTF-8/' /etc/locale.gen
locale-gen pl_PL.UTF-8
export LANG=pl_PL.UTF-8
export LANGUAGE=pl_PL
export LC_ALL=pl_PL.UTF-8

I also experienced some issues with the Python dataloader; to resolve them, you can try passing --num-workers 0 to infer.py.

Initially I was trying to build kenlm and the wav2letter Python bindings locally, but it didn't work. I am not sure if I made some mistake building these libraries or if following the instructions by tlikhomanenko in https://github.com/facebookresearch/wav2letter/issues/775 could cause some error. I needed to make that modification in order for viterbi to work. I also used USE_CUDA=0 when building locally.

HJJ256 commented 3 years ago

i think kenlm by default only supports up to 5gram, and then you need to recompile it with diff flags to support more grams than that. dunno if that is the issue here though. stacktrace might be helpful

Hello @alexeib, if I compile KenLM with -DKENLM_MAX_ORDER=25, I get a segmentation fault (core dumped) or a corrupted size vs. prev_size error when running examples/speech_recognition/infer.py with a CTC wav2vec 2.0 model, any word/char-level LM, and the corresponding lexicon.

Please help

HJJ256 commented 3 years ago


@alexeib It would be kind of you to reply

maggieezzat commented 3 years ago

Has anyone found a solution to this problem? I get Segmentation fault (core dumped) when decoding with a 20-gram character-based language model.

kamirdin commented 2 years ago

@maggieezzat @HJJ256 @alexeib @lucasgris @anvgarg Hi guys, I get Segmentation fault (core dumped) when running infer.py, at:

from flashlight.lib.text.decoder import KenLM
lm = KenLM(lm_path, word_dict)

I use the default KENLM_MAX_ORDER setting and still get this error :cry:

imMonicaShaw commented 4 months ago

Hello everyone, has anyone solved this problem yet? When I use a 20-gram language model with KenLM, the following error occurs:

RuntimeError: /root/anaconda3/envs/meta/temp_dir/kenlm/lm/model.cc:49 in void lm::ngram::detail::{anonymous}::CheckCounts(const std::vector&) threw FormatLoadException because `counts.size() > 6'.
This model has order 20 but KenLM was compiled to support up to 6. If your build system supports changing KENLM_MAX_ORDER, change it there and recompile.
With cmake: cmake -DKENLM_MAX_ORDER=10 ..
With Moses: bjam --max-kenlm-order=10 -a
Otherwise, edit lm/max_order.hh.
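The error message spells out the fix. As a build-recipe sketch (it assumes a local checkout of the KenLM sources, e.g. https://github.com/kpu/kenlm; for a 20-gram model the max order must be at least 20, not the 10 the message uses as an example):

```shell
# Rebuild KenLM with support for up to 20-grams (the default compiled
# maximum is 6), following the cmake instructions from the error message.
cd kenlm
mkdir -p build && cd build
cmake -DKENLM_MAX_ORDER=20 ..
make -j4
```

Note that if KenLM is compiled into another package (e.g. the flashlight/wav2letter Python bindings), that package has to be rebuilt against the recompiled KenLM as well, since the limit is baked in at its compile time.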