flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Segmentation fault after compiling kenlm with -DKENLM_MAX_ORDER=20 #945

Closed innarid closed 3 years ago

innarid commented 3 years ago

Hi! I wanted to use a character-level kenlm model, but I got a segmentation fault. I found in issue #875 that the cause was related to -DKENLM_MAX_ORDER=20 when building kenlm. So now I use a 6-gram model, but that is too small for character-level modeling. Is there any way to increase KENLM_MAX_ORDER?

tlikhomanenko commented 3 years ago

As you pointed out, you can just rebuild kenlm with -DKENLM_MAX_ORDER=20 (or whatever maximum n-gram order you want to support) and then rebuild wav2letter. If you are doing this in a docker image, simply repeat the kenlm installation steps from the Dockerfile with this additional option and rebuild wav2letter.

The issue with n-gram order is that kenlm uses different compile-time constants depending on the maximum n-gram order it is built for.
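Because the limit is fixed at compile time, it can help to check which order a trained model actually declares before loading it. The sketch below reads the `ngram N=count` lines from an ARPA header and compares the highest order with the limit the build is assumed to use. The function names are hypothetical and this is not part of kenlm or flashlight; the default limit of 6 is an assumption about the stock kenlm build.

```python
import re

# Matches header lines such as "ngram 3=12345" in an ARPA file.
NGRAM_COUNT = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def arpa_order(lines):
    """Return the highest n-gram order declared in an ARPA header.

    `lines` is any iterable of text lines, for example an open file.
    """
    order = 0
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = NGRAM_COUNT.match(line)
        if match:
            order = max(order, int(match.group(1)))
        elif line.startswith("\\"):
            # The first "\1-grams:" section marks the end of the header.
            break
    if order == 0:
        raise ValueError("no 'ngram N=count' lines found after \\data\\")
    return order


def fits_compiled_limit(model_order, compiled_max_order=6):
    """True if a model of `model_order` fits the assumed compile-time limit."""
    return model_order <= compiled_max_order


# Usage (the path is illustrative):
# with open("lm.arpa", encoding="utf-8") as f:
#     print(fits_compiled_limit(arpa_order(f), compiled_max_order=20))
```

A check like this only catches a model that is larger than the limit; it does not detect two components built with different limits.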

innarid commented 3 years ago

After rebuilding kenlm with -DKENLM_MAX_ORDER=20 and rebuilding the flashlight python bindings, I get a segmentation fault during decoding.

tlikhomanenko commented 3 years ago

Can you post the decoder log? Could you also post which build of kenlm you are using and how you trained the LM?

innarid commented 3 years ago

I am using kenlm from commit 0c4dd4e8a29a9bcaf22d971a83f4974f1a16d6d9. I built it with the following commands:

    cmake .. -DKENLM_MAX_ORDER=20
    sudo make install -j 4

I trained the LM with:

    ./bin/lmplz -o 20 -S 7G -T /tmp < $corpus_path > $lm_dir/lm.arpa --discount_fallback

My text corpus looks like this:

    н е в р и л а д н о к р и к н у л г а р р и с к а з а л я

Symbol is for space. After that I rebuilt the flashlight python bindings, and I got the following error:

    Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

tlikhomanenko commented 3 years ago

The installation and LM training look fine to me.

Could you post the whole log you get on the screen while running the python code? Another thing to try is building kenlm with -DCMAKE_POSITION_INDEPENDENT_CODE=ON, retraining the LM with that build, and rebuilding the bindings too.

innarid commented 3 years ago

That's the only line in the log. Sometimes it is instead:

    Segmentation fault (core dumped)

Rebuilding with -DCMAKE_POSITION_INDEPENDENT_CODE=ON didn't help.
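When a crash inside a native extension leaves only a bare SIGSEGV message, Python's standard `faulthandler` module can at least record which Python call was active at the time. This is a minimal sketch, assuming the crash happens in the same process that runs the decoding script. The helper name and the log path are hypothetical.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Dump Python tracebacks on fatal signals such as SIGSEGV.

    Writes to `log_path` if given, otherwise to stderr. Returns the
    stream in use; it must stay open for as long as the handler is on.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream


# Usage: call once at the top of the script, before importing the bindings.
# crash_log = enable_crash_traceback("crash_traceback.log")
```

The traceback shows Python frames only, so it narrows down the failing call but not the cause inside the C++ code.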

tlikhomanenko commented 3 years ago

Could you run decoding with --logtostderr=true so that the log is printed to the screen, and post it here?

Also, could you confirm that the flashlight build is using your rebuilt/reinstalled kenlm and not trying to build it by itself (we do this if kenlm is not found)?

innarid commented 3 years ago

errorlog.txt

I saved the error log in a txt file. Yes, I uninstalled kenlm from the system, built kenlm locally in a folder, and passed KENLM_ROOT to flashlight.

tlikhomanenko commented 3 years ago

Oh, you are running with the python bindings? Let me see if I can reproduce this in docker by reinstalling with max order 20.

innarid commented 3 years ago

Yep! I need the python bindings.

tlikhomanenko commented 3 years ago

Which version/commit are you using now for fl and w2l?

innarid commented 3 years ago

Commit 4d74995a285839f818e4aad501767946096bf493 for flashlight

tlikhomanenko commented 3 years ago

Ok, it seems I found the fix. The problem is that fl also has to be compiled with this kenlm flag. So the fix for you is to add -DKENLM_MAX_ORDER=20 at https://github.com/facebookresearch/flashlight/blob/master/bindings/python/setup.py#L81. Let me know if this fixes the issue.
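As an illustration of that change, a setup.py that drives CMake usually builds a list of `-D...` arguments, and the flag is appended to that list. The sketch below is not the actual contents of flashlight's setup.py: the helper name and the surrounding list are assumptions, and only the `-DKENLM_MAX_ORDER` flag itself comes from this thread.

```python
def with_kenlm_max_order(cmake_args, max_order):
    """Return a copy of `cmake_args` that sets -DKENLM_MAX_ORDER exactly once.

    Any earlier value of the flag is dropped so the build does not receive
    two conflicting definitions.
    """
    prefix = "-DKENLM_MAX_ORDER="
    kept = [arg for arg in cmake_args if not arg.startswith(prefix)]
    return kept + [f"{prefix}{int(max_order)}"]


# Hypothetical argument list as it might appear in a CMake-based setup.py.
cmake_args = ["-DCMAKE_BUILD_TYPE=Release"]
cmake_args = with_kenlm_max_order(cmake_args, 20)
```

The value passed here should match the one used when kenlm itself was built, since both sides rely on the same compile-time constant.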

innarid commented 3 years ago

Thank you so much! This solved the issue.