Bug: Segmentation Fault

Describe the bug

Running the generate_lm.py script fails: lmplz exits with status 127 because the Boost shared library libboost_program_options.so.1.71.0 cannot be found.

python3 /code/data/lm/generate_lm.py \
  --input_txt /app/assets/corpus.txt \
  --output_dir /app/assets/scorer \
  --top_k 1000 \
  --kenlm_bins /code/kenlm/build/bin \
  --arpa_order 4 \
  --arpa_prune "0" \
  --max_arpa_memory "85%" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie \
  --discount_fallback

Converting to lowercase and counting word occurrences ...
| |#                                                                                                                                                                | 407 Elapsed Time: 0:00:00

Saving top 1000 words ...

Calculating word statistics ...
  Your text file has 3134 words in total
  It has 39 unique words
  Your top-1000 words are 100.0000 percent of all words
  Your most common word "the" occurred 406 times
  The least common word in your top-k is "off" with 2 times
  The first word with 4 occurrences is "device" at place 36

Creating ARPA file ...
/code/kenlm/build/bin/lmplz: error while loading shared libraries: libboost_program_options.so.1.71.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/code/data/lm/generate_lm.py", line 231, in <module>
    main()
  File "/code/data/lm/generate_lm.py", line 215, in main
    build_lm(args, data_lower, vocab_str)
  File "/code/data/lm/generate_lm.py", line 98, in build_lm
    subprocess.check_call(subargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/code/kenlm/build/bin/lmplz', '--order', '4', '--temp_prefix', '/app/assets/scorer', '--memory', '85%', '--text', '/app/assets/scorer/lower.txt.gz', '--arpa', '/app/assets/scorer/lm.arpa', '--prune', '0', '--discount_fallback']' returned non-zero exit status 127.
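For anyone triaging the loader error above, one way to list every unresolved shared library at once (rather than finding them one failed run at a time) is to scan `ldd` output for "not found" entries. This is only a minimal sketch: the helper names `parse_missing` and `missing_shared_libs` are hypothetical and not part of STT or KenLM, and it assumes `ldd` is available inside the container.

```python
import subprocess
from typing import List


def parse_missing(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_shared_libs(binary: str) -> List[str]:
    """Run ldd on a binary and collect its unresolved libraries.

    Assumes ldd is on PATH; check=False because ldd may exit non-zero.
    """
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=False
    )
    return parse_missing(result.stdout)
```

Calling `missing_shared_libs("/code/kenlm/build/bin/lmplz")` in the image should return the libraries the loader cannot resolve; an empty list would mean the loader can find everything the binary links against.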

Installing the Boost libraries and rerunning the same command gets past the loader error, but lmplz then dies with a segmentation fault (SIGSEGV).

apt-get update
apt-get install -y libboost-all-dev

Converting to lowercase and counting word occurrences ...
| |#                                                                                                                                                                | 407 Elapsed Time: 0:00:00

Saving top 1000 words ...

Calculating word statistics ...
  Your text file has 3134 words in total
  It has 39 unique words
  Your top-1000 words are 100.0000 percent of all words
  Your most common word "the" occurred 406 times
  The least common word in your top-k is "off" with 2 times
  The first word with 4 occurrences is "device" at place 36

Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /app/assets/scorer/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Traceback (most recent call last):
  File "/code/data/lm/generate_lm.py", line 231, in <module>
    main()
  File "/code/data/lm/generate_lm.py", line 215, in main
    build_lm(args, data_lower, vocab_str)
  File "/code/data/lm/generate_lm.py", line 98, in build_lm
    subprocess.check_call(subargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/code/kenlm/build/bin/lmplz', '--order', '4', '--temp_prefix', '/app/assets/scorer', '--memory', '85%', '--text', '/app/assets/scorer/lower.txt.gz', '--arpa', '/app/assets/scorer/lm.arpa', '--prune', '0', '--discount_fallback']' died with <Signals.SIGSEGV: 11>.
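The two tracebacks report different kinds of failure: the first is an ordinary exit status (127), while the second says the process was killed by a signal. Python's `subprocess` exposes the latter as a negative return code. A small sketch of how the two cases can be told apart when wrapping the lmplz call; `describe_returncode` is a hypothetical helper, and the wording given for status 127 is the common shell and loader convention rather than anything specific to KenLM.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number
        return f"killed by {signal.Signals(-returncode).name}"
    if returncode == 127:
        # Conventionally: command not found, or the dynamic loader
        # could not resolve a shared library before the program started
        return "exit status 127 (command or shared library not found)"
    return f"exit status {returncode}"
```

With `subprocess.run(subargs, check=False)`, passing `result.returncode` to this helper describes the second failure as a signal rather than an exit status, which shows the crash happens inside lmplz itself and not in the Python wrapper.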

Executing lmplz directly, outside the Python script, produces the same segmentation fault.

/code/kenlm/build/bin/lmplz \
  --order 4 \
  --temp_prefix /app/assets/scorer \
  --memory 85% \
  --text /app/assets/scorer/lower.txt.gz \
  --arpa /app/assets/scorer/lm.arpa \
  --prune 0 \
  --discount_fallback

=== 1/5 Counting and sorting n-grams ===
Reading /app/assets/scorer/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Segmentation fault

To Reproduce

Dockerfile

FROM ghcr.io/coqui-ai/stt-train:v1.4.0

Corpus

set value of property to zero
set value of property to one
set value of property to two
set value of property to three
...
set the value of the property to one hundred
switch device on
switch device off
switch the device on
switch the device off

coqui-ai / STT

Bug: Segmentation Fault #2303