coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

[POC] Batch lm generation #2248

Closed wasertech closed 2 years ago

wasertech commented 2 years ago

@HarikalarKutusu made a copy of data/lm/generate_lm.py to create multiple LMs with a single command.

Unfortunately his implementation was rather lacking, so I made the following changes:

As a result, you can now do the following:

python data/lm/generate_lm_batch.py \
    --input_txt /mnt/extracted/sources_lm.txt \
    --output_dir /mnt/lm/ \
    --top_k_list 30000-50000 \
    --arpa_order_list "2-3" \
    --max_arpa_memory "85%" \
    --arpa_prune_list "0|0|2-0|0|3" \
    --binary_a_bits 255 \
    --binary_q_bits 8 \
    --binary_type trie \
    --kenlm_bins /code/kenlm/build/bin/ \
    -j 12
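
For readers wondering how the list-style arguments could turn into several LMs, here is a minimal sketch. It assumes that each `*_list` value is dash-separated and that the script builds one LM per combination of values (a Cartesian product). Both points are assumptions drawn from the shape of the command above, not confirmed behaviour of generate_lm_batch.py, and the helper names `split_list` and `build_grid` are hypothetical.

```python
from itertools import product


def split_list(value, cast=str):
    """Split a dash-separated CLI value such as "30000-50000" into a list.

    Assumption: "-" is the separator between alternatives.
    """
    return [cast(item) for item in value.split("-") if item]


def build_grid(top_k_list, arpa_order_list, arpa_prune_list):
    """Return one config dict per combination of the three lists.

    Assumption: the batch script trains one LM per combination
    (Cartesian product); the real script may pair values differently.
    """
    top_ks = split_list(top_k_list, int)
    orders = split_list(arpa_order_list, int)
    # Prune specs stay as strings, since "0|0|2" uses "|" internally.
    prunes = split_list(arpa_prune_list)
    return [
        {"top_k": k, "arpa_order": o, "arpa_prune": p}
        for k, o, p in product(top_ks, orders, prunes)
    ]


# Values taken from the command above.
grid = build_grid("30000-50000", "2-3", "0|0|2-0|0|3")
for cfg in grid:
    print(cfg)
```

Under these assumptions the command would describe 2 x 2 x 2 = 8 configurations, which is presumably why a parallel jobs flag (`-j`) is useful.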

Note: libboost-program-options-dev and libboost-thread-dev must be installed, otherwise lmplz crashes with:

libboost_program_options.so.1.71.0: cannot open shared object file: No such file or directory
libboost_thread.so.1.71.0:  cannot open shared object file: No such file or directory
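
A quick pre-flight check along these lines might catch the problem before lmplz is launched. This is only a sketch: it uses the standard-library `ctypes.util.find_library`, which reports whether some version of a shared library is visible to the loader, not specifically 1.71.0, so a pass here does not guarantee lmplz will start. The function name `missing_libraries` is hypothetical.

```python
from ctypes.util import find_library

# Library names as the dynamic loader knows them (without "lib" prefix
# and ".so" suffix); taken from the error messages above.
REQUIRED_LIBS = ("boost_program_options", "boost_thread")


def missing_libraries(names=REQUIRED_LIBS, finder=find_library):
    """Return the names for which the finder locates no shared library."""
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    missing = missing_libraries()
    if missing:
        print("Missing shared libraries:", ", ".join(missing))
    else:
        print("Boost libraries needed by lmplz appear to be installed.")
```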

CLAassistant commented 2 years ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

:white_check_mark: HarikalarKutusu
:white_check_mark: wasertech
:x: Danny Waser


Danny Waser seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

wasertech commented 2 years ago

Wat? I really don't know why GitHub even allowed me to push with this weird non-user...