coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Bug: --alphabet required with --force_bytes_output_mode off but not accepted as a CLI option #2327

Closed poohsen closed 1 year ago

poohsen commented 1 year ago

Describe the bug: Running ./generate_scorer_package --alphabet results in this error:

terminate called after throwing an instance of 'boost::wrapexcept<boost::program_options::unknown_option>'
  what():  unrecognised option '--alphabet'
Aborted

Yet running it with --force_bytes_output_mode off produces a message saying that this very flag is required:

Doesn't look like a character based (Bytes Are All You Need) model.
No --alphabet file specified, not using bytes output mode, can't continue.

To reproduce:

  1. grab the 1.4.0 release of the native client (mine was this)
  2. unzip and run ./generate_scorer_package --alphabet
  3. See error

Expected behavior: As with older versions, the --alphabet option should be accepted and used.

Additional context: generate_scorer_package --help doesn't mention the --alphabet option either, even though generate_scorer_package.cpp does list it as an option in its main function.

HarikalarKutusu commented 1 year ago

You are probably following an older example here. The --alphabet flag in generate_scorer_package.py has been replaced by the --checkpoint flag. It does not actually rely on checkpoint data; the checkpoint directory contains the alphabet file, and that is what it uses.

Please see here: https://stt.readthedocs.io/en/latest/playbook/SCORER.html
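
As a minimal sketch of the switch described above, the helper below checks that a checkpoint directory holds an alphabet file and then assembles a generate_scorer_package command line with --checkpoint instead of --alphabet. It uses only flags that appear in this thread. The helper name build_scorer_command is hypothetical (it is not part of 🐸STT), and the alphabet.txt file name is an assumption based on the models/alphabet.txt example further down.

```python
from pathlib import Path


def build_scorer_command(checkpoint_dir, lm, vocab, package,
                         default_alpha=0.0, default_beta=0.0,
                         binary="./generate_scorer_package"):
    """Return an argv list for generate_scorer_package (hypothetical helper)."""
    checkpoint = Path(checkpoint_dir)
    # Assumption: the alphabet lives in the checkpoint directory as alphabet.txt.
    alphabet = checkpoint / "alphabet.txt"
    if not alphabet.is_file():
        raise FileNotFoundError(
            f"No alphabet file found at {alphabet}; "
            "did you pass the right checkpoint directory?"
        )
    # --checkpoint takes the place of the old --alphabet flag.
    return [
        binary,
        "--checkpoint", str(checkpoint),
        "--lm", str(lm),
        "--vocab", str(vocab),
        "--package", str(package),
        "--default_alpha", str(default_alpha),
        "--default_beta", str(default_beta),
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True). Checking for the alphabet file up front fails early with a clearer message than the binary's own output in 1.4.0.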

wasertech commented 1 year ago

Closing, as it's not an issue; @HarikalarKutusu pointed out the error in the OP's command flow.

poohsen commented 1 year ago

Hi, sorry for the late reply. Switching to the --checkpoint flag indeed helped me out. (I had been ignoring that option because I didn't have any checkpoint files, and since the language model itself is passed separately, it felt like it didn't apply.)

So there is indeed no bug. Note, however, that the error message you get when using --force_bytes_output_mode off without passing the checkpoint option is not very helpful:

No --alphabet file specified, not using bytes output mode, can't continue.

How about "No alphabet file found and bytes output mode is off, can't continue. Did you pass a checkpoint directory?"

wasertech commented 1 year ago

... the error message you get when using --force_bytes_output_mode off without passing the checkpoint option is not very helpful ...

I've updated the error message like so:

generate_scorer_package --lm /mnt/lm/lm.binary --vocab /mnt/lm/vocab-500000.txt --package /mnt/lm/kenlm.scorer --default_alpha 0 --default_beta 0
500000 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
No --checkpoint path specified, not using bytes output mode, can't continue.
Checkpoint path must contain an alphabet.
Start by creating an alphabet for your models using coqui_stt_training.util.check_characters if needed.

    python -m coqui_stt_training.util.check_characters \
                                --csv-files ... \
                                --alphabet-format | grep -v '^#' | sort -n > models/alphabet.txt

This will create an alphabet models/alphabet.txt.
Now rerun this script by giving models/ as the checkpoint path.

    generate_scorer_package  \
                --checkpoint models/ \
                ...

It's already on main but won't be introduced into the stable code base before version 1.5.0.

If you want this patch early, you'll need to build generate_scorer_package manually, since we pull the pre-built binary from the latest release:

https://github.com/coqui-ai/STT/blob/15bef2788a0607c78b41d14fb00f5eb5b9cd55d7/Dockerfile.train#L84-L86

Check out the docs on building binaries, or this comment I made under my logs for #2330, which introduced the reprog.
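
To illustrate what the alphabet step in the updated message above produces, here is a rough Python sketch that collects the unique characters from the transcripts in one or more CSV files and writes them one per line. This is not the implementation of coqui_stt_training.util.check_characters. The function names and the transcript column name are assumptions, and the real tool may order or escape characters differently (for example, lines starting with # are treated as comments in the pipeline shown above).

```python
import csv


def collect_alphabet(csv_paths, column="transcript"):
    """Return the sorted unique characters found in the given CSV column."""
    chars = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            # Assumption: each CSV has a header row naming the transcript column.
            for row in csv.DictReader(handle):
                chars.update(row[column])
    return sorted(chars)


def write_alphabet(chars, out_path):
    """Write one character per line, as in a plain alphabet file."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for ch in chars:
            handle.write(ch + "\n")
```

Writing the result to models/alphabet.txt and passing models/ as the checkpoint path would then match the flow the message describes.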