coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Feature request: make scorer training a single script #2158

Open JRMeyer opened 2 years ago

JRMeyer commented 2 years ago

Currently, training a scorer takes two key steps once you have a cleaned text corpus:

  1. train a KenLM model with STT/data/lm/generate_lm.py
  2. package the model for deployment with generate_scorer_package

I see no scenario where you would want to perform step 1 without step 2, so it would be logical to have a single script, such as generate_scorer.py, that performs both steps.
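To make the request concrete, here is a rough sketch of what such a wrapper could look like. It only chains the two existing tools through subprocess. The function names (build_commands, generate_scorer), the default paths, and the output file name kenlm.scorer are hypothetical. The flags and the lm.binary / vocab-<top_k>.txt output names are what I recall from the scorer documentation, so treat them as assumptions and check them against your checkout.

```python
import subprocess
from pathlib import Path


def build_commands(corpus, out_dir, kenlm_bins, alphabet,
                   default_alpha, default_beta, top_k=500000,
                   generate_lm="data/lm/generate_lm.py",
                   packager="./generate_scorer_package"):
    """Return the two command lines (step 1, step 2) without running them.

    Flag names are assumed from the documented usage of generate_lm.py and
    generate_scorer_package; the script paths are placeholders.
    """
    out = Path(out_dir)
    # Step 1: train and binarize the KenLM model.
    lm_cmd = [
        "python3", generate_lm,
        "--input_txt", str(corpus),
        "--output_dir", str(out),
        "--top_k", str(top_k),
        "--kenlm_bins", str(kenlm_bins),
        "--arpa_order", "5",
        "--max_arpa_memory", "85%",
        "--arpa_prune", "0|0|1",
        "--binary_a_bits", "255",
        "--binary_q_bits", "8",
        "--binary_type", "trie",
    ]
    # Step 2: package the LM and vocabulary into a scorer.
    # Assumes step 1 wrote lm.binary and vocab-<top_k>.txt into out_dir.
    package_cmd = [
        packager,
        "--alphabet", str(alphabet),
        "--lm", str(out / "lm.binary"),
        "--vocab", str(out / f"vocab-{top_k}.txt"),
        "--package", str(out / "kenlm.scorer"),
        "--default_alpha", str(default_alpha),
        "--default_beta", str(default_beta),
    ]
    return [lm_cmd, package_cmd]


def generate_scorer(**kwargs):
    """Run step 1, then step 2, stopping at the first failure."""
    for cmd in build_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Calling generate_scorer(corpus="corpus.txt", out_dir="out", kenlm_bins="kenlm/build/bin", alphabet="alphabet.txt", default_alpha=..., default_beta=...) would then run both steps in order, with the alpha and beta values supplied by the caller.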

reuben commented 2 years ago

We explicitly moved away from a Python script to offer better portability, allowing people to create LMs on device (for example, for WebThings) and to dynamically generate small vocabularies. We could extend generate_scorer_package to either call the KenLM binary or link against it and re-expose the API, but that's a lot of code to write and maintain today for a legacy technology.