Open JRMeyer opened 2 years ago
We explicitly moved away from a Python script to offer better portability, for example so that people can create LMs on device for WebThings and dynamically generate small vocabularies. We could extend generate_scorer_package to either call the KenLM binary or link against it and re-expose the API, but that is a lot of code to write and maintain today for a legacy technology.
Currently, to train a scorer you need to perform two key steps, after you have a cleaned text corpus:
1. STT/data/lm/generate_lm.py
2. generate_scorer_package
I see no scenario where you would want to perform step 1 without step 2, so it would be logical to have a single script, such as generate_scorer.py, that performs both steps.
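
To make the proposal concrete, here is a rough sketch of what such a wrapper might look like. It only chains the two existing tools as subprocesses. The flag names and the fixed KenLM settings are taken from my reading of the STT scorer docs and should be checked against your checkout. The output file names (lm.binary, vocab-<top_k>.txt, kenlm.scorer) and the helper names build_commands and generate_scorer are assumptions for illustration, not an existing API.

```python
import subprocess
import sys
from pathlib import Path

# Locations of the two existing tools; adjust for your checkout (assumed paths).
GENERATE_LM = "STT/data/lm/generate_lm.py"
GENERATE_SCORER_PACKAGE = "generate_scorer_package"


def build_commands(corpus, output_dir, alphabet, kenlm_bins, alpha, beta,
                   top_k=500000):
    """Return the two command lines (step 1, step 2) without running them."""
    out = Path(output_dir)

    # Step 1: build the KenLM language model and the top-k vocabulary.
    # Flag names and values below are assumed from the STT documentation.
    lm_cmd = [
        sys.executable, GENERATE_LM,
        "--input_txt", str(corpus),
        "--output_dir", str(out),
        "--top_k", str(top_k),
        "--kenlm_bins", str(kenlm_bins),
        "--arpa_order", "5",
        "--max_arpa_memory", "85%",
        "--arpa_prune", "0|0|1",
        "--binary_a_bits", "255",
        "--binary_q_bits", "8",
        "--binary_type", "trie",
    ]

    # Step 2: package the LM and vocabulary into a scorer.
    # Assumes step 1 wrote lm.binary and vocab-<top_k>.txt into output_dir.
    scorer_cmd = [
        GENERATE_SCORER_PACKAGE,
        "--alphabet", str(alphabet),
        "--lm", str(out / "lm.binary"),
        "--vocab", str(out / f"vocab-{top_k}.txt"),
        "--package", str(out / "kenlm.scorer"),
        "--default_alpha", str(alpha),
        "--default_beta", str(beta),
    ]
    return [lm_cmd, scorer_cmd]


def generate_scorer(*args, runner=subprocess.run, **kwargs):
    """Run step 1 then step 2, stopping at the first failing command."""
    commands = build_commands(*args, **kwargs)
    for cmd in commands:
        runner(cmd, check=True)
    return commands
```

A call would look like generate_scorer("corpus.txt", "out", "alphabet.txt", "/path/to/kenlm/build/bin", alpha, beta), with alpha and beta chosen as you would for generate_scorer_package today. Keeping command construction separate from execution means the wrapper can be dry-run or tested without KenLM installed.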