common-voice / commonvoice-fr

Tooling for producing French dataset for Common Voice

Switch to Coqui STT 1.4.0 #163

Closed wasertech closed 1 year ago

wasertech commented 2 years ago

This branch implements everything needed to train French STT models on Common Voice 9.0 with Coqui STT 1.4.0.

Notes

Check out the models released from this branch: STT French v0.9.

I've added the import_cv_perso.sh importer script to download personal CV data and ease the process of fine-tuning from checkpoints. See this commit and this article on Discourse.

I've also added a custom Python script for lm_optimizer that captures the results of the optimization and saves them to disk, so we can reuse them during the testing and exporting steps.
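
To illustrate the idea of saving the optimizer results for later steps, here is a minimal sketch and not the script from this branch. It assumes the results are the two scorer weights `lm_alpha` and `lm_beta`; the JSON file name and the helper names `save_lm_params` and `load_lm_params` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_lm_params(path, lm_alpha, lm_beta):
    """Write the optimized scorer weights to a JSON file (hypothetical format)."""
    payload = {"lm_alpha": lm_alpha, "lm_beta": lm_beta}
    Path(path).write_text(json.dumps(payload))


def load_lm_params(path):
    """Read the weights back as a (lm_alpha, lm_beta) tuple of floats."""
    data = json.loads(Path(path).read_text())
    return float(data["lm_alpha"]), float(data["lm_beta"])


if __name__ == "__main__":
    # Round trip in a temporary directory; the values are placeholders,
    # not real optimization results.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "lm_params.json"
        save_lm_params(out, 0.75, 1.25)
        print(load_lm_params(out))
```

A later testing or exporting step could then read the file once and pass the two values on, instead of re-running the optimization.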

train.sh has been split into train.sh, test.sh and export.sh. See this commit.

wasertech commented 2 years ago

Managed to make this branch export a model from scratch. See the full logs here.

wasertech commented 2 years ago

I've updated stt_branch to point to the latest alpha of STT 1.4.0, so we'll have to update it again once that release is considered stable.

wasertech commented 2 years ago

STT 1.4.0 was released as stable! I've updated stt_branch accordingly. This branch, stt140-cv9, is now complete.

Full build logs and checks.

Version 10 of CV is out, so I'll probably make another branch for it (though I'll likely wait for more affordable energy before training cv-fr-10).

wasertech commented 1 year ago

This branch made the mistake of deleting commonvoice-fr/DeepSpeech/ to create commonvoice-fr/STT/. It is now obsolete thanks to #168.