Workflow for fine-tuning language models

In offline testing, there seem to be two main problems:

1. Disfluencies and non-speech sounds are handled poorly. At best, they are simply removed; at worst, they are recognized as a word that is not present in the transcript.
2. Phrase-level and word-level timestamps are not sufficiently accurate. Fricative-initial words often have their beginning cut off. Stop-initial words are often marked as starting just before the burst or vowel, and for aspirated stops the onset is cut off in the same way as for fricative-initial words.
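
One way to put a number on the timestamp problem is to compare model word onsets against hand-corrected onsets. The following is only a minimal sketch: the function names, the `(word, start_seconds)` tuple layout, and the example values are hypothetical and are not part of fave-asr or any Wav2Vec2 API. A positive error means the model placed the onset later than the reference, i.e. the beginning of the word was cut off.

```python
from statistics import mean


def onset_errors(reference, hypothesis):
    """Return (word, signed error in seconds) for words paired by position.

    Both arguments are lists of (word, start_seconds) tuples. This layout is
    an assumption made for illustration, not a fave-asr data structure.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    errors = []
    for (ref_word, ref_start), (hyp_word, hyp_start) in zip(reference, hypothesis):
        if ref_word != hyp_word:
            # Skip mismatched words; they are recognition errors, not timing errors.
            continue
        errors.append((ref_word, hyp_start - ref_start))
    return errors


def mean_onset_error(errors):
    """Mean signed onset error in seconds, or None if there is nothing to average."""
    return mean(err for _, err in errors) if errors else None


# Hypothetical values: a fricative-initial and a stop-initial word.
reference = [("see", 0.50), ("two", 1.20)]
hypothesis = [("see", 0.58), ("two", 1.17)]
errors = onset_errors(reference, hypothesis)
```

Grouping the errors by the class of the initial sound (fricative, stop, vowel) would show whether the pattern described above holds for a given recording.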

The hypothesis is that the Wav2Vec2 models were trained on speeches or audiobook recordings, which have fewer disfluencies and less speech overlap than the conversational data seen in sociolinguistic interviews. The transcriptions and alignments of those base models are also likely less accurate than research requires. Providing a workflow for fine-tuning the model should help mitigate both problems.

Forced-Alignment-and-Vowel-Extraction / fave-asr #10