Migrate to whisper-timestamped?

fave-asr currently uses WhisperX, an extension of openai-whisper, for transcription. The problems and current solution are covered in #10:

Disfluencies and non-speech sounds are handled poorly....

Phrase-level and word-level time stamps are not sufficiently accurate.... ...By providing a workflow for fine-tuning the model, the problems should hopefully be mitigated.

Another option is to try a different transcription system. The linto-ai/whisper-timestamped package claims to address these issues while also being more memory-efficient and multilingual.
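For a sense of what a migration might involve, here is a minimal sketch of pulling word-level timestamps out of the linto-ai package (imported as `whisper_timestamped`). The `load_audio`, `load_model`, and `transcribe` calls and the `segments` / `words` result layout follow that package's README as I understand it; `flatten_words` and `transcribe_words` are hypothetical helper names, not part of fave-asr or the package, and the model size and device are placeholders.

```python
from __future__ import annotations

from typing import Any


def flatten_words(result: dict[str, Any]) -> list[tuple[str, float, float]]:
    """Collect (word, start, end) tuples from a transcription result.

    Assumes the layout described in the package README:
    result["segments"][i]["words"][j] has "text", "start", and "end" keys.
    """
    words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            words.append((word["text"], word["start"], word["end"]))
    return words


def transcribe_words(audio_path: str, model_name: str = "tiny") -> list[tuple[str, float, float]]:
    """Transcribe a file and return word-level timestamps (hypothetical helper)."""
    # Imported lazily so flatten_words can be used without the package installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name, device="cpu")
    result = whisper.transcribe(model, audio)
    return flatten_words(result)
```

Keeping the flattening step separate from the transcription call means the downstream alignment code would only depend on the list of `(word, start, end)` tuples, which should make it easier to compare against the current WhisperX output before committing to a switch.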

My current belief is that completing #10 is still the best short-term option, largely because whatever system we use, the ability to fine-tune it on your own transcribed data will be important. If fine-tuning works well enough, it may push back the need for this migration. In the long term, however, I think whisper-timestamped is the better system.
