Forced-Alignment-and-Vowel-Extraction / fave-asr

Interface for automated transcription and time alignment of conversational interview data
https://forced-alignment-and-vowel-extraction.github.io/fave-asr/
GNU General Public License v3.0

Migrate to whisper-timestamp? #11

Closed chrisbrickhouse closed 6 months ago

chrisbrickhouse commented 6 months ago

fave-asr currently uses WhisperX, an extension of openai-whisper, for transcription. The problems and current solution are covered in #10:

  • Disfluencies and non-speech sounds are handled poorly....
  • Phrase-level and word-level time stamps are not sufficiently accurate....

...By providing a workflow for fine-tuning the model, the problems should hopefully be mitigated.

Another option is to try a different transcription system. The linto-ai/whisper-timestamped package claims to address these issues while also making the program more memory-efficient and multilingual.
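For a sense of what we would be consuming after a migration, here is a rough sketch. It assumes the nested segments/words result layout described in the whisper-timestamped README, and that disfluencies show up as a `[*]` token when disfluency detection is enabled (my understanding, not verified against fave-asr). The sample values and the helper name `flatten_words` are made up for illustration.

```python
# Hypothetical sample shaped like a whisper-timestamped result.
# All values are invented for illustration; they are not real output.
sample_result = {
    "segments": [
        {
            "start": 0.0,
            "end": 1.4,
            "words": [
                {"text": "so", "start": 0.0, "end": 0.3, "confidence": 0.91},
                # "[*]" is assumed to mark a detected disfluency
                {"text": "[*]", "start": 0.3, "end": 0.8, "confidence": 0.0},
                {"text": "yeah", "start": 0.8, "end": 1.4, "confidence": 0.88},
            ],
        }
    ]
}


def flatten_words(result):
    """Return (text, start, end) tuples for every word in every segment."""
    return [
        (word["text"], word["start"], word["end"])
        for segment in result.get("segments", [])
        for word in segment.get("words", [])
    ]


words = flatten_words(sample_result)
```

A flat list like this would be the natural input for whatever interval-building step comes next, but how fave-asr should consume it is still open.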

My current belief is that completing #10 is still the best short-term option, largely because, whatever system we use, the ability to fine-tune it on your own transcribed data will be important. If it works well enough, it may push back the need for this migration. Long-term, however, I think whisper-timestamped is the better system.