Open zh-plus opened 1 year ago
Hi @zh-plus, this function is used automatically when you set the `word_timestamps` parameter: an alignment is performed before the words are extracted.
On my end, I found that enabling `word_timestamps` by default gives noticeably higher-quality and more precise output. You can always filter out the words at the end if you don't need them.
Btw, you have to go through the original `transcribe` function, which might not be what you are looking for.
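For reference, here is a minimal sketch of the suggestion above: call `transcribe` with `word_timestamps=True`, then flatten and filter the words afterwards. The helper names `collect_words` and `transcribe_words`, the `min_probability` threshold, and the `"base"` model size are illustrative assumptions, not part of faster-whisper.

```python
def collect_words(segments, min_probability=0.0):
    """Flatten segments into word dicts, dropping words below min_probability."""
    words = []
    for segment in segments:
        # segment.words is None when word_timestamps was not enabled.
        for word in segment.words or []:
            if word.probability >= min_probability:
                words.append(
                    {"text": word.word.strip(), "start": word.start, "end": word.end}
                )
    return words


def transcribe_words(audio_path, model_size="base", min_probability=0.0):
    """Hypothetical wrapper: transcribe a file and return its word timings."""
    # Imported lazily so collect_words can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    # transcribe returns a lazy iterator of segments plus an info object.
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return collect_words(segments, min_probability)
```

Note that this aligns the audio against Whisper's own transcription, not against a ground-truth text, so it only partly covers the forced-alignment use case.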
Currently, I am exploring how to use `faster-whisper` for performing forced alignment between audio and ground-truth transcription texts. I found `WhisperModel.find_alignment` available for this purpose. But I got stuck at (maybe) the last step.

The main issue is how to split the ground-truth transcription into small segments aligned with the audio. Extracting the `WhisperModel.find_alignment` function is not great, but at least I can give it a try.

May I ask for your suggestions? Thank you for your fantastic project!
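One possible way to approach the splitting step, once word-level timings for the ground-truth text are available from some alignment, is to group the words back into sentences and take each sentence's first start and last end. The sketch below is not tied to `find_alignment`. It assumes one `(start, end)` pair per whitespace-separated token of the transcription, and it uses a naive punctuation rule for sentence boundaries. Both are assumptions, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment_transcript(text, word_timings):
    """Pair each sentence with its first word's start and last word's end.

    word_timings: list of (start, end) tuples, one per whitespace-separated
    token of `text`, in order (an assumption of this sketch).
    """
    tokens = text.split()
    if len(tokens) != len(word_timings):
        raise ValueError("expected one timing per whitespace-separated token")

    segments = []
    cursor = 0  # index of the first not-yet-assigned token
    for sentence in split_sentences(text):
        n = len(sentence.split())
        start = word_timings[cursor][0]
        end = word_timings[cursor + n - 1][1]
        segments.append({"text": sentence, "start": start, "end": end})
        cursor += n
    return segments
```

For languages written without spaces between words, the whitespace tokenization would have to be replaced by whatever unit the alignment produces timings for.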