Open an-lee opened 1 week ago
Thanks for the suggestion. It seems possible to allow `dtw-ra` to accept a precomputed recognition timeline. I'll look into that.
I understand that if you're already running speech recognition on the input anyway, being able to reuse the recognition results in DTW-RA would save a significant amount of computation.
Also, `dtw-ra` supports any recognition engine, including `whisper.cpp`, so the recognition stage can be faster than with the built-in Whisper (ONNX-based). Being able to provide a precomputed recognition result would also allow you to apply any sort of processing to the timeline, or use a recognition method that is external to Echogarden (although in that case you'll need to produce the word timeline yourself).
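To illustrate what "producing the word timeline yourself" might involve, here is a minimal sketch that converts word timestamps from an external recognizer into word-level timeline entries. The `ExternalWord` input shape (`word`, `start`, `end` in seconds) and the helper `toWordTimeline` are hypothetical, and the `WordTimelineEntry` fields are an assumption about the timeline format; check them against Echogarden's own type definitions before relying on this.

```typescript
// Hypothetical shape of one word returned by an external recognizer.
interface ExternalWord {
  word: string;
  start: number; // seconds from the start of the audio
  end: number;   // seconds from the start of the audio
}

// Assumed shape of a word-level timeline entry (not taken from the library's source).
interface WordTimelineEntry {
  type: "word";
  text: string;
  startTime: number;
  endTime: number;
}

// Convert external word timestamps into a sorted word timeline.
function toWordTimeline(words: ExternalWord[]): WordTimelineEntry[] {
  return words
    .map((w) => ({
      type: "word" as const,
      text: w.word.trim(),
      startTime: w.start,
      // Guard against entries whose end precedes their start.
      endTime: Math.max(w.start, w.end),
    }))
    // Drop entries that contain only whitespace.
    .filter((entry) => entry.text.length > 0)
    // Keep entries in chronological order.
    .sort((a, b) => a.startTime - b.startTime);
}
```

A timeline built this way could also be post-processed (for example, merging or filtering entries) before being handed to the aligner, if and when a precomputed timeline is accepted.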
You can also try using the `whisper` alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.
Thanks for your kind reply.
> You can also try using the `whisper` alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.
I'll try that.
However, my product is a desktop application. Not every user has a machine powerful enough to run Whisper locally; those with slower machines might prefer using web APIs like OpenAI or Azure to generate transcripts. So using `dtw-ra` with a precomputed recognition result would be ideal for that.
I'm using the `align` API extensively in my product (Enjoy App, a language learning tool). Here is the standard procedure:
Generally, this process works well. However, if the audio contains music or other background noises, the alignments become inaccurate around those sections.
I believe using `dtw-ra` can resolve this issue. With the `dtw-ra` option, Echogarden generates a `wordTimeline` before creating alignments:
https://github.com/echogarden-project/echogarden/blob/48baa2fa8598c1d405fa38b0ed064840d55a75e4/src/api/Alignment.ts#L180
In my case, the `wordTimeline` is already generated in Step 2. Therefore, I hope the `wordTimeline` can be passed as a parameter when using `dtw-ra`, like this:
I hope this clarifies my request. Thank you.