echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Feature Request: use `dtw-ra` with provided wordTimeline #60

Open an-lee opened 1 week ago

an-lee commented 1 week ago

I'm extensively using the align API in my product (Enjoy App, a language learning tool).

Here is the standard procedure:

  1. The user uploads an audio file.
  2. The audio is transcribed using whisper.cpp/OpenAI/Azure to obtain the transcript.
  3. The audio is aligned with the transcript using Echogarden.

Generally, this process works well. However, if the audio contains music or other background noises, the alignments become inaccurate around those sections.

I believe using dtw-ra can resolve this issue.

With the dtw-ra option, Echogarden generates a wordTimeline before creating alignments.

https://github.com/echogarden-project/echogarden/blob/48baa2fa8598c1d405fa38b0ed064840d55a75e4/src/api/Alignment.ts#L180

In my case, the wordTimeline is already generated in Step 2.

Therefore, I hope the wordTimeline can be passed as a parameter when using dtw-ra, like this:

```ts
Echogarden.align(audio, transcript, { engine: 'dtw-ra', wordTimeline: wordTimeline })
```

I hope this clarifies my request. Thank you.

rotemdan commented 1 week ago

Thanks for the suggestion. It seems possible to allow `dtw-ra` to accept a precomputed recognition timeline. I'll look into that.

I understand that if you're already running speech recognition on the input anyway, then being able to reuse the recognition results in DTW-RA would save a significant amount of computation.

Also, `dtw-ra` supports any recognition engine, including whisper.cpp, so the recognition stage can be faster than with the built-in Whisper engine (ONNX-based). Being able to provide a precomputed recognition result would also let you apply any sort of processing to the timeline, or use a recognition method external to Echogarden (although in that case you'd need to produce the word timeline yourself).
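Producing such a word timeline from an external recognizer's output could look roughly like the sketch below. Note this is an assumption: the entry shape (`type`, `text`, `startTime`, `endTime`) is based on Echogarden's timeline format as I understand it, and the `RecognizedWord` input shape mirrors what word-level APIs like OpenAI's `verbose_json` return; check the project's actual types before relying on either.

```typescript
// Hypothetical sketch: convert an external recognizer's word list
// (e.g. whisper.cpp or OpenAI verbose_json word timestamps) into an
// Echogarden-style word timeline. Field names are assumptions, not
// the library's confirmed API.

interface RecognizedWord {
  word: string
  start: number // seconds
  end: number   // seconds
}

interface WordTimelineEntry {
  type: 'word'
  text: string
  startTime: number
  endTime: number
}

function toWordTimeline(words: RecognizedWord[]): WordTimelineEntry[] {
  return words
    .filter((w) => w.word.trim().length > 0) // drop empty/whitespace tokens
    .map((w) => ({
      type: 'word' as const,
      text: w.word.trim(), // recognizers often emit leading spaces
      startTime: w.start,
      endTime: w.end,
    }))
}

// Example input, shaped like OpenAI's word-granularity output:
const timeline = toWordTimeline([
  { word: ' Hello', start: 0.0, end: 0.42 },
  { word: ' world', start: 0.42, end: 0.85 },
])
console.log(timeline)
```

The filtering and trimming steps matter in practice, since whisper-family recognizers frequently prepend spaces to tokens and can emit empty entries around silence.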

You can also try using the whisper alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.

an-lee commented 1 week ago

Thanks for your kind reply.

> You can also try using the whisper alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.

I'll try that.

However, my product is a desktop application, and not every user has a high-performance machine capable of running whisper locally. Users with slower machines might prefer web APIs like OpenAI or Azure to generate transcripts, so using `dtw-ra` with a precomputed recognition result would be ideal for them.