echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0
163 stars, 17 forks

Feature request: Only generate alignment of segments #61

Closed: flo62134 closed this issue 1 month ago

flo62134 commented 1 month ago

Hi,

First, congratulations on this awesome package. It works flawlessly and helps me a lot with my personal project :)

When I run the following command: echogarden align "./audiobook_chapters/{audiobook_filename}" "./ebook_files/text/{ebook_filename}" "{alignment_json_filename}" --dtw.granularity=low --dtw.windowDuration=900

This creates a JSON file that is huge.
I only need the timestamps for the items of "type": "segment".
Would that be possible to do?

I tried using the parameters mode=segment and subtitles.mode=segment, but neither works.

I don't really know how this package works:
Would generating only the segment alignment save time, or would it only make the resulting JSON smaller?

rotemdan commented 1 month ago

The code internally produces all the events, down to the phoneme level, since that keeps things simple and isn't really that expensive to do. For most engines (possibly except some recognition engines), there is no significant performance advantage in limiting the events to a coarser granularity.

Even though the user may only need segment or sentence boundaries for the subtitles, the word-level timestamps are still used to locate the sentence and segment boundaries more precisely. For subtitles, they are used to set the exact start and end time of each cue (since cues are segmented dynamically based on text size and word boundaries).

Also, the playback done in the CLI uses word-level timestamps. It wouldn't be a great user experience without it.

The API returns the full, detailed timeline (up to phoneme level) as an object in memory, not a JSON file. This object is generated internally anyway, so there's not much benefit in trying to simplify it. The CLI simply serializes it to JSON. Because I try to have full parity between the API and CLI, I output a JSON file that's identical to the object the API returns.

I could add an option to produce a smaller JSON file, only on the CLI, say limited to segment or sentence granularity. The time to serialize it would be slightly reduced, but the savings likely wouldn't be significant relative to the time of the processing itself.
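
In the meantime, if only the segment entries are needed, one workaround is to post-process the timeline object (or the JSON file the CLI writes) after the fact. Here is a minimal sketch, assuming each entry has `type`, `text`, `startTime` and `endTime` fields plus an optional nested `timeline` array. Only the `"type": "segment"` field is confirmed in this thread, so the other field names are assumptions, and `pickEntries` is a hypothetical helper, not part of Echogarden:

```typescript
// Assumed shape of a timeline entry. Only `type` is confirmed by this thread;
// the remaining field names are assumptions made for illustration.
interface TimelineEntry {
  type: string;
  text: string;
  startTime: number;
  endTime: number;
  timeline?: TimelineEntry[];
}

// Hypothetical helper: recursively collect entries of one type and
// drop their nested children so the result stays small.
function pickEntries(timeline: TimelineEntry[], wantedType: string): TimelineEntry[] {
  const result: TimelineEntry[] = [];
  for (const entry of timeline) {
    if (entry.type === wantedType) {
      result.push({
        type: entry.type,
        text: entry.text,
        startTime: entry.startTime,
        endTime: entry.endTime,
      });
    } else if (entry.timeline) {
      // Not the wanted type: keep searching in the nested entries.
      result.push(...pickEntries(entry.timeline, wantedType));
    }
  }
  return result;
}

// Small made-up sample standing in for a parsed alignment file.
const sample: TimelineEntry[] = [
  {
    type: "segment", text: "Hello world.", startTime: 0, endTime: 1.2,
    timeline: [{
      type: "sentence", text: "Hello world.", startTime: 0, endTime: 1.2,
      timeline: [
        { type: "word", text: "Hello", startTime: 0, endTime: 0.5 },
        { type: "word", text: "world", startTime: 0.6, endTime: 1.2 },
      ],
    }],
  },
  {
    type: "segment", text: "Bye.", startTime: 1.5, endTime: 2.5,
    timeline: [{
      type: "sentence", text: "Bye.", startTime: 1.5, endTime: 2.5,
      timeline: [{ type: "word", text: "Bye", startTime: 1.5, endTime: 2.5 }],
    }],
  },
];

const segments = pickEntries(sample, "segment");
console.log(JSON.stringify(segments, null, 2));
```

For a real file, the sample array would be replaced by the result of `JSON.parse` on the alignment JSON. This only shrinks the output; as explained above, it does not make the alignment itself any faster.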

flo62134 commented 1 month ago

Thanks @rotemdan for this very detailed answer.
If there's no way to avoid generating the word-level alignment, then I agree that excluding those results from the JSON wouldn't save much time.

I guess you can close this issue, thanks again for your help!