echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

status of project/forced alignment? #1

Closed · jahamed closed this 1 year ago

jahamed commented 1 year ago

Found this project looking for forced alignment solutions between an audio file and a preexisting transcript. Looks promising!

Wondering what the status of the project is? I'm getting an error when I try to run echogarden: /.asdf/installs/nodejs/20.2.0/bin/node: bad option: --experimental-wasm-threads

rotemdan commented 1 year ago

I've released the code in late April 2023. The project is under very active development. New features and fixes are added every day.

It's just that I haven't gotten any feedback on obvious issues up until now. I actually did recently discover a number of incredibly obvious issues in several areas that went unreported, and I've fixed significantly more subtle issues than this one!

I'm currently using Node 18.0, so I literally had no idea it was broken on 20.0 :) That's just hilarious.

It seems like one of the flags, --experimental-wasm-threads, isn't supported anymore. This flag isn't even used anywhere, so this is super easy for me to fix! I'll post a fixed version soon.

I'm glad I'm finally getting some sort of feedback!

Please post more errors or abnormalities if you encounter them.

rotemdan commented 1 year ago

I installed Node 20 and got the same error about the flag. I removed it from the code, and it seems to start fine now.

I published version 0.8.4 to npm. You can update with npm update echogarden -g.

Can you tell me if things work fine now? (See if you get any other unexpected errors or behavior, even subtle ones.)

The removal of the flag shouldn't affect much. It's possible I added it because one of the WASM packages I was using required it; I'm not sure which one. Anyway, it's not essential.

jahamed commented 1 year ago

@rotemdan Will try the Node 20 version soon. I tested my use case: aligning a transcript (which I send to ElevenLabs TTS) with the audio file I get back, to generate subtitle files with punctuation. Your tool seems to be working quite well for this! I don't understand the technical details of this library compared to other existing tools 😅 but this is awesome! It's working even for some edge cases that other solutions I tried failed at. Before this, I was playing around with Gentle for forced alignment.

I see it's mostly command-line based right now; I'll try to hack together a script to generate .srt files for my audio and see if there are any more errors.
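
Something like this is what I have in mind (I'm guessing here that extra paths after the two inputs are treated as output files, with the format inferred from the extension):

    echogarden align audio.mp3 transcript.txt subtitles.srt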

I'm amazed that it carries special symbols and punctuation from my transcript into the .srt:

1
00:00:00,040 --> 00:00:02,069
This (happened) in my [sky]diving years.

I think the only feature I can ask for now is allowing a character or line limit on the subtitles. For my video generation I don't want subtitles that are too long or span multiple lines. I have some existing code to build subtitles for my previous solution based on word timings (a rough sketch follows below), but the word timings in your JSON don't include punctuation. If you have any suggestions there, that would be appreciated.
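
Roughly, the kind of logic I have looks like this (assuming the word timeline is a list of { text, startTime, endTime } entries, which is just my reading of the JSON output):

    interface WordTiming { text: string; startTime: number; endTime: number }
    interface Cue { text: string; startTime: number; endTime: number }

    // Greedily pack words into cues, starting a new cue once adding the next
    // word would exceed the character limit. (My own sketch, not echogarden code.)
    function wordsToCues(words: WordTiming[], maxChars = 42): Cue[] {
        const cues: Cue[] = []
        let current: WordTiming[] = []

        const flush = () => {
            if (current.length === 0) return

            cues.push({
                text: current.map(w => w.text).join(' '),
                startTime: current[0].startTime,
                endTime: current[current.length - 1].endTime,
            })

            current = []
        }

        for (const word of words) {
            const lineLength = current.map(w => w.text).join(' ').length

            // Adding this word (plus a space) would overflow the line, so close the cue
            if (current.length > 0 && lineLength + 1 + word.text.length > maxChars) {
                flush()
            }

            current.push(word)
        }

        flush()
        return cues
    }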

Otherwise, keep at this library! It's already really filling a void as a Node-based solution that's easier to set up than these HMM/Python solutions.

rotemdan commented 1 year ago

The alignment code is almost all written by myself. The dtw engine is almost all in pure JavaScript (except the FFT and eSpeak engine, which are WASM).
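
For context, dynamic time warping finds a minimum-cost monotonic alignment between two feature sequences. Here's a textbook sketch of the accumulated-cost computation, purely illustrative; the actual dtw engine works on audio feature frames and is considerably more involved:

    // Classic DTW accumulated-cost table over two sequences of feature vectors
    function dtwTotalCost(a: number[][], b: number[][]): number {
        const INF = Number.POSITIVE_INFINITY
        if (a.length === 0 || b.length === 0) return INF

        // Euclidean distance between two feature vectors
        const distance = (x: number[], y: number[]) =>
            Math.sqrt(x.reduce((sum, xi, i) => sum + (xi - y[i]) ** 2, 0))

        // cost[i][j] = minimal accumulated cost of aligning a[0..i] with b[0..j]
        const cost = a.map(() => new Array<number>(b.length).fill(INF))

        for (let i = 0; i < a.length; i++) {
            for (let j = 0; j < b.length; j++) {
                const d = distance(a[i], b[j])

                if (i === 0 && j === 0) {
                    cost[i][j] = d
                } else {
                    cost[i][j] = d + Math.min(
                        i > 0 ? cost[i - 1][j] : INF,              // advance in a only
                        j > 0 ? cost[i][j - 1] : INF,              // advance in b only
                        i > 0 && j > 0 ? cost[i - 1][j - 1] : INF, // advance in both
                    )
                }
            }
        }

        return cost[a.length - 1][b.length - 1]
    }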

Alignment is actually one of my favorite areas, and something I've been interested in since 2018. It's an area where I felt good, simple-to-use solutions weren't really available to end-users.

There was a lot of work and fine-tuning done to get reasonable results, including the difficult work of forking the eSpeak NG tool, exposing some internal information that was needed to enable what I was trying to achieve, and finding workarounds for various bugs.

The dtw-ra engine, which incorporates speech recognition assistance into the alignment as an intermediate step, is actually my own original idea. It is useful for aligning harder inputs, like very noisy audio and music. It actually does a good job at aligning songs to lyrics in some cases (unless the Whisper speech recognition engine itself fails to recognize properly, or goes into a loop, an issue I'm working on improving right at this moment).

The subtitle generation is supposed to be relatively polished, and does have line width limits (though no breaking of very long words, which are more common in non-English languages). There are also settings for the maximum number of lines (defaults to 2), and more subtle ones, but they are not currently exposed to the user (that's on the todo list). It uses cues like sentence starts and phrase starts to try to emulate the style in which a human would most likely edit the subtitles. I actually put a lot of work into that.

I don't include punctuation and spaces as words in the timeline, since their timing is not really significant to the speech, and I don't think most end-users would need it. I could add an option to preserve them, but the timing information for punctuation wouldn't mean anything and would most likely be inaccurate.

Also, I don't include text offsets in the timeline, because except for the case of plain text inputs, they wouldn't be that useful. To perform word highlighting or synchronization during playback, I search for the next occurrence of the word in the text (a similar approach to how Microsoft Edge highlights words in its page reader).
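
A rough sketch of that search approach (illustrative only; the names here are hypothetical):

    // Resolve each word to its next occurrence at or after a moving cursor,
    // so repeated words map to successive positions in the text rather than
    // always matching the first occurrence.
    function* wordOffsets(text: string, words: string[]) {
        let cursor = 0

        for (const word of words) {
            const index = text.indexOf(word, cursor)
            if (index === -1) continue // word doesn't appear verbatim; skip it

            yield { word, startOffset: index, endOffset: index + word.length }
            cursor = index + word.length
        }
    }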

Anyway, you can also try the synthesis and recognition features (which actually received more work than even the alignment, in total). The synthesis, in particular, uses the same algorithms to provide the synthesized speech with word- and subword-level timing, for all engines (when it isn't available from the engine itself).

Thanks for the feedback! I'll prioritize exposing the subtitle options to allow the user to customize maximum line length and count, along with other options.

rotemdan commented 1 year ago

The current default settings for subtitle generation, which are used by all operations and engines, are:

export const defaultConfig: SubtitlesConfig = {
    format: "srt",        // output subtitle format
    maxLineCount: 2,      // maximum number of lines per cue
    maxLineWidth: 42,     // maximum characters per line
    minWords: 4,          // minimum number of words in a line
    maxAddedDuration: 3,  // maximum extra duration added to extend a short cue
}

All it takes is exposing options to modify the SubtitlesConfig structure based on user settings (the code for subtitle generation is here). I may be able to get this out later today.

Aside from line count and line width, you'd also be able to specify the minimum words in a line (another aspect that allows the subtitles to appear more human), and maximum extra time to add to a cue (to ensure that even if the alignment detected a very short time for a cue, it'd still be displayed for some minimum duration, before moving to the next one, if possible).
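
Mechanically, exposing them is just a shallow merge of user-provided values over the defaults, something like this sketch (not the exact code; getSubtitlesConfig is a hypothetical name):

    // User-provided values override the defaults shown above
    function getSubtitlesConfig(userConfig: Partial<SubtitlesConfig> = {}): SubtitlesConfig {
        return { ...defaultConfig, ...userConfig }
    }

    // For example: getSubtitlesConfig({ maxLineCount: 1, maxLineWidth: 35 })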

rotemdan commented 1 year ago

I've exposed the subtitle options and published version 0.8.5.

I've tested it a little bit, looks basically fine for now.

I added this reference to the options (only in the section on alignment for now):

Subtitles

You can access these options from the CLI using something like:

echogarden align audio.mp3 transcript.txt --subtitles.maxLineCount=1 --subtitles.maxLineWidth=35

I'll add more features or fixes as I find more issues, or as they get reported. I haven't worked on the subtitle generation part for a few months now, so I'll need to refresh my memory.

Here are some relevant open tasks I have on my todo list:

Captions

jahamed commented 1 year ago

Wow, thanks for the detailed posts! Very cool; I'll have to dig into the implementation details and approach later for sure. Thanks so much for the quick turnaround on features. I already have a decent working version of my subtitle generation going, and this library is so much easier to use than the others I've tried.

Will play around with the speech generation features in the coming days as well. Thank you!