echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Error: Sentence has no word entries even though text can be found in audio #24

flo62134 closed this issue 1 year ago

flo62134 commented 1 year ago

Hi,

I'm facing an error with an alignment that I can't manage to solve.

This is the complete output:

```
Prepare for alignment.. 8338.3ms
No language specified. Detecting language.. 160.2ms
Language detected: English (en)
Get espeak voice list and select best matching voice.. 125.5ms
Selected voice: 'gmw/en-US' (en-US, American English)
Load alignment module.. 0.2ms
Create alignment reference with eSpeak.. 11393.7ms
Compute reference MFCC features.. 4791.5ms
Compute source MFCC features.. 5662.9ms
DTW cost matrix memory size (120s maximum window): 10375.9MiB
Warning: Maximum DTW window duration is set to 120.0s, which is smaller than 25% of the source audio duration of 2200.3s. This may lead to suboptimal results in some cases. Consider increasing window length if needed.
Align MFCC features using DTW.. 84616.5ms
Convert path to timeline.. 40.5ms
Error: Sentence has no word entries
```

I'm uploading an archive that contains the audio and the text files.
Archive.zip

Here is the command that I run on my computer:

```
echogarden align "./audiobook_chapters/Project Hail Mary [B08GB66Q3R] - 03 - Chapter 1.mp3" "./ebook_files/text/part0007.html" result.srt result.json
```

I've already tried increasing the window duration to 200 seconds, but that doesn't work either.

I've tried running the same command on smaller files based on the same model, and it works. Should I increase the window even further?

flo62134 commented 1 year ago

It does not seem to be a file size issue.
I've run the command successfully with bigger files.

rotemdan commented 1 year ago

Thanks for the report. The error you are getting is happening during the conversion between a word timeline and a sentence timeline:

```typescript
function wordTimelineToSegmentSentenceTimeline(...) {
    ...
    for (const sentenceEntry of sentenceTimeline) {
        const wordTimeline = sentenceEntry.timeline!

        if (wordTimeline.length == 0) {
            throw new Error("Sentence has no word entries")
        }

        sentenceEntry.startTime = wordTimeline[0].startTime
        sentenceEntry.endTime = wordTimeline[wordTimeline.length - 1].endTime
    }
    ...
}
```

Since your transcript is extracted from an HTML source, it may be that the extraction somehow produced an empty sentence when parsed (I didn't write the HTML-to-text conversion code, so I don't know exactly what it does). It could be that some paragraph contained only non-word symbols, so no words were found in it.
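To illustrate (this is a hypothetical sketch, not Echogarden's actual extraction code): a paragraph containing only punctuation or decorative symbols yields no word tokens, which would produce a sentence entry with an empty word timeline.

```typescript
// Hypothetical word extraction: match runs of letters/digits;
// everything else is treated as a non-word symbol.
function extractWords(text: string): string[] {
  return text.match(/[\p{L}\p{N}]+/gu) ?? []
}

console.log(extractWords('Hello, world!')) // ['Hello', 'world']
console.log(extractWords('* * *'))         // [] -- an "empty" sentence
```

A scene-break separator like `* * *`, common in ebooks, is exactly the kind of paragraph that would trigger this.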

There is no particular reason to fail on this. For now, I can simply ignore empty segments / sentences like these.

I'll remove this error soon and we'll see where we go from there.
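A minimal sketch of such a guard (hypothetical, not the actual patch), assuming the timeline entry shape shown in the snippet above, could look like:

```typescript
interface TimelineEntry {
  text: string
  startTime: number
  endTime: number
  timeline?: TimelineEntry[]
}

// Skip sentence entries whose word timeline is empty
// instead of throwing an error.
function assignSentenceTimes(sentenceTimeline: TimelineEntry[]): TimelineEntry[] {
  const result: TimelineEntry[] = []

  for (const sentenceEntry of sentenceTimeline) {
    const wordTimeline = sentenceEntry.timeline ?? []

    if (wordTimeline.length === 0) {
      continue // ignore empty sentences rather than failing
    }

    sentenceEntry.startTime = wordTimeline[0].startTime
    sentenceEntry.endTime = wordTimeline[wordTimeline.length - 1].endTime
    result.push(sentenceEntry)
  }

  return result
}
```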

flo62134 commented 1 year ago

Thank you for your response; I wasn't expecting an answer so quickly!
I will try to see what might be causing issues in my files.
Maybe I'll try converting my files to plain text to avoid this kind of bug ;)

flo62134 commented 1 year ago

Adding a check in this for loop seems to solve my issue. Thanks a lot for helping me!

rotemdan commented 1 year ago

I published a fix in 0.10.13 (now on npm). The new version ignores empty sentences and segments when converting from word timelines to segment / sentence timelines. This is something that should have been done originally, but since my test data consisted almost entirely of validly segmented sentences, I didn't realize this edge case wasn't handled correctly.

Here is the relevant diff for the change.

I noticed that with the default window duration (2 minutes), the test audio you gave me falls out of sync at some point: it starts reasonably synchronized but later consistently lags behind by a few seconds. Since the audio is 36 minutes long, this may be related to the relatively small window size.

I haven't tested it with a larger window yet. I'll try again with one, though with 8GB of RAM my machine already struggled with the 10.5GB the matrix requires (it had to swap to virtual memory).

Anyway, if the reason for the lag isn't the window size, it may also be due to silences, as in #23. In the future I'll try to ensure this particular test case (as well as the one in #23) works correctly.

flo62134 commented 1 year ago

Thanks a lot for fixing my issue! 🎉

You can find my personal project here, I'm splitting an audiobook file into pages using your project:
https://github.com/flo62134/splitAudiobookIntoPages

It works like a charm, thanks for creating this project ;)

rotemdan commented 1 year ago

I'm glad to hear you're using my work!

I added a very useful feature today (published in 0.10.14), which significantly reduces both the time and memory requirements of alignment operations.

For alignment, you can now choose between three granularity options: high, medium, and low (the default is high).

This setting configures the properties of the MFCC frames used for analysis.

I was surprised that your 36-minute audio sample did very well on low, and even produced reasonably accurate word-level timestamps (as with the other samples I've tried so far).

The memory requirement with low granularity and a 5-minute window was only 1GB for the matrix (instead of 26GB with high granularity). Processing finished in 53 seconds, and the DTW part itself took only 11.5 seconds (on my very old PC; it should be at least twice as fast on most modern PCs).

My command line was:

```
echogarden align HailMary.mp3 HailMary.html HailMary.srt --dtw.granularity=low --dtw.windowDuration=300
```

(5 minute window duration)

Because the audio is so long, I converted it to an MKV video so I could watch and seek within it with subtitles, to verify they are correctly synchronized. I was surprised by how accurate they were throughout the length of the audio. Here is the result (subtitles included):

HailMaryVideo.zip

Edit: I got some of the numbers wrong; corrected now. High (default) granularity with a 5-minute window actually needs 26GB! (The 10.5GB figure was for the default 2-minute window.) For comparison, with medium granularity the matrix size is 4.15GB.
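For anyone wanting to estimate these figures in advance, the matrix size can be roughly reproduced from its dimensions: source frames × window frames × bytes per cell. This is a back-of-the-envelope sketch; the 32-bit cell size and the frame rates of roughly 100/40/20 frames per second for high/medium/low granularity are my own inference from the numbers quoted in this thread, not values taken from Echogarden's source.

```typescript
// Assumed (not confirmed) MFCC frame rates per granularity level.
const FRAME_RATES_HZ: Record<string, number> = { high: 100, medium: 40, low: 20 }

// Estimate the DTW cost matrix memory footprint in bytes,
// assuming one 32-bit float per matrix cell.
function dtwMatrixBytes(audioSeconds: number, windowSeconds: number, granularity: string): number {
  const rate = FRAME_RATES_HZ[granularity]
  const sourceFrames = audioSeconds * rate
  const windowFrames = windowSeconds * rate
  return sourceFrames * windowFrames * 4
}

// ~36-minute audio (2200 s) with a 5-minute (300 s) window:
const toGB = (bytes: number) => bytes / 1e9
console.log(toGB(dtwMatrixBytes(2200, 300, 'high')).toFixed(1))   // ~26.4 GB
console.log(toGB(dtwMatrixBytes(2200, 300, 'medium')).toFixed(1)) // ~4.2 GB
console.log(toGB(dtwMatrixBytes(2200, 300, 'low')).toFixed(1))    // ~1.1 GB
```

Under these assumptions the estimates land close to the 26GB, 4.15GB, and 1GB figures above, which also shows why halving the frame rate cuts memory by roughly 4x: both matrix dimensions shrink together.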

flo62134 commented 3 months ago

Thanks a lot for this awesome feature @rotemdan !