It does not seem to be a file size issue.
I've run the command successfully with bigger files.
Thanks for the report. The error you are getting is happening during the conversion between a word timeline and a sentence timeline:
```ts
function wordTimelineToSegmentSentenceTimeline(...) {
  ...
  for (const sentenceEntry of sentenceTimeline) {
    const wordTimeline = sentenceEntry.timeline!

    if (wordTimeline.length == 0) {
      throw new Error("Sentence has no word entries")
    }

    sentenceEntry.startTime = wordTimeline[0].startTime
    sentenceEntry.endTime = wordTimeline[wordTimeline.length - 1].endTime
  }
  ...
}
```
Since your transcript is extracted from an HTML source, it may be that the extraction somehow produced an empty sentence when parsed (I didn't write the code for the HTML-to-text conversion, so I don't know exactly what it does). It could be that some paragraph included only non-word symbols, so no words were found there.
There is no particular reason this needs to be an error. I can just ignore empty segments / sentences like these for now.
I'll remove this error soon and we'll see where we go from there.
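For illustration, here is a minimal sketch of what that could look like in the loop above (the `continue` is my assumption of the approach; the actual fix may differ):

```ts
for (const sentenceEntry of sentenceTimeline) {
  const wordTimeline = sentenceEntry.timeline!

  // Skip sentences that produced no word entries (e.g. a paragraph
  // containing only non-word symbols) instead of throwing an error.
  if (wordTimeline.length == 0) {
    continue
  }

  sentenceEntry.startTime = wordTimeline[0].startTime
  sentenceEntry.endTime = wordTimeline[wordTimeline.length - 1].endTime
}
```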
Thank you for your response; I wasn't expecting an answer so quickly!
I will try to see what might be causing issues in my files.
Maybe I'll try converting my files to plain text in order to avoid this kind of bug ;)
Adding that check in this `for` loop seems to solve my issue. Thanks a lot for helping me!
I published a fix in `0.10.13` (now on npm). The new version ignores empty sentences and segments when converting from word timelines to segment / sentence timelines. This is something that should have been done originally, but since my test data consisted almost entirely of validly segmented sentences, I didn't realize this edge case wasn't handled correctly.
Here is the relevant diff for the change.
I noticed that with the default window duration (2 minutes), the test audio you gave me falls out of sync at some point. It starts reasonably synchronized but later consistently lags behind by a few seconds. Since the audio duration is 36 minutes, it's possible this is related to the relatively small window size.
I haven't tested it with a larger window yet. I'll try again with a larger one, though my 8GB of RAM already had a hard time handling the 10.5GB of memory required (it needed to swap to virtual memory).
Anyway, if the reason for the lag isn't the window size, it may also be due to silences, as in #23. In the future I'll try to ensure this particular test case (as well as the one in #23) works correctly.
Thanks a lot for fixing my issue! 🎉
You can find my personal project here; it splits an audiobook file into pages using your project:
https://github.com/flo62134/splitAudiobookIntoPages
It works like a charm, thanks for creating this project ;)
I'm glad to hear you're using my work!
I added a very useful feature today (published in `0.10.14`), which makes it possible to significantly reduce both the time and memory requirements of alignment operations.
For alignment, you can now select from three granularity options: `high`, `medium` and `low` (defaults to `high`).
This setting configures the properties of the MFCC frames used for analysis:
- `high`: 25ms width, 10ms hop
- `medium`: 50ms width, 25ms hop
- `low`: 100ms width, 50ms hop

I was surprised that your 36 minute audio sample did very well on `low`, and even got reasonably accurate word-level timestamps (as with other samples I tried so far).
The memory requirement using `low` granularity and a 5 minute window was only 1GB for the matrix (instead of 26GB with `high` granularity). It finished processing in 53 seconds, and the DTW part itself took only 11.5 seconds (on my very old PC; it should be at least twice as fast on most modern PCs).
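As a back-of-envelope check, these figures are consistent with a float32 cost matrix sized audio frames × window frames (that sizing is my assumption, not something confirmed from the codebase; `matrixGigabytes` below is just an illustrative helper):

```ts
// Assumption: DTW allocates a float32 cost matrix of
// (audio frames) x (window frames), at 4 bytes per float32.
const audioSeconds = 36 * 60  // 36 minute recording
const windowSeconds = 5 * 60  // --dtw.windowDuration=300

function matrixGigabytes(hopSeconds: number): number {
  const audioFrames = audioSeconds / hopSeconds
  const windowFrames = windowSeconds / hopSeconds

  return (audioFrames * windowFrames * 4) / 1e9
}

console.log(matrixGigabytes(0.010)) // high:   ~25.9 GB
console.log(matrixGigabytes(0.025)) // medium:  ~4.1 GB
console.log(matrixGigabytes(0.050)) // low:     ~1.0 GB
```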
My command line was (with a 5 minute window duration):

```
echogarden align HailMary.mp3 HailMary.html HailMary.srt --dtw.granularity=low --dtw.windowDuration=300
```
Because the audio is so long, I converted it to an MKV video so I could watch and seek within it, with subtitles, to ensure they are correctly synchronized. I was surprised by how accurate they were throughout the length of the audio. Here is the result (subtitles included):
Edit: I got some of the numbers wrong; corrected now. `high` (default) granularity with a window of 5 minutes is actually 26GB! (The 10.5GB figure was for the default 2 minute window.) Also, for comparison, with `medium` granularity the matrix size is 4.15GB.
Thanks a lot for this awesome feature @rotemdan !
Hi,
I'm facing an error with an alignment that I can't manage to solve.
This is the complete output:
I'm uploading an archive that contains the audio and the text files.
Archive.zip
Here is the command that I perform on my computer:
echogarden align "./audiobook_chapters/Project Hail Mary [B08GB66Q3R] - 03 - Chapter 1.mp3" "./ebook_files/text/part0007.html" result.srt result.json
I've already tried increasing the window duration to 200 seconds, but that does not work either.
I've tried running the same command on smaller files based on the same model, and it works. Should I increase the window even further?