TejasQ / gen-subs

Generate subtitles for your videos with secure, on-device machine learning models.

The generated subtitles are incomplete #12

Open iceberg53 opened 7 months ago

iceberg53 commented 7 months ago

I noticed while using gen-subs that the generated subtitles do not cover the entire video.

For instance, the video referenced in issue #4 (https://github.com/TejasQ/gen-subs/issues/4) suffers from the same problem. It seems to occur only in the last part of a video.

In the video below from the issue I mentioned earlier, the subtitles stop appearing about 10s before the end.

https://private-user-images.githubusercontent.com/725120/284995522-27e4ef6d-6cf2-400f-8e0f-f0710e2534b4.mp4?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDc3MDA2OTAsIm5iZiI6MTcwNzcwMDM5MCwicGF0aCI6Ii83MjUxMjAvMjg0OTk1NTIyLTI3ZTRlZjZkLTZjZjItNDAwZi04ZTBmLWYwNzEwZTI1MzRiNC5tcDQ_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDIxMlQwMTEzMTBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xY2I0YmVjMzlhYjAxZjliZDFjM2ZjZDE4YTE0ODg1Zjk2Y2IxMGZjNTU4ZTM0ODYzYjk1NzA3YzM0NTNiMTYwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.9Sm7WdT62MK5u0puC_z3nxUsaMEM82ZEMzCLtSyjvx0

ivanov84 commented 3 weeks ago

Modify the file createTextFromAudioFile.ts.

Change this code:

for await (const data of wfReadable) {
        bytesRead += data.length;
        updateProgressBar(bytesRead);
        const endOfSpeech = recognizer.acceptWaveform(data);
        if (endOfSpeech) {
          const result = recognizer.result();
          results.push(result);
        }
}

to this:

for await (const data of wfReadable) {

        bytesRead += data.length;
        updateProgressBar(bytesRead);
        const endOfSpeech = recognizer.acceptWaveform(data);
        if (endOfSpeech) {
            const result = recognizer.result();
            results.push(result);
        }
        else {
            const partialResult = recognizer.partialResult();
            results.push(partialResult);
        }
}

// finalResult() takes no arguments; it flushes the words still buffered after the last chunk.
const finalResult = recognizer.finalResult();
results.push(finalResult);

iceberg53 commented 3 weeks ago

Thanks a lot. I already came up with a fix in this commit but I'm really happy to have your take on that issue.
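
For anyone reading along, here is a minimal sketch of why the tail of a video goes missing without a final flush. FakeRecognizer is a hypothetical stand-in, not the vosk API: it only mimics a recognizer that returns a result when it detects a pause (marked with "|" here) and buffers everything else.

```typescript
// Hypothetical stand-in for a streaming recognizer (NOT the vosk API).
// A chunk ending in "|" simulates a detected pause in speech.
class FakeRecognizer {
  private buffer: string[] = [];
  private last: string[] = [];

  // Buffers the words of a chunk; returns true when the chunk ends with a pause.
  acceptWaveform(chunk: string): boolean {
    const endOfSpeech = chunk.endsWith("|");
    const words = chunk.replace("|", "").split(" ").filter(Boolean);
    this.buffer.push(...words);
    if (endOfSpeech) {
      this.last = this.buffer;
      this.buffer = [];
    }
    return endOfSpeech;
  }

  // Words of the most recently completed utterance.
  result(): string[] {
    return this.last;
  }

  // Flushes whatever is still buffered when the stream ends.
  finalResult(): string[] {
    const rest = this.buffer;
    this.buffer = [];
    return rest;
  }
}

function transcribe(chunks: string[], flushAtEnd: boolean): string[] {
  const recognizer = new FakeRecognizer();
  const words: string[] = [];
  for (const chunk of chunks) {
    if (recognizer.acceptWaveform(chunk)) {
      words.push(...recognizer.result());
    }
  }
  if (flushAtEnd) {
    // Without this call, everything after the last pause is lost.
    words.push(...recognizer.finalResult());
  }
  return words;
}

const chunks = ["hello world|", "this is", "the ending"];
const withoutFlush = transcribe(chunks, false);
const withFlush = transcribe(chunks, true);
console.log(withoutFlush.join(" "));
console.log(withFlush.join(" "));
```

If the audio does not end on a detected pause, the last words stay in the buffer, which matches the symptom of subtitles stopping a few seconds before the end.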

ivanov84 commented 3 weeks ago

@iceberg53 you are welcome 🙏

Recently veed-io removed automatic subtitle generation from its free plan, so I started using your product. I have WAV files that I generate with elevenlabs-io, but I have a problem: the model does not reproduce the words of the original text exactly. Do you know of a way to train it?

I added a new parameter:

    //IVANOV FIX
    .option(
        '-g --origin [text]',
        "Origin text",
    )

And added my own correction logic:

    // IVANOV FIX:
    if (originText) {

        /*console.log(`------------------------------`);
        console.log(`IVANOV origin text: [${originText}]`);
        console.log(`------------------------------`);
        console.log(`IVANOV RAW results FULL: [${JSON.stringify(results)}]`);
        console.log(`------------------------------`);*/

        const ivanovResult: WordResult[] = [];

        results.forEach(({ result: words }) => {
            if (!words) return;
            words.forEach((word: any) => {
                ivanovResult.push(word);
            });
        });

        //console.log(`IVANOV FIXED results FULL: [${JSON.stringify(ivanovResult)}]`);
        //console.log(`------------------------------`);

        let count = 0;
        let startIndex = 0;
        let cueString = '';
        let endOfOriginText = originText;
        let stopReplacing = false;

        const MAX_CUE_STRING_LENGTH = 20;
        const ivanovResultLength = ivanovResult.length;

        for (let i = 0; i < ivanovResultLength; i++) {

            count++;

            //console.log(`IVANOV index: [${i}]`);
            //console.log(`IVANOV startIndex: [${startIndex}]`);
            //console.log(`IVANOV count: [${count}]`);

            const wordContent = ivanovResult[i];
            //console.log(`IVANOV wordContent${i}: ${JSON.stringify(wordContent)}`);

            const firstWord = endOfOriginText.replace(/ .*/,'');
            const decodedWord = wordContent.word;
            //console.log(`IVANOV firstWord: [${firstWord}]`);
            //console.log(`IVANOV decodedWord: [${decodedWord}]`);

            const firstWordCleaned = firstWord?.toLowerCase().replace(/'/g, '');
            const decodedWordCleaned = decodedWord?.toLowerCase().replace(/'/g, '');
            console.log(`IVANOV CLEANED firstWord: [${firstWordCleaned}]`);
            console.log(`IVANOV CLEANED decodedWord: [${decodedWordCleaned}]`);

            if (!stopReplacing && firstWordCleaned.includes(decodedWordCleaned)) {
                wordContent.word = firstWord;
                endOfOriginText = endOfOriginText.replace(firstWord + ' ','');
                //console.log(`IVANOV STANDARD endOfOriginText: [${endOfOriginText}]`);
            }
            else {
                stopReplacing = true;
            }

            const currentWord = wordContent.word;
            cueString += currentWord + ' ';

            let wordsLengthLimit = false;
            const nextWordIndex = i + 1;
            if (nextWordIndex < ivanovResultLength) {
                const nextWordContent = ivanovResult[nextWordIndex];
                const nextWord = nextWordContent?.word || '';
                wordsLengthLimit = (cueString + nextWord).length > MAX_CUE_STRING_LENGTH;
            }

            const lastChar = currentWord.slice(-1);

            //console.log(`IVANOV lastChar: [${lastChar}]`);
            //console.log(`IVANOV cueString: [${cueString}]`);
            //console.log(`IVANOV cueString length: [${cueString.length}]`);

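            // Also close the cue on the last word; otherwise trailing words that never hit a limit are dropped.
            const isLastWord = i === ivanovResultLength - 1;
            if (lastChar === '.' || lastChar === '?' || wordsLengthLimit || isLastWord || (count >= WORDS_PER_LINE && currentWord !== 'a' && currentWord !== 'to')) {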

                const end = Math.min(startIndex + count - 1, ivanovResult.length - 1);
                const cue = createCueFromWords(ivanovResult, startIndex, end);
                subtitles.push(cue);

                console.log(`IVANOV pushed cue: [${JSON.stringify(cue)}]`);

                startIndex = i + 1;
                count = 0;
                cueString = '';
            }

            console.log(`------------------------------`);
        };
    }
    else {
        results.forEach(({ result: words }) => {
            if (!words) return;
            for (let start = 0; start < words.length; start += WORDS_PER_LINE) {
                const end = Math.min(start + WORDS_PER_LINE - 1, words.length - 1);
                const cue = createCueFromWords(words, start, end);
                subtitles.push(cue);
            }
        });
    }
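
The cue-splitting rule in the loop above (sentence end, character limit, word count, never breaking right after "a" or "to") can be sketched as a standalone function. This is a minimal sketch, not the gen-subs implementation: the Word interface and the groupIntoCues name are hypothetical, and closing the final cue when the words run out is an assumption about the intended behaviour.

```typescript
// Hypothetical word shape; only the text matters for grouping.
interface Word {
  word: string;
}

// Returns inclusive [start, end] index pairs, one pair per cue.
function groupIntoCues(
  words: Word[],
  wordsPerLine: number,
  maxChars: number
): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  let start = 0;
  let text = "";
  for (let i = 0; i < words.length; i++) {
    const current = words[i].word;
    text += current + " ";
    const count = i - start + 1;
    const isLast = i === words.length - 1;
    const next = isLast ? "" : words[i + 1].word;
    const endsSentence = current.endsWith(".") || current.endsWith("?");
    // Break early if appending the next word would exceed the character limit.
    const tooLong = !isLast && (text + next).length > maxChars;
    // Do not end a cue on "a" or "to", even if the word count is reached.
    const enoughWords =
      count >= wordsPerLine && current !== "a" && current !== "to";
    if (isLast || endsSentence || tooLong || enoughWords) {
      ranges.push([start, i]);
      start = i + 1;
      text = "";
    }
  }
  return ranges;
}

const sample = "We went to the store. It was a long way home"
  .split(" ")
  .map((word) => ({ word }));
const cueRanges = groupIntoCues(sample, 3, 20);
console.log(JSON.stringify(cueRanges));
```

Each returned pair could then be passed to something like createCueFromWords(words, start, end) to build the actual cue.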

For example, the model makes mistakes like this: instead of "its consequences" it produces "its consequences says". Do you know how to fix it?

iceberg53 commented 3 weeks ago

I don't know how to get the model to correct that. The models used in gen-subs vary in accuracy, and the most accurate ones require more computing resources. If your hardware is powerful enough, you should try a better model by downloading it with gen-subs. Otherwise, a workaround would be to generate the subtitles first and then edit them to remove the inaccuracies. There is also an alternative to Vosk, the engine behind gen-subs: Whisper, a more popular open-source project from OpenAI, and one of its implementations, whisper.cpp, may give you better results. Maybe you should give it a try and see what you get. My expertise in AI and related fields is limited, so there may well be other ways to do it.

ivanov84 commented 3 weeks ago

Thank you very much! But I had already downloaded the full model. Generating and then hand-editing subtitles before publishing is not an option, because I have a lot of videos that need subtitles. Whisper and the others are good, but I need an interface like your product's, so I'll accept less precision for now 🙏

iceberg53 commented 3 weeks ago

Ok, that's fine. In the end, what matters is that people understand the content of your videos.

ivanov84 commented 3 weeks ago

Typically the model makes one of 3 mistakes: either an incorrect word, or 1 extra word, or 1 word is missing. If there are more than 2 mistakes in a row, then my correction will not help. But for now it works for me and I’m completely satisfied.
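
The three cases (wrong word, extra word, missing word) can be illustrated with a small one-word-lookahead alignment. This is a minimal sketch under stated assumptions, not the code that follows: alignToScript and norm are hypothetical names, it works on plain strings and ignores word timestamps, and it compares normalised words for equality rather than with includes().

```typescript
// Normalise a word for comparison: lower-case, keep only letters and digits.
function norm(word: string): string {
  return word.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// True when both words exist and are equal after normalisation.
function same(a?: string, b?: string): boolean {
  return a !== undefined && b !== undefined && norm(a) === norm(b);
}

// Aligns decoded words against the original script, looking one word ahead.
// Stops correcting as soon as none of the single-mistake cases applies.
function alignToScript(decoded: string[], script: string[]): string[] {
  const out: string[] = [];
  let d = 0; // index into decoded
  let s = 0; // index into script
  while (d < decoded.length && s < script.length) {
    if (same(decoded[d], script[s])) {
      out.push(script[s]); // match: take the script's spelling and punctuation
      d++;
      s++;
    } else if (same(decoded[d + 1], script[s])) {
      d++; // extra decoded word: drop it
    } else if (same(decoded[d + 1], script[s + 1])) {
      out.push(script[s]); // wrong word: substitute the script word
      d++;
      s++;
    } else if (same(decoded[d], script[s + 1])) {
      out.push(script[s]); // missing word: insert it from the script
      s++;
    } else {
      break; // two or more mistakes in a row: give up
    }
  }
  // Whatever was not aligned is kept exactly as decoded.
  return out.concat(decoded.slice(d));
}

const script = ["Its", "consequences", "are", "severe."];
const fixedExtra = alignToScript(["its", "consequences", "says", "are", "severe"], script);
const fixedWrong = alignToScript(["its", "consequences", "our", "severe"], script);
const fixedMissing = alignToScript(["its", "are", "severe"], script);
console.log(fixedExtra.join(" "));
console.log(fixedWrong.join(" "));
console.log(fixedMissing.join(" "));
```

A real version would have to carry each word's start and end times along, for example by merging the time span of a dropped word into its neighbour.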

My code, if anyone is interested:

import { stringifySync } from "subtitle";
import { createCueFromWords } from "./createCueFromWords";
import { RecognitionResults, WordResult } from "vosk";

export async function createSrtFromRecognitionResults(results: RecognitionResults[], originText?: string) {

  //const WORDS_PER_LINE = 7;

  // IVANOV FIX:
  const WORDS_PER_LINE = 3;

  const subtitles: SubtitleCue[] = [];

  if (!results.length) {
    throw new Error("No words identified to create subtitles from.");
  }

  /*results.forEach(({ result: words }) => {
    if (!words) return;
    for (let start = 0; start < words.length; start += WORDS_PER_LINE) {
      const end = Math.min(start + WORDS_PER_LINE - 1, words.length - 1);
      const cue = createCueFromWords(words, start, end);
      subtitles.push(cue);
    }
  });*/

    // IVANOV FIX:
    if (originText) {

        /*console.log(`------------------------------`);
        console.log(`IVANOV origin text: [${originText}]`);
        console.log(`------------------------------`);
        console.log(`IVANOV RAW results FULL: [${JSON.stringify(results)}]`);
        console.log(`------------------------------`);*/

        const ivanovResult: WordResult[] = [];

        results.forEach(({ result: words }) => {
            if (!words) return;
            words.forEach((word: any) => {
                ivanovResult.push(word);
            });
        });

        //console.log(`IVANOV FIXED results FULL: [${JSON.stringify(ivanovResult)}]`);
        //console.log(`------------------------------`);

        let count = 0;
        let startIndex = 0;
        let cueString = '';
        let endOfOriginText = originText;
        let stopReplacing = false;

        const MAX_CUE_STRING_LENGTH = 20;
        const ivanovResultLength = ivanovResult.length;

        for (let i = 0; i < ivanovResultLength; i++) {

            count++;

            //console.log(`IVANOV index: [${i}]`);
            //console.log(`IVANOV startIndex: [${startIndex}]`);
            //console.log(`IVANOV count: [${count}]`);

            const wordContent = ivanovResult[i];
            //console.log(`IVANOV wordContent${i}: ${JSON.stringify(wordContent)}`);

            const firstWord = endOfOriginText.replace(/ .*/,'');
            const decodedWord = wordContent.word;
            //console.log(`IVANOV firstWord: [${firstWord}]`);
            //console.log(`IVANOV decodedWord: [${decodedWord}]`);

            const firstWordCleaned = firstWord?.toLowerCase().replace(/'/g, '');
            const decodedWordCleaned = decodedWord?.toLowerCase().replace(/'/g, '');
            console.log(`IVANOV CLEANED firstWord: [${firstWordCleaned}]`);
            console.log(`IVANOV CLEANED decodedWord: [${decodedWordCleaned}]`);

            //console.log(`IVANOV STOP replacing: [${stopReplacing}]`);
            //console.log(`IVANOV INCLUDES: [${firstWordCleaned.includes(decodedWordCleaned)}]`);

            if (!stopReplacing) {

                if (firstWordCleaned.includes(decodedWordCleaned)) {
                    wordContent.word = firstWord;
                    //console.log(`IVANOV STANDARD endOfOriginText: [${endOfOriginText}]`);
                    endOfOriginText = endOfOriginText.replace(firstWord + ' ','');
                    console.log(`IVANOV STANDARD MATCH`);
                }
                else {

                    const nextWordIndex = i + 1;
                    const tempEndOfOriginText = endOfOriginText.replace(firstWord + ' ','');
                    const nextFirstWord = tempEndOfOriginText.replace(/ .*/,'');
                    const nextFirstWordCleaned = nextFirstWord?.toLowerCase().replace(/'/g, '');

                    if (nextWordIndex < ivanovResultLength && nextFirstWordCleaned) {

                        const nextDecodedWordContent = ivanovResult[nextWordIndex];
                        const nextDecodedWord = nextDecodedWordContent?.word || '';
                        const nextDecodedWordCleaned = nextDecodedWord?.toLowerCase().replace(/'/g, '');

                        console.log(`IVANOV CLEANED nextFirstWordCleaned: [${nextFirstWordCleaned}]`);
                        console.log(`IVANOV CLEANED nextDecodedWord: [${nextDecodedWord}]`);

                        if (firstWordCleaned.includes(nextDecodedWordCleaned)) {
                            wordContent.word = '';
                            console.log(`IVANOV FIXED MODEL RESULT 111 [current decoded word is a superfluous: next first = current decoded]`);
                        }
                        else if (nextFirstWordCleaned.includes(nextDecodedWordCleaned)) {
                            wordContent.word = firstWord;
                            endOfOriginText = endOfOriginText.replace(firstWord + ' ','');
                            console.log(`IVANOV FIXED MODEL RESULT 222 [current decoded word is false: next first = next decoded]`);
                        }
                        else if (nextFirstWordCleaned.includes(decodedWordCleaned)) {
                            wordContent.word = firstWord + ' ' + nextFirstWord;
                            endOfOriginText = endOfOriginText.replace(firstWord + ' ','');
                            endOfOriginText = endOfOriginText.replace(nextFirstWord + ' ','');
                            console.log(`IVANOV FIXED MODEL RESULT 333 [current first word is missed: next decoded = current first]`);
                        }
                        else {
                            stopReplacing = true;
                            console.log(`IVANOV FIXED MODEL RESULT 444 [stopReplacing after bad correction]`);
                        }
                    }
                    else {
                        console.log(`IVANOV FIXED MODEL RESULT 555 [stopReplacing no next first word]`);
                        stopReplacing = true;
                    }
                }
            }

            const currentWord = wordContent.word;
            cueString += currentWord + ' ';

            let wordsLengthLimit = false;
            const nextWordIndex = i + 1;
            if (nextWordIndex < ivanovResultLength) {
                const nextDecodedWordContent = ivanovResult[nextWordIndex];
                const nextDecodedWord = nextDecodedWordContent?.word || '';
                wordsLengthLimit = (cueString + nextDecodedWord).length > MAX_CUE_STRING_LENGTH;
            }

            const lastChar = currentWord.slice(-1);

            //console.log(`IVANOV lastChar: [${lastChar}]`);
            //console.log(`IVANOV cueString: [${cueString}]`);
            //console.log(`IVANOV cueString length: [${cueString.length}]`);

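            // Also close the cue on the last word; otherwise trailing words that never hit a limit are dropped.
            const isLastWord = i === ivanovResultLength - 1;
            if (lastChar === '.' || lastChar === '?' || wordsLengthLimit || isLastWord || (count >= WORDS_PER_LINE && currentWord !== 'a' && currentWord !== 'to')) {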

                const end = Math.min(startIndex + count - 1, ivanovResult.length - 1);
                const cue = createCueFromWords(ivanovResult, startIndex, end);
                subtitles.push(cue);

                console.log(`IVANOV pushed cue: [${JSON.stringify(cue)}]`);

                startIndex = i + 1;
                count = 0;
                cueString = '';
            }

            console.log(`------------------------------`);
        };
    }
    else {
        results.forEach(({ result: words }) => {
            if (!words) return;
            for (let start = 0; start < words.length; start += WORDS_PER_LINE) {
                const end = Math.min(start + WORDS_PER_LINE - 1, words.length - 1);
                const cue = createCueFromWords(words, start, end);
                subtitles.push(cue);
            }
        });
    }

    return stringifySync(subtitles, { format: "SRT" });
}