Subtitle detection - find {videoname}.{lang}.srt

stuaxo commented 10 years ago

The youtube-dl script downloads English subtitles using the following naming scheme:

{videoname}.{lang}.srt

e.g.

Bob The Builder.en.srt
Bob The Builder.avi

It might be worth using a library to find the subtitles (maybe whatever XBMC uses), so that the various types are easy to find.

cmprmsd commented 2 years ago

When downloading videos in other languages such as German (yt-dlp --sub-langs de), the subtitle files are named with a .de.vtt suffix.

videogrep does not find these files. When I rename them to .en.vtt, everything works. videogrep could search for name.*.vtt, or offer an option to specify the language. :)
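To illustrate the suggestion, here is a minimal sketch of a language-agnostic lookup. It is not videogrep's code: the function name `find_sidecar_subtitles`, the `lang` parameter, and the `SUBTITLE_EXTS` tuple are hypothetical names chosen for this example. It scans the video's folder with plain string comparisons, so special characters in file names need no escaping.

```python
import os

# Assumed set of subtitle extensions; adjust as needed.
SUBTITLE_EXTS = (".srt", ".vtt")


def find_sidecar_subtitles(video_path, lang=None):
    """Return subtitle files named {videoname}.{ext} or {videoname}.{lang}.{ext}.

    If lang is given (for example "de"), only files with that language
    tag are returned.
    """
    folder, filename = os.path.split(video_path)
    stem = os.path.splitext(filename)[0]
    matches = []
    for name in sorted(os.listdir(folder or ".")):
        root, ext = os.path.splitext(name)
        if ext.lower() not in SUBTITLE_EXTS:
            continue
        if root == stem:
            found_lang = None  # e.g. "Bob The Builder.srt"
        elif root.startswith(stem + "."):
            found_lang = root[len(stem) + 1:]  # e.g. "en" or "de"
            if "." in found_lang:
                continue  # belongs to a different, longer video name
        else:
            continue
        if lang is None or found_lang == lang:
            matches.append(os.path.join(folder, name))
    return matches
```

Calling `find_sidecar_subtitles("Bob The Builder.avi")` would return every matching subtitle file next to the video, while passing `lang="de"` narrows the result to the German one.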

antiboredom commented 2 years ago

@cmprmsd I've just pushed a test branch that should hopefully fix this. If you'd like to test it out, please reinstall videogrep like so:

pip3 install git+https://github.com/antiboredom/videogrep@sub-finder

cmprmsd commented 2 years ago

There is some problem with the ext.replace functionality:

videogrep --input source/kurzgesagt/*.mp4 --ngrams 1
Traceback (most recent call last):
  File "/usr/bin/videogrep", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.10/site-packages/videogrep/cli.py", line 119, in main
    grams = get_ngrams(args.inputfile, args.ngrams)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 91, in get_ngrams
    transcript = parse_transcript(file)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 54, in parse_transcript
    subfile = find_transcript(videoname, prefer)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 34, in find_transcript
    possible_paths = glob(os.path.splitext(videoname)[0] + ext.replace(".", ".*"))
  File "/usr/lib/python3.10/glob.py", line 24, in glob
    return list(iglob(pathname, root_dir=root_dir, dir_fd=dir_fd, recursive=recursive))
  File "/usr/lib/python3.10/glob.py", line 86, in _iglob
    for name in glob_in_dir(_join(root_dir, dirname), basename, dir_fd, dironly):
  File "/usr/lib/python3.10/glob.py", line 97, in _glob1
    return fnmatch.filter(names, pattern)
  File "/usr/lib/python3.10/fnmatch.py", line 58, in filter
    match = _compile_pattern(pat)
  File "/usr/lib/python3.10/fnmatch.py", line 52, in _compile_pattern
    return re.compile(res).match
  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.10/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.10/sre_parse.py", line 950, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.10/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 836, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.10/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range j-h at position 50

This occurs with a file from the kurzgesagt YouTube channel: "You Are Not Where You Think You Are [Pj-h6MEgE7I].en.vtt"

If I delete this and one other video, no error is thrown, but no subtitle files are found at all, not even the en.vtt files.

When I switch back to the prod videogrep it works immediately on en.vtt files.

antiboredom commented 2 years ago

oops! will take another look

antiboredom commented 2 years ago

I see - I tried to use glob patterns, but this is causing an issue when the file name has square brackets in it!
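For readers following along, a minimal sketch of the problem and one possible fix. In a glob pattern, `[Pj-h6MEgE7I]` is read as a character class, and `j-h` inside it as a reversed range. Python 3.10, as in the traceback above, raises `re.error` for that; some other Python versions may instead treat the range as empty and silently match nothing. The standard library's `glob.escape()` makes the literal part of the path safe. The helper name `find_subtitles` is hypothetical, and this is not necessarily how the sub-finder branch solved it.

```python
import glob
import os
import re
import tempfile


def find_subtitles(videoname, ext=".vtt"):
    """Return subtitle files such as name.en.vtt or name.de.vtt."""
    base = os.path.splitext(videoname)[0]
    # glob.escape() neutralises *, ? and [ in the literal part of the path,
    # so only the ".*" appended here is treated as a wildcard.
    return sorted(glob.glob(glob.escape(base) + ".*" + ext))


with tempfile.TemporaryDirectory() as tmp:
    video = os.path.join(tmp, "You Are Not Where You Think You Are [Pj-h6MEgE7I].mp4")
    subtitle = os.path.splitext(video)[0] + ".en.vtt"
    open(subtitle, "w").close()

    # Unescaped: the square brackets are parsed as a character class.
    try:
        unescaped = glob.glob(os.path.splitext(video)[0] + ".*.vtt")
        print("unescaped pattern found", len(unescaped), "file(s)")
    except re.error as exc:
        print("unescaped pattern failed:", exc)

    # Escaped: the subtitle file next to the video is found.
    print("escaped pattern found", len(find_subtitles(video)), "file(s)")
```

Because only the base name is escaped, the wildcard for the language code still works, so `.en.vtt` and `.de.vtt` files are both picked up.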

antiboredom commented 2 years ago

@cmprmsd updated the code again - can you try once more?

cmprmsd commented 2 years ago

Thanks for your quick reaction! en.vtt files now work in the test branch as well, but de.vtt files are still ignored, as you can see in the following. (kurzgesagt contains en.vtt files. When I rename the mailab file to en.vtt it works, but it should recognize de.vtt files, right? :blush:)

antiboredom commented 2 years ago

ok, maybe try one more time on the sub-finder branch!

cmprmsd commented 2 years ago

As often as it needs to be done! :) It seems to work now! Thank you!

PS: I didn't expect THIS accuracy on German transcriptions with VOSK. You did not lie when you wrote 1000% better haha :+1:

PPS: Should I request parallel transcription of videos as a new issue? I think the tool would benefit from this on stronger machines with many cores when you transcribe whole libraries. Pretty sure it's single threaded per video, but there should be no problem with multiple videos.

Edit: Went the unix way:

ls *.mp4 | parallel -j 16 videogrep --input {} --transcribe --model ../vosk-model-small-de-0.15/
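For anyone without GNU parallel, a rough Python equivalent of the same idea is sketched below. It only wraps the videogrep command line quoted above (`--input`, `--transcribe`, `--model`) and launches one process per video. The names `build_command` and `transcribe_all`, and the `run` parameter, are hypothetical and not part of videogrep.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(video, model):
    """Build the videogrep call for one video (flags as in the command above)."""
    return ["videogrep", "--input", video, "--transcribe", "--model", model]


def transcribe_all(videos, model, jobs=4, run=subprocess.run):
    """Run one videogrep process per video, at most `jobs` at a time.

    Threads are sufficient here because the heavy work happens in the
    child processes; each thread only waits for its process to finish.
    """
    commands = [build_command(video, model) for video in videos]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, commands))
```

A call such as `transcribe_all(sorted(glob.glob("*.mp4")), "../vosk-model-small-de-0.15/", jobs=16)` (with `import glob`) would mirror the shell one-liner. The `run` parameter exists only so the launcher can be swapped out, for example for a dry run that prints the commands.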

antiboredom commented 2 years ago

@cmprmsd yes let's make a new issue for parallel transcription! closing this now as I've merged sub-finder into master...

antiboredom / videogrep

Subtitle detection - find {videoname}.{lang}.srt #20