Closed stuaxo closed 2 years ago
When downloading videos in other languages like German (yt-dlp --sub-langs de), you'll see the files being named with a .de.vtt suffix.
videogrep will not find these files. When I rename them to .en.vtt, everything works.
videogrep could search for name.*.vtt, or offer an option to specify the language. :)
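A minimal sketch of what such a language option could look like. The helper name pick_subtitle and its lang parameter are hypothetical and not part of videogrep; the sketch only assumes the {name}.{lang}.vtt naming described above.

```python
import os


def pick_subtitle(candidates, lang=None):
    """Hypothetical helper: choose a subtitle file, preferring a language tag."""
    if lang is not None:
        for path in candidates:
            # "video.de.vtt" -> stem "video.de" -> tag "de"
            stem = os.path.splitext(os.path.basename(path))[0]
            tag = os.path.splitext(stem)[1].lstrip(".")
            if tag == lang:
                return path
    # No language requested, or no match: fall back to the first candidate.
    return candidates[0] if candidates else None


files = ["talk.en.vtt", "talk.de.vtt"]
print(pick_subtitle(files, lang="de"))
print(pick_subtitle(files))
```

With no lang given, the first candidate wins, so the order of the candidate list decides the default.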
@cmprmsd I've just pushed a test branch that should hopefully fix this. If you'd like to test it out, please reinstall videogrep like so:
pip3 install git+https://github.com/antiboredom/videogrep@sub-finder
There is some problem with the ext.replace functionality:
videogrep --input source/kurzgesagt/*.mp4 --ngrams 1
Traceback (most recent call last):
  File "/usr/bin/videogrep", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.10/site-packages/videogrep/cli.py", line 119, in main
    grams = get_ngrams(args.inputfile, args.ngrams)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 91, in get_ngrams
    transcript = parse_transcript(file)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 54, in parse_transcript
    subfile = find_transcript(videoname, prefer)
  File "/usr/lib/python3.10/site-packages/videogrep/videogrep.py", line 34, in find_transcript
    possible_paths = glob(os.path.splitext(videoname)[0] + ext.replace(".", ".*"))
  File "/usr/lib/python3.10/glob.py", line 24, in glob
    return list(iglob(pathname, root_dir=root_dir, dir_fd=dir_fd, recursive=recursive))
  File "/usr/lib/python3.10/glob.py", line 86, in _iglob
    for name in glob_in_dir(_join(root_dir, dirname), basename, dir_fd, dironly):
  File "/usr/lib/python3.10/glob.py", line 97, in _glob1
    return fnmatch.filter(names, pattern)
  File "/usr/lib/python3.10/fnmatch.py", line 58, in filter
    match = _compile_pattern(pat)
  File "/usr/lib/python3.10/fnmatch.py", line 52, in _compile_pattern
    return re.compile(res).match
  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.10/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.10/sre_parse.py", line 950, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.10/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 836, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.10/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.10/sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range j-h at position 50
This occurs with the kurzgesagt YouTube channel video "You Are Not Where You Think You Are [Pj-h6MEgE7I].en.vtt"
If I delete this and one other video, no error is thrown, but no subtitle files are found at all, not even the en.vtt files.
When I switch back to the prod videogrep, it works immediately on en.vtt files.
oops! will take another look
I see - I tried to use glob patterns, but this is causing an issue when the file name has square brackets in it!
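For anyone hitting the same thing: the "[Pj-h6MEgE7I]" part of the file name is read by glob as a character class, and "j-h" is not a valid range. One way around it, sketched below, is to pass the literal part of the path through glob.escape() before appending the wildcard. This is only an illustration and not necessarily what the sub-finder branch does; the helper name find_subtitle_candidates is hypothetical.

```python
import glob
import os
import tempfile


def find_subtitle_candidates(videoname, ext=".vtt"):
    """Hypothetical helper: list subtitle files that share the video's base name."""
    base = os.path.splitext(videoname)[0]
    # glob.escape() wraps "*", "?" and "[" in brackets so they match literally;
    # only the wildcard appended below is treated as a pattern.
    pattern = glob.escape(base) + ext.replace(".", ".*")
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    video = os.path.join(tmp, "You Are Not Where You Think You Are [Pj-h6MEgE7I].mp4")
    # Create an empty video file plus German and English subtitle files.
    for suffix in (".mp4", ".de.vtt", ".en.vtt"):
        open(os.path.splitext(video)[0] + suffix, "w").close()
    found = [os.path.basename(p) for p in find_subtitle_candidates(video)]
    print(found)
```

The pattern ends in ".*vtt", so it picks up name.de.vtt and name.en.vtt as well as a plain name.vtt.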
@cmprmsd updated the code again - can you try once more?
Thanks for your quick reaction!
en.vtt files now work in the test branch too, but de.vtt files are still ignored.
(kurzgesagt contains en.vtt files. When I rename the mailab file to en.vtt it works, but it should recognize de.vtt files, right?) :blush:
ok, maybe try one more time on the sub-finder branch!
As often as it needs to be done! :) It seems to work now! Thank you!
PS: I didn't expect THIS accuracy on German transcriptions with VOSK. You did not lie when you wrote 1000% better haha :+1:
PPS: Should I request parallel transcription of videos as a new issue? I think the tool would benefit from this on stronger machines with many cores when you transcribe whole libraries. Pretty sure it's single threaded per video, but there should be no problem with multiple videos.
Edit: Went the unix way.. ls *.mp4 | parallel -j 16 videogrep --input {} --transcribe --model ../vosk-model-small-de-0.15/
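The same idea can be sketched in Python for anyone without GNU parallel. It starts one videogrep process per video and limits how many run at once. The flags are the ones from the command above; the helper names build_command and transcribe_all are hypothetical, and this is a wrapper around the CLI, not a videogrep feature.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(video, model_dir):
    # Mirrors the command line quoted above; the flags are taken from this thread.
    return ["videogrep", "--input", video, "--transcribe", "--model", model_dir]


def transcribe_all(videos, model_dir, jobs=4):
    """Hypothetical helper: run one videogrep process per video, `jobs` at a time."""

    def run(video):
        # Returns the exit code of the videogrep process for this video.
        return subprocess.run(build_command(video, model_dir)).returncode

    # Threads are enough here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, videos))
```

Calling transcribe_all(sorted(glob.glob("*.mp4")), "../vosk-model-small-de-0.15/", jobs=16) would roughly correspond to the parallel -j 16 invocation, assuming videogrep is on the PATH.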
@cmprmsd yes let's make a new issue for parallel transcription! closing this now as I've merged sub-finder into master...
The youtube-dl script downloads English subtitles using the naming scheme {videoname}.{lang}.srt
It might be worth using a library to find the subtitles (maybe whatever xbmc uses), so it's easy to find various types.