SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

Subtitle Edit 4.0.4/4.0.5 and Purfview Faster Whisper Broken Merging Subtitle Lines with Chinese/Korean/etc. Transcription #8209

Closed TranslateFuture closed 6 months ago

TranslateFuture commented 6 months ago

Since Subtitle Edit 4.0.4, and still with the latest 4.0.5, using the Whisper auto-translate option on Chinese/Korean/etc. audio produces lines that are not split, have random capitalization, or are missing periods and commas. Sometimes this only happens at the beginning (or the end) of the audio/video, other times it happens throughout. The issue persists even if you uncheck the post-processing options.

This looks like a bug, since Subtitle Edit 4.0.3 still works fine, so something must have changed in Subtitle Edit 4.0.4.

This screenshot is from Subtitle Edit 4.0.5. The subtitle contains lines from the previous dialogue and also the dialogue after it (see the partial Korean text/subtitles in the background): https://i.imgur.com/7K9s2G1.png

niksedk commented 6 months ago

You can attach images here on GitHub (I cannot see the image...)

Also, you should probably attach sample files etc.

TranslateFuture commented 6 months ago

Hello, thank you for the quick reply. Subtitle Edit is a lifesaver! I really appreciate what you and the others are creating. Sorry, I forgot to attach the files earlier.

[Image: github translate broken korean drama] [Image: github translate broken chinese interview]

https://github.com/SubtitleEdit/subtitleedit/assets/167344091/f937b1dd-4fde-4ea2-9c62-1d2773ad860b

https://github.com/SubtitleEdit/subtitleedit/assets/167344091/6a2da608-e3f7-45d2-9e6f-a5a6c79db44f

github subtitle edit broken translation for korean chinese etc.zip

The Korean clip is from this behind the scenes video of the Korean drama: https://www.youtube.com/watch?v=Wf3SnaYTAHs

The Chinese clip is from this street interview from Asian Boss: https://www.youtube.com/watch?v=SKBZj5z3cy4

Edit: It looks like the versions with burned-in subtitles need to be posted, since the subtitles don't show up otherwise.

https://github.com/SubtitleEdit/subtitleedit/assets/167344091/47a52aa8-9f5c-4e27-88be-4738dede7820

https://github.com/SubtitleEdit/subtitleedit/assets/167344091/22de59da-1525-4f5d-8687-2787be6c4e09

The Chinese one is all mixed up or has merged lines: the subtitle for a given piece of dialogue appears too early, so the correct line is buried between the previous sentence/scene and the following sentence. I think it's because of the length of the video/audio, but with Subtitle Edit 4.0.3 and earlier versions it worked fine, even if the same or a similar issue occurred from time to time.

The Korean one also mixes up the lines in the first 3 minutes or so, but after that it's back to normal. By "mixes up" I mean that the breaks between lines, the capitalization, or the periods/commas are missing, incorrect, or misplaced.

These issues appear when the Korean or Chinese (and so on) audio is translated into English. The raw Korean/Chinese/etc. transcriptions seem fine in terms of line arrangement and punctuation, so the problem arises when the text gets translated to English.

Sadly, even if you uncheck the post-processing option or the auto adjust timings option, the bug persists: the English subtitle lines for the current dialogue in the video are still broken/merged.

Thank you so much for all the hard work and help, Subtitle Edit is really amazing!

Purfview commented 6 months ago

Probably it has nothing to do with SE. Check and post what are the versions of Faster-Whisper with the issue and without the issue.

And post the settings how to reproduce your issues.

TranslateFuture commented 6 months ago

Thank you for the reply. I'm using all the default settings of Subtitle Edit. With those long videos I just use the "Audio to text (Whisper)..." option with the large-v2/etc. model. For the newer Subtitle Edit 4.0.5/etc., the Advanced section (extra command line arguments) just has the default --standard, and that's really it, with no other settings changed at all.

I disabled the "Use post-processing (line merge, fix casing, punctuation, and more)" option (unticked merge short lines, split long lines, all of it), as that causes the issue with about 99% reproducibility. But even with it disabled, 4.0.4/4.0.5 still produce the combined sentences.

When I rolled back to Subtitle Edit 4.0.3, though, I used the latest Faster-Whisper r192.3 (https://github.com/Purfview/whisper-standalone-win/releases/tag/faster-whisper), since otherwise it won't install automatically when using the Subtitle Edit 4.0.3 files (https://github.com/SubtitleEdit/subtitleedit/releases/tag/4.0.3). Now that you mention it, it does seem to be due to the older version of Faster-Whisper in the newer/latest Subtitle Edit releases. I just checked the changelogs, and they mention Subtitle Edit being a few Faster-Whisper versions behind.

With Subtitle Edit 4.0.3 and Faster-Whisper r192.3, if I disable the "Use post-processing..." option, the subtitles don't have the mixed dialogue problem (but sadly periods/punctuation are now missing). If I enable the post-processing option, it still causes the merged subtitles.

So the behavior differs: with Subtitle Edit 4.0.4/4.0.5 the blending issue almost always occurs, even with "Use post-processing..." disabled. With Subtitle Edit 4.0.3 (both when it was first released and in this new setup with Faster-Whisper r192.3), it works reasonably well as long as the post-processing option is disabled.

It's sometimes really weird, since even much shorter videos will sometimes get the random capitalization/merged lines issues, and I'm not changing any settings at all. It's all straight out of the box with the default settings of the Subtitle Edit releases.

If you don't click the "Translate to English" toggle, then the Hangul and Chinese output seems fine as well, including punctuation, spacing, etc. So it looks like the automatic translation to English mixes up where the sentences should start or end.

I also forgot to mention that I have the "Auto adjust timings" option enabled as well, since if it's unticked it messes up all of the subtitles too.

I'm not sure what changed, since I've had the same (default) settings since before Subtitle Edit 4.0.4/4.0.5. I even uninstalled everything multiple times, and it does look like the difference is between Subtitle Edit 4.0.3 and 4.0.4.

I really appreciate what you all are doing with this translation work. Thank you for resolving problems so quickly as well!

Purfview commented 6 months ago

If it's the same Faster-Whisper r192.3 on different SE versions, then it's some SE issue.

"Use post-processing..." and "Auto adjust timings" <- these are SE-only settings; I'm personally not interested in issues that occur when those are enabled. You should post those things separately, as you are mixing various unrelated issues in single posts.

This is what I could gather from your posts in relation to Faster-Whisper:

"...but sadly there's missing periods/punctuation..." "...random capitalizations/merged lines..."

This looks like usual Whisper behaviour. Maybe it could be improved by enabling the custom prompt presets; at the moment those don't work when the task is to translate.

Post the settings used and the srt files showing those issues. ["Use post-processing..." and "Auto adjust timings" must be disabled]
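When preparing srt files for a report like this, it can help to point at the specific cues that look merged or unpunctuated. Below is a minimal Python sketch of such a check. It is not part of Subtitle Edit or Faster-Whisper; the function names `parse_srt` and `flag_cues`, the 86-character threshold, and the set of "terminal" characters are all assumptions chosen for illustration.

```python
import re

# Hypothetical values for illustration only; not taken from SE or Faster-Whisper.
MAX_CHARS = 86                 # assumed upper bound for a normal two-line cue
TERMINAL = ".?!…\"'”)]♪"       # characters accepted as a valid cue ending


def parse_srt(text):
    """Parse SRT text into a list of (index, timing, body) tuples."""
    text = text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        # A valid cue has an index line, a timing line, and at least one text line.
        if len(lines) >= 3 and "-->" in lines[1]:
            body = " ".join(line.strip() for line in lines[2:])
            cues.append((int(lines[0]), lines[1].strip(), body))
    return cues


def flag_cues(cues, max_chars=MAX_CHARS):
    """Return {cue index: [reasons]} for cues that look merged or unpunctuated."""
    flagged = {}
    for index, _timing, body in cues:
        reasons = []
        if len(body) > max_chars:
            reasons.append("too long")
        if body and body[-1] not in TERMINAL:
            reasons.append("no terminal punctuation")
        if reasons:
            flagged[index] = reasons
    return flagged


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,000\nHello there.\n\n"
        "2\n00:00:03,500 --> 00:00:06,000\n"
        "so we went to the store and then\nwe came back\n"
    )
    for index, reasons in flag_cues(parse_srt(sample)).items():
        print(index, ", ".join(reasons))
```

Running the same check on the output of two SE versions (with both SE-only options disabled) would give a rough count of affected cues to include alongside the attached files. The heuristics are crude and will flag some legitimate cues, so treat the result as a pointer for manual review.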

niksedk commented 6 months ago

You could also do the translation in a separate step, e.g. with DeepL or Google Translate.

Purfview commented 6 months ago

@niksedk Did you close it because there is no problem with "post-processing"? Or did you not try to decipher his posts? 😆