baxtree / subaligner

Automatically synchronize and translate subtitles, or create new ones by transcribing, using pre-trained DNNs, Forced Alignments and Transformers. https://subaligner.readthedocs.io/
https://hub.docker.com/r/baxtree/subaligner
MIT License

[Questions] Will this work on cut content? #89

Closed. Johndirr closed this 1 month ago.

Johndirr commented 2 months ago

I was wondering how this works on cut content, e.g. if you want to align the subtitles of a director's cut to a TV version. Will subaligner remove lines that are not present in the TV version?

I would also like to know whether audio fingerprinting is used anywhere in the alignment process.

Thank you 👍

baxtree commented 2 months ago

Hi, it doesn't use fingerprinting at the moment. Each aligning process takes in a single video-subtitle pair. Subtitle literals are not altered, only the time codes, so if your input subtitle contains something the video doesn't manifest, you will still end up with out-of-sync output. To summarise the use case: you have a director's cut version and a synchronised subtitle, and you want a new subtitle containing a subset of the literals, with time codes adjusted to the TV version.
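To illustrate the time-codes-only behaviour, here is a minimal sketch using pysrt (illustrative only, not subaligner's internals): a global shift moves every cue's start and end while the text stays untouched.

```python
# Illustrative only, not subaligner's internal code: shift every cue
# by a fixed amount; the subtitle text itself is never modified.
import pysrt

subs = pysrt.open("directors_cut.srt")  # hypothetical input path
subs.shift(milliseconds=600)            # time codes change, literals don't
subs.save("shifted.srt", encoding="utf-8")
```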

I reckon the final alignment quality will largely hinge on the fingerprint matching between those two audio tracks. After a quick search I found a bunch of research papers and tools for calculating fingerprint hashes. Do you happen to know of any real-world adoptions? This would be a nice feature to support if the matching performance is good.
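For a flavour of what frame-level fingerprint comparison could look like, here is a toy sketch (it assumes 16 kHz mono audio as numpy arrays; the function names are mine, and this is neither subaligner code nor a production fingerprinting scheme):

```python
# Toy fingerprinting sketch: reduce each audio frame to the index of its
# dominant frequency bin, then compare the two bin sequences directly.
import numpy as np

def fingerprint(samples, frame_size=2048, hop=512):
    """One 'hash' per frame: the strongest rFFT bin of a windowed frame."""
    window = np.hanning(frame_size)
    return np.array([
        np.argmax(np.abs(np.fft.rfft(samples[i:i + frame_size] * window)))
        for i in range(0, len(samples) - frame_size, hop)
    ])

def similarity(fp_a, fp_b):
    """Fraction of overlapping frames whose dominant bins agree."""
    n = min(len(fp_a), len(fp_b))
    return float(np.mean(fp_a[:n] == fp_b[:n]))
```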

Johndirr commented 2 months ago

Sorry for the dumb question, but I had some trouble understanding how subaligner works. I read up on "forced alignment" and I have a much better understanding now. It's really impressive how many commercial products for forced alignment are on the market.

In my specific case I always have a TV recording of a show or a movie and the respective subtitles. I also have a video of the same show/movie with much better picture quality but without any subtitles. By comparing the audio of both videos I'm able to find out how the subtitles have to be edited (shifted, cut).

I usually use one of a few tools, AudioAlign among them, to support the process of adapting the subtitles.

Since there is a chance that the source subtitles are not 100% in sync or just have bad timings, a lot of manual fine-tuning is needed. That's the reason I had a look at your project :). I have also tried more than once to write tools to help speed up the process, but I was never able to totally automate everything. It's not like you can just match every subtitle line to the new video, because there is not always a match. There can also be multiple matches in the case of a retrospective or reused music sequences, or sometimes the soundtrack was changed. So something like a sliding-window comparison, or inferring the timing of neighbouring subtitle lines from definitive matches, has to be used.
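Roughly, the sliding-window idea looks like this (a toy numpy sketch with hypothetical names, not taken from any of the tools above): each window of the reference audio is correlated against the other track, and windows without a confident match (changed soundtrack, no counterpart) can simply be dropped by thresholding the score.

```python
# Toy sliding-window matcher: for each window of `ref`, find the
# best-matching position in `other` via cross-correlation and report
# (window start, detected offset, normalised score), all in seconds.
# O(len(ref) * len(other)); FFT-based correlation would be faster.
import numpy as np

def window_offsets(ref, other, sr, win_s=10.0):
    win = int(win_s * sr)
    results = []
    for start in range(0, len(ref) - win, win):
        chunk = ref[start:start + win]
        corr = np.correlate(other, chunk, mode="valid")
        best = int(np.argmax(corr))
        score = corr[best] / (np.linalg.norm(chunk) *
                              np.linalg.norm(other[best:best + win]) + 1e-9)
        results.append((start / sr, (best - start) / sr, score))
    return results  # windows with a low score can be discarded
```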

Visualizing everything was always the biggest help when trying to synchronize subtitles manually, because it was instantly clear how the subtitles had to be edited, e.g. via the alignment graph of AudioAlign or by just plotting the different matches.

There are some articles about audio fingerprinting and subtitles on the net. I also read a paper which described the idea of giving every line a fingerprint to match subtitle and video on the fly.

baxtree commented 2 months ago

Thanks for sharing those links; I will have a look. Here is a collection of FA tools I rediscovered, though it may be out of date by now.

What I was really wondering is, in this subtitle re-sync context, what the quality of fingerprint comparison between two similar audio pieces is generally like (not only for musical content but also for speech). If it is consistently high, the follow-up subtitle alignment could become a trivial task, unless your re-versioned show/movie has cut speech in half, in which case aligning a whole subtitle line to partial speech would be challenging and is another story. Agreed that visualisation will definitely be helpful, although it isn't the focus of this tool.

Johndirr commented 2 months ago

Wow, this is a great collection of FA tools! Thank you.

Depending on the settings and the algorithm used, you get varying results. But it's really easy to find good matches (~90% similarity) for ~90% of a video when you compare a TV VHS recording and a DVD rip with default settings. You also get some false positives, but these can be filtered out either by setting a lower error threshold or by filtering peaks. Here is an example showing an alignment graph where the offset was correctly detected at around 600 ms. As you can see, the peaks could easily be filtered out by doing some kind of averaging:

[alignment graph image]
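The kind of averaging I mean could be as simple as a moving median over the per-window offsets (a toy sketch with names of my own choosing, not taken from AudioAlign): isolated false positives get replaced by the value of their neighbourhood, while genuine step changes in offset survive.

```python
# Median-filter the per-window offset curve so isolated spikes
# (false positives) are removed while real offset jumps are kept.
import numpy as np

def smooth_offsets(offsets_ms, k=5):
    pad = k // 2
    padded = np.pad(np.asarray(offsets_ms, dtype=float), pad, mode="edge")
    return np.array([np.median(padded[i:i + k])
                     for i in range(len(offsets_ms))])
```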

Speech that is cut in half is an edge case I have never seen yet, but you are right, it can happen.

It would be nice if this gave you some inspiration for further enhancing your tool, but don't feel obliged to do anything. I really was just asking to get a better understanding of everything, and maybe I will get my hands dirty and program something again, if I feel like it :)

baxtree commented 2 months ago

Yes, it has indeed given me some. I can see the tool you referred to uses DTW behind the scenes, which is also what powers the FA in subaligner. So in essence, the use-case difference is that you need an automagical cut of the original subtitle to fit a shorter video, while in contrast this tool will never remove or alter subtitle lines (except their time codes), and its FA can only work reliably in scenarios where the subtitle carries the same or less information than the video conveys, but not more.
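For reference, the textbook DTW recurrence looks like this (a self-contained sketch of the classic algorithm on 1-D feature sequences; subaligner's actual FA implementation is more involved and not shown here):

```python
# Classic O(n*m) dynamic time warping: D[i, j] is the minimal cumulative
# cost of aligning the first i elements of `a` with the first j of `b`.
import numpy as np

def dtw_cost(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```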

baxtree commented 1 month ago

Closing this issue due to no recent activity. I feel it would be easier to cut the subtitle at the same time as the TV version is being cut, with or without the assistance of a UI, allowing the time codes to be "re-mapped". Thanks for the inspirational question.