Gronis / videodiff

Find time offset and scaling between two versions of the same video.
MIT License

A couple questions #1

Open wingedonezero opened 1 year ago

wingedonezero commented 1 year ago

This is exactly what I've been looking for to combine older anime with new Blu-ray sources. My questions, I guess, are: does this work on Linux? And how well has it worked for you? Like others, I normally use Audacity to find the audio delay. But this way looks a lot easier, and it would help when you just can't figure it out.

Gronis commented 1 year ago

Hi, I made this for syncing anime as well. It works OK as long as there are no additional cuts or stuff like that. It only works if there is a cut at the beginning and/or at the end, or if the tempo of the audio has changed slightly. It's not that well tested, and the default parameters that exist right now might not work that well for your video. But you can try, and maybe we can make the software better in the end :)

Note: you need ffmpeg installed on your system. That should be easy on Linux; it is probably available in your package manager.

Right now there is no easy way to distribute a binary, so you have to compile it from source yourself. Are you familiar with that? Here is how you do it (this should work on Linux and Mac at least):

Install the Rust toolchain from here: https://rustup.rs/

Then you need to install the nightly toolchain, which allows "unstable" features (my implementation uses some):

rustup install nightly
rustup default nightly

Now you can build and run the program with cargo, which was installed as part of the Rust toolchain:

cargo run --release # <--- Build in release mode for faster execution

You should get a help message like this:

Find time offset and scaling between two versions of the same video.

Usage: videodiff [OPTIONS] <ref_file> <target_file>

Arguments:
  <ref_file>     
  <target_file>  

Options:
      --skip-start <secs>  Seconds to skip at start of file
      --skip-end <secs>    Seconds to skip at end of file
  -h, --help               Print help information
  -V, --version            Print version information

Typical usage is to sync the audio of two versions of the same video.
This software compares the footage and provides an offset and scale
for adjusting the target file audio so that it matches the reference file.

NOTE: Requires ffmpeg to be installed.

Use ffmpeg or similar to sync the audio after a result is provided.
  ffmpeg -i ref.mp4 -itsoffset {itsoffset} -i target.mp4 -filter:a:1 "atempo={atempo}"

After you have built the software, you can run it like this:

target/release/videodiff

Let me know if you have any more questions :)

Gronis commented 1 year ago

Also, it is pretty slow if the video is lengthy. I tried it with an animated movie (1h 30m) and that took ages. But for 20 min of anime it should be pretty fast.

There are a lot of parameters that could be tuned for best results. They are not configurable from the command line (at the moment!), but I could add that if you need it. For example, the itsoffset (time offset) search range is -30 sec to +30 sec, and the atempo (tempo/scaling) search range is 0.99 to 1.01. If the true values fall outside this search space, you will get junk output. So that should probably be configurable.

Gronis commented 1 year ago

Also, it does not work if there are black bars in one video and not the other. That could of course be fixed by detecting black bars and removing them, but that is not implemented at the moment (since I didn't have that problem).

wingedonezero commented 1 year ago

Thank you for the response. I did have a bit of difficulty compiling at first. I'm using Debian Bookworm, and it was stating that the rustc version wasn't high enough. I learned the exact fix you mentioned above from an article on the site. It wasn't too difficult; the directions were pretty easy.

Now I've been testing a bit, but I'd like to understand a bit more how it finds the proper delay timing. It seems to just spit out a file after a list of numbers. I'll post a screenshot below. For my basic use, let me explain a bit, because I think we both had the same idea.

I normally use Audacity, and I've also been using a couple of other tools that cross-correlate the audio. But they don't necessarily get it right each time. I've had shows seem right but wind up being way off, because the apps I was using don't pick the most common value or whatnot. Using Audacity and peaks is more accurate as of now. I'm one of the people who like to curate the best releases into each other, and most of the time it winds up being a JPN Blu-ray or DVD combined with a US release with subs, plus the subs from a sub group. Doing it the old way is very time consuming and labor intensive, as you probably know lol. For a single series it can take hours.

What I was really hoping for was something that could output the delay values, even if it's CLI only, and I could use mkvmerge to do the rest, because normally I'm working with about 3-4 different files at a time for one episode. After testing a few files with your software: is there any way to have it show the delays in the output, or a list of delay values with the most common one highlighted? I reviewed the bit about how to apply it, and I'm not trying to encode the audio at all, only find the delay or most common delay.

Normally with all these shows a single value fixes it, or the most common one does. I do like all the outputted information, but I don't know how to read it to achieve my goal lol. Honestly, compared to the cross-correlation I've been trying to use to automate this, I think you're onto something here with the image compare, because the other approach always has issues comparing the two audio tracks, since audio can be vastly different in volume and dialog with different languages. Which is why the old Audacity way is always the best. But using your image compare on two already-synced files instead of the audio could be the game changer. Sorry for the long paragraph; I've been trying to figure out a better way to do all this for a very long time lol, and I think this may be the route, since I've had tons of issues with audio correlation.

Screenshot_20230617_062424

wingedonezero commented 1 year ago

Also, adding in after some more testing: the results change each run, with the same files and the same command. And for some reason it also doesn't find the same number of matches and compares each time.


Gronis commented 1 year ago

Ok, so I'll explain what happens and what the application does and does not do:

What is the output of videodiff

It gives you the time OFFSET (time to add or skip at the beginning of the video) and TEMPO (playback speed) difference needed for the target file's audio to match the reference video. It does NOT create an mkv or mp4 file for you; it only gives you those numbers and some data about how much you should trust them (that's why the output is so verbose).

How to understand the result and get a synced video

I recommend using ffmpeg to sync the audio so that it matches the video after you get the offset and tempo. You just apply -ss (to skip audio) or -itsoffset (to add a silent offset before the audio), and the atempo filter to change the playback speed.

In all of your runs, the time offset seems to be around 0.72832 seconds (for you it says ss, which means seek, i.e. skip this amount of the audio at the beginning) with a tempo multiplier of around 1.000016 (you need to play at 1.000016 times the normal speed). Each "Result" line just means that it has found a new solution with less error (i.e. the matched video frames match better in time with the newly found offset and tempo parameters).

How does it find the time-offset and tempo?

First, images are extracted and "hashed" so that visually similar images get the same "hash". This is later used to compare the two videos and find at which offset and tempo (for the target) a certain frame appears at the same time in both videos. It filters out bad matches that do not occur in order in time. You get 5000 such frame pairs, which is very high (and improves confidence that your results are good).
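To illustrate the general idea, here is a minimal average-hash ("aHash") sketch in Rust. This is a generic perceptual-hash technique, not necessarily the exact hash videodiff uses; the function names and the 8x8 input size are assumptions for the example.

// Generic average-hash sketch (not videodiff's actual code): visually
// similar frames map to 64-bit values that differ in only a few bits.
fn average_hash(gray_8x8: &[u8; 64]) -> u64 {
    // Mean brightness of the downscaled grayscale frame.
    let mean = gray_8x8.iter().map(|&p| p as u32).sum::<u32>() / 64;
    // One bit per pixel: is the pixel brighter than the mean?
    gray_8x8.iter().enumerate().fold(0u64, |hash, (i, &p)| {
        if p as u32 > mean { hash | (1 << i) } else { hash }
    })
}

// Two frames "match" when their hashes are close in Hamming distance.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}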

To find the offset and tempo, a bunch of different values are tested using an algorithm called Particle Swarm Optimization, an optimization technique that mimics how birds fly in a swarm towards the best solution. The error is measured by the time difference between the matched frames under the candidate offset and tempo values.
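In essence, the quantity being minimized is the summed time difference over all matched frame pairs. A rough Rust sketch of that idea (hypothetical struct and function names, not the actual videodiff code; the sign convention for the offset is also an assumption):

// Hypothetical sketch of the error function an optimizer like PSO
// would minimize. Each pair holds the timestamp at which the same
// (hash-matched) frame appears in the reference and target videos.
struct FramePair {
    ref_time: f64,    // seconds into the reference video
    target_time: f64, // seconds into the target video
}

// Total error for a candidate (offset, tempo): how far apart, in
// seconds, the matched frames still are after mapping target time
// onto reference time. The optimizer searches for the values that
// minimize this sum.
fn total_error(pairs: &[FramePair], offset: f64, tempo: f64) -> f64 {
    pairs
        .iter()
        .map(|p| (p.ref_time - (offset + p.target_time * tempo)).abs())
        .sum()
}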

For you, this error is only 100 in the end. This means that over 5000 frames, roughly a total of 100 seconds of error was added up. That is 0.02 seconds off per frame. Since the image hash matching isn't 100% accurate (maybe 98%), your error probably comes from a few mismatched frames that add most of it, while the other frames are pretty much a perfect fit (this is my analysis of your result).

Why do different runs yield different results?

I'm not sure why it doesn't find the same number of comparable frames every run. There is probably something stochastic (i.e. involving randomness) happening somewhere that I'm not aware of. However, when it finds the result (offset and tempo), the algorithm does use randomness as part of finding the solution.

Your results:

Looking at your results, I would say that they are probably very good, since the error is only 107 with over 5000 frames identified in both videos. That is a lot of comparable frames. Even though the result is a little bit different between runs, it is still very close. From your three runs you get:

Run  Offset (seconds)  Tempo (playback speedup)
1    0.72940           1.00002482
2    0.72369           1.00002005
3    0.72832           1.00001594

Remember: since the result says ss for the time offset, you need to remove/drop that amount of audio at the beginning to match. If it had said itsoffset, that would mean you need to add silence at the beginning.

FFMPEG command to sync frames

To sync the frames, use ffmpeg like so (I'm using the numbers from the 3rd run here):

ffmpeg -i ~/video_ref.mkv -ss 0.72832 -i ~/video_target.mkv -filter:a:0 "atempo=1.00001594" -map 0:v -map 1:a -c:v copy -c:a:0 libopus -b:a:0 96k ~/out.mkv

This will take the video from video_ref.mkv and the audio from video_target.mkv, copy the video track (no video re-encoding), skip 0.72832 seconds of the audio, play it at 1.000016x of the original speed, re-encode the audio using libopus at 96 kbit/s (which is what I use for anime with stereo audio), and put the result in out.mkv.

Hopefully this made things clearer.

Gronis commented 1 year ago

I can explain my experience a bit better.

So, if I just need to add/remove an offset at the beginning, and/or change the playback speed just a tiny bit, it pretty much works every time. You can look at the error and the number of matched frames to understand if it's a good fit. If more than 100 frames are matched and the error is less than 0.2 per frame, it is probably a reasonable fit. More matched frames than 100 is even better, and a lower average error per frame than 0.2 is even better.
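A tiny Rust helper just restating that rule of thumb (a hypothetical sketch, not part of videodiff):

// Rule of thumb from above: many matched frames and a small average
// error per frame suggest the fit can be trusted.
fn looks_reasonable(matched_frames: usize, total_error_secs: f64) -> bool {
    matched_frames > 100
        && total_error_secs / matched_frames as f64 < 0.2
}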

It does not work well if there are several places in the video with added offsets, for example some extended or shortened cuts in the middle of the video. It cannot give you several locations where you need to cut or add time in the audio track to match the video. The reason is that the search space becomes too big and hard to solve, and you would probably get a very bad result in the end. It could probably be implemented, but I have not done it in this application.

If you try to sync a video with several smaller time differences in some cuts, say 4 different cuts over the entire video, it will try to insert a single cut, and then change the tempo so that the majority of the frames match up. This will result in a bigger error score in the end since many more frames will accumulate error.

For my use-case, I synced about 50 episodes of anime, where about 40 of them worked very well. The 10 that didn't were ones where the editing team had added or removed time during cuts in several places in the episode, which, as I've explained, does not work with this implementation. For those episodes, I used Audacity manually to identify where the cuts are, which was painful labor.

Gronis commented 1 year ago

Hi again!

I tested the latest software on the same anime as last time and got a bogus result. I tracked this down to a bug that caused two video streams to be processed (the image is resized to 20x20 pixels as part of the hashing algorithm, but both the original and the small version were parsed and used by videodiff). This caused RAM usage to spike like crazy (GBs per minute of video), and the timestamps were multiplied by 2 since images from two streams were coming in.

So, with the current version on master, you should see huge speed improvements (like 100x), because RAM usage is 10MB per 1 min of video rather than 1GB. And the result should be correct as well 😃

So I recommend you build the latest version from GitHub before trying to actually sync any video.

Gronis commented 1 year ago

Does your input have multiple cuts where the length of the cut varies between the sources (i.e. you have to add or remove time in multiple places in a single file when using Audacity to sync the audio)? I'm thinking about implementing a solver for multiple cuts, so you can specify the maximum number of cuts, and it will try from 0 up to that number. I believe this could work well, since I've been able to sync with only 30 seconds of video footage, and that seems to give enough samples for a good time offset and scaling within that small time frame. So a solver with multiple cuts could actually give an even better result when the source videos have multiple cuts whose lengths vary between the sources.

If you only have to cut/add silence in one place in Audacity (e.g. at the beginning), then the current implementation should work very well for your videos without anything else added.

wingedonezero commented 1 year ago

Sorry for the delay; it's going to take me a bit to finish reading everything, because you gave me a lot to look into. I wasn't using ffmpeg as you mentioned before; I was using MKVToolNix. But it looks like the way you mentioned is the better way. I never messed with tempos before, only the delay settings. But I do have some input: I compiled the latest version and was going to start testing everything you mentioned, but I got an error.

Screenshot_20230618_041710-edit

I have checked and everything is up to date. The previous version worked, but with the bug you mentioned. And I think what you mentioned would definitely give more stable results. I'll be able to do more testing and experiment more once you have a chance to look at the error.

Now as for your other question: there are no cuts. I won't be doing any cuts, just straight sync. If the video is off because of scenes, I just use the US release; I haven't quite learned all the cutting stuff. And 90 percent of the time it just syncs and isn't off by too much. Mostly DVD and Blu-ray stuff, but I do a lot.

Now I do have one question about your ffmpeg line. You have the encode part, but I was reading that you can do it without encoding the audio. Do you know what the line would be for that? So basically, losslessly add the silence or delay and change the tempo. I don't use ffmpeg much, so I'm just learning about all this from you. I didn't even know you could do that with it.

So basically the goal is to sync the audio losslessly with no cut, only changing the delay and tempo you mentioned, which I was never doing before. From what I found, those parameters you mentioned don't actually edit the file, only the metadata, and it can be done with no re-encoding. It's just a question of how, as I don't understand all the parameters for ffmpeg.

Also, is there any way to pick the audio track it syncs on? Most of the time there are subs and multiple audio tracks in the files. When I was running it before, I wasn't sure what it was syncing; I assumed track 1 of each file. I normally add all tracks.

There may be a delay in my responses, just letting you know ahead of time. But I'm learning a lot. I've been doing it the same way for so long because I didn't really have anyone to help me with this. Now you're giving me so many new things to explore to better my process. There's not really a guide for all this in one place that explains the right way. A lot of people seem to use MKVToolNix because it's easier, but from your comments I learned it doesn't work on every device the way ffmpeg does. So I really appreciate all the comments above. And I do think your software is going to save so much time; I just have to get the right commands down and experiment a bit more. Looking forward to your responses.

Gronis commented 1 year ago

First off, which version of ffmpeg are you using? I think the bug is due to different ffmpeg versions. I'll read all your text once I've fixed the bug.

wingedonezero commented 1 year ago


ffmpeg version 5.1.3-1, which is the default on Debian Bookworm, which just released last week.

Gronis commented 1 year ago

Ok, I just pushed a new version that works with both the latest version and v4.2, so it should work for your version.

wingedonezero commented 1 year ago

That worked, ran just fine. Now as for your other fix: I'm using the same files in testing and I got 3 different results. Were yours exactly the same when you ran it? Because they are still all different each run, unless I misunderstood your solution.

wingedonezero commented 1 year ago

..

Gronis commented 1 year ago

Your solutions are very close, so there will not be a noticeable difference. The algorithm for finding a solution is stochastic (has some randomness), as mentioned before. For example, a tempo of 1.00002128 shifts a 20 min video by about 25 milliseconds, and the difference in time offset between 0.72385 and 0.72811 is 4.26 milliseconds, so your solutions are identical in practice. The difference is just noise.
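A quick back-of-the-envelope check of that 25 ms figure in Rust (just illustrative arithmetic, not part of videodiff):

// Playing `duration` seconds at speed `tempo` takes duration / tempo
// seconds, so the end of the clip drifts by the difference.
fn drift_secs(duration_secs: f64, tempo: f64) -> f64 {
    duration_secs - duration_secs / tempo
}

fn main() {
    // A 20 min episode at tempo 1.00002128 ends ~25 ms earlier.
    println!("{:.4}", drift_secs(20.0 * 60.0, 1.00002128)); // ~0.0255 s
}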

Gronis commented 1 year ago

I've read through your comment regarding lossless output. It seems that in order to stay lossless, you cannot change the tempo. Therefore I added an option to skip the tempo search and just use 1.0x (no change in tempo). Just add --skip-tempo to use a static tempo.

To sync without a tempo change (lossless), you can tell ffmpeg to copy both the audio and video codecs:

ffmpeg -i ref_video.mkv -ss 0.72385s -i target_video.mkv -map 0:v -map 1:a -c:v copy -c:a copy ~/out.mkv

Edit: This will add all audio tracks from target_video.mkv and skip 0.72385 seconds in the beginning for all of them.

Gronis commented 1 year ago

I think adding silence (when using itsoffset rather than ss) does not work for lossless. It will just have no audio for a few samples; some video players cannot handle that, and the audio gets out of sync. So I recommend that you don't use the copy audio codec in that case.

What format do you use when exporting audio from Audacity? Just normal WAV? If you don't want to degrade the quality, you can use a lossless audio format with ffmpeg too (like in Audacity), for example FLAC or WAV. You'll have to google how to output that, because WAV comes in different formats. FLAC might be your best option, but I'm not sure how many video players support it.

Edit: You can read about the complexity of adding silence without re-encoding here (I just don't bother, it's easier to just re-encode): https://superuser.com/questions/579008/add-1-second-of-silence-to-audio-through-ffmpeg