Investigate using the cencro method to speed up VMAF calculation

veikk0 commented 1 year ago

Brought to my attention by this Reddit post is a method to significantly speed up VMAF calculations by cropping the center of the image and calculating the VMAF from that cropped area. As described in the paper, this method is still surprisingly accurate; for a 4K video, an 1800p crop saves up to 40% of processing time with no perceivable error, and a small 360p crop cuts processing time by ~95% with an error of about 4%.

The authors of the paper tested this on five UHD video clips of 10 seconds each, and a test set of 1080p gaming videos called GamingVideoSET. The smaller crops performed better than expected on the gaming footage, and the authors speculated that this may have been because "gaming videos are more centric organized, due to the fact that usually heroes/game characters, or important parts of the game are in the middle of the screen", or because of the fixed block sizes of H.264. So how well different crop sizes actually work for a larger variety of footage and different resolutions seems to be uncertain. Also, it seems the authors used the 4K VMAF model for all VMAF calculations.

An example FFmpeg command for testing a 720p crop using this method:

ffmpeg -r 24 -i distorted.mp4 -r 24 -i reference.mp4 -an -sn -map 0:V -map 1:V -lavfi "[0:v]setpts=PTS-STARTPTS,crop=1280:720[dist];[1:v]setpts=PTS-STARTPTS,crop=1280:720[ref];[dist][ref]libvmaf=model=version=vmaf_4k_v0.6.1:n_threads=16" -t 30 -f null -

(FFmpeg automatically centers the crop when no x and y parameters are given, which is especially handy for this use case.)
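To try other crop sizes without hand-editing the filtergraph, the command above can be generated from a crop width and height. This is a minimal sketch: the function names `cencro_filtergraph` and `cencro_command` are made up for illustration and are not part of ab-av1 or FFmpeg, and the defaults simply mirror the example command (4K model, 16 threads, 24 fps, 30 seconds).

```python
import shlex


def cencro_filtergraph(crop_w, crop_h, model="vmaf_4k_v0.6.1", n_threads=16):
    """Build a -lavfi graph that center-crops both inputs before libvmaf.

    Hypothetical helper; it only reproduces the filtergraph from the
    example command with a configurable crop size.
    """
    # crop=W:H without x/y offsets lets FFmpeg center the crop.
    crop = f"crop={crop_w}:{crop_h}"
    return (
        f"[0:v]setpts=PTS-STARTPTS,{crop}[dist];"
        f"[1:v]setpts=PTS-STARTPTS,{crop}[ref];"
        f"[dist][ref]libvmaf=model=version={model}:n_threads={n_threads}"
    )


def cencro_command(distorted, reference, crop_w, crop_h, fps=24, seconds=30):
    """Return the full FFmpeg argument list for one cropped VMAF run."""
    return [
        "ffmpeg",
        "-r", str(fps), "-i", distorted,
        "-r", str(fps), "-i", reference,
        "-an", "-sn", "-map", "0:V", "-map", "1:V",
        "-lavfi", cencro_filtergraph(crop_w, crop_h),
        "-t", str(seconds), "-f", "null", "-",
    ]


# Print a shell-quoted command for a 720p center crop.
print(shlex.join(cencro_command("distorted.mp4", "reference.mp4", 1280, 720)))
```

Passing the list to `subprocess.run` instead of printing it would avoid shell quoting issues with the brackets and semicolons in the filtergraph.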

This could be a very good fit for ab-av1, since VMAF is used extensively. I also have a suspicion that with a large enough number of --samples, any error introduced by this center cropping method will get averaged out. And as long as the accumulated error isn't too large, combining this with n_subsample would make VMAF calculation really fast.

manbug10 commented 1 year ago

This topic is interesting; it should be analyzed and tested, @alexheretic.

alexheretic commented 1 year ago

This is interesting, thanks for raising it.

I know the default 1080p model is quite sensitive to resolution, which is why we auto-scale smaller videos to 1080p before running vmaf. This method seems to contradict that, or perhaps the 4k model just isn't sensitive to resolution in the same way? That in itself is quite interesting.

manbug10 commented 1 year ago

Can it be added in future versions?

veikk0 commented 1 year ago

If I had to make a guess, I'd say that because 4K videos have so much detail, cropping even a small area (relative to the total number of pixels) can result in a representative sample. The number of pixels in 2160p is pretty insane when you think about it: a 720p crop is only 11% of the total pixels, and even 1080p is only a quarter.
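The pixel fractions are easy to check. The sketch below assumes 16:9 crop sizes (640x360, 1280x720, 1920x1080, 3200x1800) against a 3840x2160 frame; the exact crop dimensions used in the paper are an assumption here.

```python
def pixel_fraction(crop_w, crop_h, full_w=3840, full_h=2160):
    """Fraction of the full frame's pixels covered by a crop."""
    return (crop_w * crop_h) / (full_w * full_h)


# Assumed 16:9 crop sizes; the paper's exact dimensions may differ.
crops = {
    "360p": (640, 360),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1800p": (3200, 1800),
}

for name, (w, h) in crops.items():
    print(f"{name}: {pixel_fraction(w, h):.1%}")
# 360p: 2.8%
# 720p: 11.1%
# 1080p: 25.0%
# 1800p: 69.4%
```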

So far I've only tested a couple of 1080p videos using the 1080p model, and there's definitely much more error if you try to use crop sizes that are proportionally as drastic as the ones used by the authors for 4K. A 70% crop by width and height (crop=iw*0.7:ih*0.7, about a 50% reduction in the number of pixels) seems to work well for full-frame 1080p. I started an ab-av1 run with --samples 8, copied those sample files elsewhere, and encoded those with a VMAF target of 93. I then calculated VMAF both normally and with a 0.7 crop. Ground truth average VMAF: 93.171823. Crop 0.7: 92.9925925. Error: 0.1792305 VMAF points.

The second test I did wasn't full-frame 1080p, but a lossless version of the Sintel short film at 1920x818 (encoded from lossless PNGs). Same methodology, but only a 0.7 horizontal crop and no vertical cropping, since I figured the resolution was already close enough to 0.7. Ground truth VMAF: 93.06982725. Crop: 93.21469025. Error: 0.144863 VMAF points.

If these results hold up in further testing, then a ~50% reduction in pixel count and a ~2x speed-up in VMAF calculation for full-frame 1080p in exchange for an error of 0.18 VMAF points seems like an acceptable tradeoff to me.
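The numbers from the two tests can be reproduced with a few lines of arithmetic. This is only a sketch using the scores reported above; the helper names are made up for illustration.

```python
def kept_pixel_fraction(scale_w, scale_h=1.0):
    """Fraction of a frame's pixels kept by crop=iw*scale_w:ih*scale_h."""
    return scale_w * scale_h


def vmaf_error(ground_truth, cropped):
    """Absolute difference in VMAF points between full and cropped runs."""
    return abs(ground_truth - cropped)


# Full-frame 1080p test: 0.7 crop on both axes.
print(f"{kept_pixel_fraction(0.7, 0.7):.0%}")              # 49%
print(f"{vmaf_error(93.171823, 92.9925925):.7f}")          # 0.1792305

# Sintel test at 1920x818: 0.7 crop on width only.
print(f"{kept_pixel_fraction(0.7):.0%}")                   # 70%
print(f"{vmaf_error(93.06982725, 93.21469025):.6f}")       # 0.144863

# Sintel crop measured against a full 1920x1080 frame.
print(f"{kept_pixel_fraction(0.7) * 818 / 1080:.1%}")      # 53.0%
```

Measured against a full 1920x1080 frame, the width-only Sintel crop lands close to the same ~50% pixel count as the 0.7 crop on both axes.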

I still want to do further testing, try out 4K, different sample sizes, and a bunch of other things, but I have hardly any high-quality footage, and none for 2160p. I'd prefer using something long, like feature films, or at least short films, to be able to test out a typical ab-av1 use case of taking a number of samples from the same video.

Things I also want to test in the future:

Would using multiple crops from different places in the video and averaging their VMAF get more accurate results? For example, a center crop, and then, say, a crop a quarter of the width and height (one-sixteenth of the pixels) from the bottom-right corner with crop=in_w/4:in_h/4:in_w:in_h. Would probably work better for 4K than 1080p. Also, as the authors of the paper alluded to in their conclusion, "more advanced patterns than center crops are possible and will be checked to compensate the introduced error."
Vertical video and various types of user-generated content.
Are there differences between encoders and/or encoder settings? Different adaptive quantization methods can allocate bitrate in various ways within a frame, which could affect accuracy. This may or may not make a difference with a large enough sample size.
How well does n_subsample work with this cropping method? The subsampling feature itself can also be inaccurate in an unintuitive way; in my experience it can have quite a large error when using a value of n_subsample divisible by 2. My current theory is that it's because of this bug/feature, which according to the last comment is present in multiple encoders. So strangely enough, n_subsample=3 and 5 are in my experience more accurate than 2, at least with x264 and SVT-AV1.
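The multi-crop idea from the list above could be prototyped roughly as follows. This is a sketch under stated assumptions, not a tested method: `crop_expr` and `mean_vmaf` are hypothetical helpers, the corner crop uses explicit `in_w-out_w` / `in_h-out_h` offsets rather than out-of-range values, and combining the per-crop scores with an unweighted mean is just one possible choice.

```python
def crop_expr(frac_w, frac_h, anchor="center"):
    """Build an FFmpeg crop filter expression for a fractional crop.

    Hypothetical helper. "center" omits x/y so FFmpeg centers the crop;
    the corner anchors set explicit offsets from the frame edges.
    """
    size = f"crop=in_w*{frac_w}:in_h*{frac_h}"
    offsets = {
        "center": None,
        "top_left": "0:0",
        "top_right": "in_w-out_w:0",
        "bottom_left": "0:in_h-out_h",
        "bottom_right": "in_w-out_w:in_h-out_h",
    }
    if anchor not in offsets:
        raise ValueError(f"unknown anchor: {anchor}")
    return size if offsets[anchor] is None else f"{size}:{offsets[anchor]}"


def mean_vmaf(scores):
    """Unweighted mean of per-crop VMAF scores (an assumed combination rule)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# One center crop plus a quarter-width, quarter-height corner crop.
for anchor, frac in (("center", 0.7), ("bottom_right", 0.25)):
    print(crop_expr(frac, frac, anchor))
# crop=in_w*0.7:in_h*0.7
# crop=in_w*0.25:in_h*0.25:in_w-out_w:in_h-out_h
```

Each expression would replace the `crop=...` part of the filtergraph in a separate VMAF run, with the resulting scores passed to `mean_vmaf`. Whether an area-weighted mean would track the full-frame score better is an open question.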

alexheretic / ab-av1

Investigate using the cencro method to speed up VMAF calculation #133