alexheretic / ab-av1

AV1 re-encoding using ffmpeg, svt-av1 & vmaf.
MIT License
426 stars 29 forks

Bug: Do not apply `--vfilter` to reference input during `crf-search` #213

Open WhitePeter opened 1 month ago

WhitePeter commented 1 month ago

In my evaluation of ab-av1's crf-search I noticed that the scores seemed too high compared to my manual tests. I have now realized that, according to --help, --vfilter is also applied to the reference input unless --reference-vfilter is set. I think this approach is flawed and will lead to overestimation in the case of downscaling.

This is what I get without --reference-vfilter:

./ab-av1 crf-search --max-crf=16 --min-vmaf=89 --thorough --vfilter scale=qhd:flags=spline+accurate_rnd --sample-duration=10s --sample-every=100s --svt tune=3:scm=0 -i sample.mkv
...
crf 16 VMAF 92.72 predicted video stream size 22.02 MiB (34%)

And this is the result with --reference-vfilter=copy:

./ab-av1 crf-search --max-crf=16 --min-vmaf=89 --thorough --vfilter scale=qhd:flags=spline+accurate_rnd --reference-vfilter=copy --sample-duration=10s --sample-every=100s --svt tune=3:scm=0 -i sample.mkv
...
crf 15 VMAF 89.15 predicted video stream size 23.68 MiB (36%)

The input is a 5-minute sample of a 1080p video, which I want to downscale to 540p ('qhd' in ffmpeg-utils parlance). In the default case ab-av1 applies the above scale filter to both the reference and the encoded sample, and later in the chain both get rescaled to 1080p for the VMAF calculation. But this is not what I expected given the intent of VMAF's default model. I expected the encoded version to be compared to the unscaled input, or rather the reference to be scaled only once to fit the VMAF model, which in this case should be a no-op because the source is already 1080p.
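To make the difference concrete, here is roughly how I picture the two VMAF filter graphs (a paraphrased sketch of my understanding, not the literal graphs ab-av1 builds):

# [0:v] = the 540p sample encode, [1:v] = the 1080p reference sample
# default (--reference-vfilter unset): the reference also gets --vfilter, then both
# sides are bicubic-upscaled back to 1080p for the 1k model
[0:v]scale=1920:1080:flags=bicubic[dis];[1:v]scale=qhd:flags=spline+accurate_rnd,scale=1920:1080:flags=bicubic[ref];[dis][ref]libvmaf

# what I expected (--reference-vfilter=copy): the already-1080p reference stays untouched
[0:v]scale=1920:1080:flags=bicubic[dis];[1:v]copy[ref];[dis][ref]libvmaf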

BTW, I find bicubic an odd choice for the scaling, given its blurriness. That might even flatter the encoded version, since it can hide more artifacts which VMAF is supposed to "find". Lanczos or spline seem more appropriate, my preference being spline. Never mind, I found the recommendation by the VMAF devs.

Basically, what I expected was a comparison to the untouched original, and what I got was a comparison to the downscaled reference that gets scaled up again. While the latter is better suited to gauging the quality loss introduced purely by the codec, the former is better suited to gauging overall quality loss, which is my understanding of the original intent of VMAF. And since we are talking about machine-learning results, one should not discount the influence of changing the model's underlying assumptions.

To drive my point home, I want to briefly summarize my understanding of the inception of VMAF and its intended use. The original training of the (now default) model was done on an HDTV (1080p) in a living-room setting at a viewing distance of three times the height (3H) of the TV screen. Netflix wanted to compare subjective quality assessments under conditions where people could actually appreciate 1080p, hence the 3H distance. So it does not make sense to downscale the 1080p reference.

I actually want to know the perceived quality loss, including that of downscaling. Having looked at some of the convex hull encoding articles and presentations, this seems to be exactly the purpose of VMAF: find the sweet spot of quality with degrees of freedom not only in the encoder settings but also in the resolution. A 1080p video at crf=70 might look way worse than 540p or even 360p at crf=35. Since I am kind of a hoarder and want to keep as much video entertainment on my puny 1TB SSD as possible, I found that downscaling saves a lot of space for not too much loss in viewing pleasure. In other words: I couldn't care less whether I can read the newspaper on the desk in a movie scene or some such. And stop placing those kinds of nonsense easter eggs, Hollywood; they only look like a desperate attempt at justifying HD; if your story is malarkey, HD won't save you. ;-) Plus, my Ryzen 5 3500U is not powerful enough to even contemplate encoding 1080p AV1; 432p or 540p, on the other hand, are quite feasible and net some great space savings compared to my trusted x264 toolchain.

WhitePeter commented 1 month ago

Additional info

I created a 10s sample (1080p) and used auto-encode with and without --reference-vfilter copy. With:

./ab-av1 auto-encode --min-vmaf=89 --thorough --vfilter scale=qhd:flags=spline+accurate_rnd --samples=1 --svt tune=3:scm=0 -i sample.10s.mkv -o sample.10s.ref-vfilter=vfilter.mkv
  Searching 00:01:27 [...] (crf 29, VMAF 89.08, size 20%)
  Encoding  00:00:06 [...] (40 fps, eta 0s)
Encoded 593.63 KiB (27%)
./ab-av1 vmaf --reference sample.10s.mkv --distorted sample.10s.ref-vfilter=vfilter.mkv
  00:00:14 [...] (vmaf 18 fps, eta 0s)
84.53321

(replaced the '#' progress bar)

Note how the actual VMAF score is roughly 5 points below the "estimate", which in this case should be a precise prediction: the input file is shorter than the sample duration, so the sample is the whole input, i.e. predicted and actual VMAF should be equal.

Without:

./ab-av1 auto-encode --cache false --min-vmaf=89 --thorough --vfilter scale=qhd:flags=spline+accurate_rnd --reference-vfilter copy --samples=1 --svt tune=3:scm=0 -i sample.10s.mkv -o sample.10s.ref-vfilter=copy.mkv
  Searching 00:01:02 [...] (crf 10, VMAF 88.98, size 65%)
Error: Failed to find a suitable crf

As is expected if the actual VMAF is 84.5.

Same as before but with --min-vmaf 84.5:

./ab-av1 auto-encode --cache false --min-vmaf=84.5 --thorough --vfilter scale=qhd:flags=spline+accurate_rnd --reference-vfilter copy --samples=1 --svt tune=3:scm=0 -i sample.10s.mkv -o sample.10s.ref-vfilter=copy.mkv
  Searching 00:01:03 [...] (crf 29, VMAF 84.53, size 20%)
  Encoding  00:00:06 [...] (40 fps, eta 0s)
Encoded 593.63 KiB (27%)
./ab-av1 vmaf --reference sample.10s.mkv --distorted sample.10s.ref-vfilter=copy.mkv
  00:00:14 [...] (vmaf 18 fps, eta 0s)
84.53321

Compare this VMAF to the 1st run above and note how they are exactly the same. So I would propose changing the default behaviour to not apply the same filter chain to the reference, a behaviour that is also somewhat hidden at the bottom of the documentation and only implicitly stated. If one actually wants to capture only the coding losses without the scaling element, one can set the same values for the reference and encoding filter chains. And maybe as a bonus ab-av1 could scale the reference only if its resolution actually differs from the chosen model's, because I am not quite sure whether ffmpeg's scale filter automatically becomes a no-op when the input resolution equals the output resolution.

On a side note, I did not expect the second run with different filter setup to pull the result of the previous one from the cache, but it did. Maybe that warrants another issue?

P.S.: Captain Obvious adds that the workaround for the time being is to use --reference-vfilter copy. But I really think the current behaviour is not technically correct and leads to unexpected results, as I hope I have illustrated sufficiently by now. ;-)

alexheretic commented 1 month ago

On a side note, I did not expect the second run with different filter setup to pull the result of the previous one from the cache, but it did. Maybe that warrants another issue?

Thanks, yes this is a bug. Addressed in 434b0f0b0072082568afa3dca169d48e77193f31

alexheretic commented 1 month ago

Thanks for your suggestion. The intention is indeed to measure the VMAF of the encoding rather than of the vfilter itself. You can override this behaviour with --reference-vfilter, as you noted.

When downscaling, ab-av1 tells you the quality of the encode compared to a "perfect" downscale; it does not tell you how much quality was lost by the downscaling itself. I see this as generally more useful for this case, and as the approach that generalises to arbitrary vfilters.

Using a vfilter and testing against an unmodified reference only works in very special cases. In general it will either simply not work (vmaf will fail) or vmaf will return a very low score, because the distorted video has been changed by the vfilter: changing the framerate, scaling, cropping, modifying colors, etc. I wouldn't expect to be able to do a useful VMAF comparison against the reference in those cases.
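For example (an illustrative, untested invocation with a made-up input name), a cropping vfilter changes the distorted dimensions, so comparing it against the untouched reference cannot work without extra rescaling:

ab-av1 crf-search -i input.mkv --min-vmaf=89 --vfilter crop=iw:ih-140 --reference-vfilter copy
# the distorted stream is now 140px shorter than the reference, so libvmaf can be
# expected to abort with "input height must match" unless something rescales one side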

In your example scaling to 540p does work, but it's a side effect of another setting --vmaf-scale that isn't directly related.

--vmaf-scale Video resolution scale to use in VMAF analysis. If set, video streams will be bicupic scaled to this width during VMAF analysis. auto (default) automatically sets based on the model and input video resolution. none disables any scaling. WxH format may be used to specify custom scaling, e.g. 1920x1080.

auto behaviour: 1k model (default for resolutions <= 2560x1440) if width and height are less than 1728 & 972 respectively upscale to 1080p. Otherwise no scaling. 4k model (default for resolutions > 2560x1440) if width and height are less than 3456 & 1944 respectively upscale to 4k. Otherwise no scaling.

Scaling happens after any input/reference vfilters.

[default: auto]

For VMAF comparison of 540p videos the default behaviour is to bicubic-upscale to 1080p and use the 1k model. This is/was the recommended procedure according to the vmaf devs. In particular, the 1k model will return a much higher score for <1080p videos if they are not upscaled this way. But the intention of this logic is not to scale differently sized videos.
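For two 540p streams that procedure amounts to roughly the following (an illustrative standalone ffmpeg invocation with placeholder file names, not the exact command ab-av1 builds):

ffmpeg -i distorted.540p.mkv -i reference.540p.mkv -filter_complex \
  "[0:v]scale=1920:1080:flags=bicubic[dis];[1:v]scale=1920:1080:flags=bicubic[ref];[dis][ref]libvmaf=model=version=vmaf_v0.6.1" \
  -f null -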

For example, if instead of 1080p->540p you were encoding 2160p->1080p, you would find that using --reference-vfilter copy fails with "[Parsed_libvmaf_7 @ 0x7d06b0021040] input width must match.", since no additional scaling will be done automatically.

So it is possible to compare to an unscaled reference video when:

For a reference near 1080p or below: --vfilter SCALING_ONLY --reference-vfilter copy --vmaf model=version=vmaf_v0.6.1 --vmaf-scale=1920x1080
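Applied to the 1080p->540p case in this issue, that would look something like (untested):

ab-av1 crf-search -i sample.mkv --min-vmaf=89 \
  --vfilter scale=qhd:flags=spline+accurate_rnd \
  --reference-vfilter copy \
  --vmaf model=version=vmaf_v0.6.1 \
  --vmaf-scale=1920x1080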

So I believe the current behaviour is the more correct one, as it supports VMAF comparison of the encoding for arbitrary vfilters. This issue does suggest that the docs should be clarified. Perhaps we can improve the --vfilter docs?

WhitePeter commented 1 month ago

Thanks for your elaboration but, respectfully, I disagree. May I direct your attention to Netflix's recommendations? They say: compare to the unfiltered reference. And as my example above shows, the prediction is inconsistent with a subsequent run of vmaf, which does follow Netflix's recommendation, but only by accident, since I don't have to use the scale filter in that chain, the encoded version already being scaled down. But I admit that in my suggested workaround I only had scaling in mind, so it is not a universal solution.

May I suggest an alternative? What about exposing ffmpeg's scale filter separately and applying it only to the encoded version, at the end or the start of the filter chain (not sure which yet)? On one hand I prefer decimating frames via the fps filter first, because then the scale filter only has to deal with half the frames (I often come across videos that have a frame rate of 50 fps but in reality are 25 fps, with every second frame a duplicate of the preceding one). On the other hand, other filter chains might benefit from the lower resolution. Maybe something like --vf-pre (prepend to the --vfilter chain) and --vf-add (append), to be able to decide on a case-by-case basis, as sketched below? This way the whole --vfilter chain (i.e. fps) could keep being applied to the reference and only the scale filter would differ.
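Purely hypothetical syntax, just to illustrate the split (neither flag exists today):

# hypothetical: --vfilter is still applied to both the encode and the reference,
# while --vf-add only extends the encode-side chain
ab-av1 crf-search -i in.mkv --min-vmaf=89 \
  --vfilter fps=25 \
  --vf-add scale=qhd:flags=spline+accurate_rnd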

On another side note, may I suggest exposing ffmpeg's -sws_flags as well? I had a peek at the code and found that all instances of the scale filter set flags explicitly. What about constructing the ffmpeg command in a way that leverages -sws_flags instead?

ffmpeg <input-params> -i <input> <output-params> -sws_flags <--sws-flags> -vf scale=<resolution>,<other filters> ...

This way all instances of scale would automatically use the flags specified by -sws_flags. That would essentially open up the possibility of using different scaling algorithms for the VMAF calculation as well. While the Netflix recommendation states bicubic as the best approach when the scaler of the viewing equipment is unknown, I for one do know that my viewing program (mpv) has great upscaling capabilities, so to match my scenario I would like to be able to use -sws_flags spline for the scaling in the VMAF calculation filter chain (and all other instances of the scale filter, of course). Currently I see no way of overriding bicubic as the scaler for the VMAF calculation.

WhitePeter commented 1 month ago

In your example scaling to 540p does work, but it's a side effect of another setting --vmaf-scale that isn't directly related.

--vmaf-scale <VMAF_SCALE>
Video resolution scale to use in VMAF analysis. If set, video streams will be bicupic scaled to this width during VMAF analysis. auto (default) automatically sets based on the model and input video resolution. none disables any scaling. WxH format may be used to specify custom scaling, e.g. 1920x1080.

TYPO alert: it's 'bicubic' not 'bicupic'. Do I get a gummy bear now? :) Also, it says that scaling will be to match the "width", yet the option requires both values width and height (WxH).

auto behaviour: * 1k model (default for resolutions <= 2560x1440) if width and height are less than 1728 & 972 respectively upscale to 1080p. Otherwise no scaling. * 4k model (default for resolutions > 2560x1440) if width and height are less than 3456 & 1944 respectively upscale to 4k. Otherwise no scaling.

Scaling happens after any input/reference vfilters.

[default: auto]

For VMAF comparison of 540p videos the default behaviour is to bicubic-upscale to 1080p and use the 1k model. This is/was the recommended procedure according to the vmaf devs. In particular, the 1k model will return a much higher score for <1080p videos if they are not upscaled this way. But the intention of this logic is not to scale differently sized videos.

For example, if instead of 1080p->540p you were encoding 2160p->1080p, you would find that using --reference-vfilter copy fails with "[Parsed_libvmaf_7 @ 0x7d06b0021040] input width must match.", since no additional scaling will be done automatically.

Having thought about this for a while, I now believe this is an actual bug, because --vmaf-scale does not do as advertised. In this case it just decides not to insert the last-stage scale filter:

ab-av1 crf-search --cache false --vfilter scale=hd1080 --reference-vfilter copy --samples=1 --preset 12 -i sample.4k.mkv
  00:00:09 [...] (sampling crf 32, eta 0s)
Error: ffmpeg vmaf exit code 234
----cmd-----
ffmpeg -r 24 -i /home/marcus/tmp/vidz/.ab-av1-MZ7LM0mzrO3T/sample.4k.av1.crf32.12.mkv -r 24 -i sample.4k.mkv -filter_complex [0:v]format=yuv420p10le,setpts=PTS-STARTPTS[dis];[1:v]format=yuv420p10le,copy,setpts=PTS-STARTPTS[ref];[dis][ref]libvmaf=n_threads=8 -f null -
---stderr---
...
[Parsed_libvmaf_5 @ 0x745858004e00] input width must match.
[Parsed_libvmaf_5 @ 0x745858004e00] input height must match.
[Parsed_libvmaf_5 @ 0x745858004e00] Failed to configure input pad on Parsed_libvmaf_5

As you predicted. But this is contrary to what the help for --vmaf-scale suggests:

auto (default) automatically sets based on the model and input video resolution.

auto behaviour: 1k model (default for resolutions <= 2560x1440) if width and height are less than 1728 & 972 respectively upscale to 1080p. Otherwise no scaling. 4k model (default for resolutions > 2560x1440) if width and height are less than 3456 & 1944 respectively upscale to 4k. Otherwise no scaling.

(BTW, where is that help text? I can find it by grepping the source but not in the output of ab-av1 vmaf --help? I installed the statically linked release 0.7.16. I am certain that I've read it before but can't find the place.)

So in this case, being that the source is 4k (2160p) there should be two things happening, if I understand correctly. 1) ab-av1 should select the 4k model, 2) scaling to 2160p should be happening. As can be seen from the error message, neither is the case. ab-av1 does not scale the encoded ([dis]) version to the reference resolution for some reason. It also fails to detect that the reference file is 2160p and that the 4k model is thus the appropriate choice.
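Adapting the failing command above, this is roughly the invocation I would have expected (an untested sketch; I am assuming vmaf_4k_v0.6.1 is the right model version name):

ffmpeg -r 24 -i /home/marcus/tmp/vidz/.ab-av1-MZ7LM0mzrO3T/sample.4k.av1.crf32.12.mkv -r 24 -i sample.4k.mkv -filter_complex \
  "[0:v]format=yuv420p10le,scale=3840:2160:flags=bicubic,setpts=PTS-STARTPTS[dis];[1:v]format=yuv420p10le,copy,setpts=PTS-STARTPTS[ref];[dis][ref]libvmaf=n_threads=8:model=version=vmaf_4k_v0.6.1" \
  -f null -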

While I am at it, I think this is an opportunity to outline what I think the filter chain should look like in a more general way. BTW, did you know about ffmpeg's scale2ref filter? I just stumbled upon it during my research relating to this issue. It is somewhat hidden in the HTML documentation on the website, but my local installation of ffmpeg 7.0.1 from the git release/7.0 branch says:

11.220 scale2ref

Scale (resize) the input video, based on a reference video.

See the scale filter for available options, scale2ref supports the same but uses the reference video instead of the main input as basis. scale2ref also supports the following additional constants for the w and h options: [...]

This lends itself rather nicely to the final scaling stage, almost as if it were made as a companion for the libvmaf filter. Taking the above error message, I am going to reconstruct the filter chain below in the way I think would be the correct approach, and then show how the scale2ref filter might be the better alternative in general.

[0:v]format=yuv420p10le,scale=<--vmaf-scale>:<--sws-flags>,setpts=PTS-STARTPTS[dis];[1:v]format=yuv420p10le,<--reference-vfilter>,scale=<--vmaf-scale>:<--sws-flags>,setpts=PTS-STARTPTS[ref];[dis][ref]libvmaf=n_threads=8:model=<4k model autoselected>

Now the same, leveraging scale2ref:

[0:v]format=yuv420p10le,setpts=PTS-STARTPTS[dis];[1:v]format=yuv420p10le,<--reference-vfilter>,scale=<--vmaf-scale>:<--sws-flags>,setpts=PTS-STARTPTS[ref];[dis][ref]scale2ref=flags=<--sws-flags>,libvmaf=n_threads=8:model=<4k model autoselected>

(--vmaf-scale should never be empty; that's what the help indicates anyway: if the user does not set it, ab-av1 selects the appropriate setting; scale=0:0 can be used to make scale a no-op, if that helps. --sws-flags should default to bicubic:force_original_aspect_ratio=decrease, and the size value could thus be just the string the user supplied, see below)

See how there is only one instance of <--vmaf-scale> now? And it can be --vmaf-scale=0x0 in case the reference is already the size of the model; or just don't bother detecting the size of the reference file and always set the appropriate resolution for the model. In any case the user is now free to use any combination of --vfilter, --reference-vfilter and --vmaf-scale with no side effects, since the scaling to meet the constraints of libvmaf is done after all those filters, and the scale2ref filter, with its matching pad configuration feeding directly into libvmaf, ensures that there simply cannot be a resolution mismatch. As an extra special bonus it would be fantastic to be able to use ffmpeg-utils aliases for resolutions, such as --vmaf-scale=hd2160.
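To make the scale2ref variant concrete (an untested sketch with placeholder file names; if I read the docs right, scale2ref with no w/h arguments scales its first input to the reference's size and has two output pads):

ffmpeg -i distorted.mkv -i reference.mkv -filter_complex \
  "[0:v]format=yuv420p10le,setpts=PTS-STARTPTS[dis];[1:v]format=yuv420p10le,setpts=PTS-STARTPTS[ref];[dis][ref]scale2ref=flags=bicubic[dis2][ref2];[dis2][ref2]libvmaf=n_threads=8" \
  -f null -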

One last thing, did I miss it or is there no way to get something like a debug output? I found it really hard to follow what is actually going on, almost like black box analysis, short of actually running a debugger, which I am not at all comfortable with - talk about a pig looking at the clockwork, as we say. ;-) It would be nice to be able to inspect the ffmpeg command regardless of an actual error.

In closing, I want to add that I feel unable to contribute, because Rust is way over my head and intimidating to me. Some of my suggestions sound easy in my head, since they are more or less just tweaks, but I just cannot find a way in. If you too think they are easy, maybe drop me a hint where to start tinkering in the code base?

alexheretic commented 1 month ago

On another side note, may I suggest exposing ffmpeg's -sws_flags as well?

You can set most ffmpeg args with --enc or --enc-input.
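E.g. (from memory, --enc opt=val is passed through as an -opt val output option on the ffmpeg encode command):

ab-av1 crf-search -i sample.mkv --min-vmaf=89 --enc sws_flags=spline
# should end up as "-sws_flags spline" on the ffmpeg encode invocation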

TYPO alert: it's 'bicubic' not 'bicupic'

Also, it says that scaling will be to match the "width", yet the option requires both values width and height (WxH).

Thanks, will fix.

(BTW, where is that help text? I can find it by grepping the source but not in the output of ab-av1 vmaf --help?

That help command should work, e.g. 0.7.16

$ ab-av1 vmaf --help
Full VMAF score calculation, distorted file vs reference file.
Works with videos and images.

* Auto sets model version (4k or 1k) according to resolution.
* Auto sets `n_threads` to system threads.
* Auto upscales lower resolution videos to the model.
* Converts distorted & reference to appropriate format yuv streams before passing to vmaf.

Usage: ab-av1 vmaf [OPTIONS] --reference <REFERENCE> --distorted <DISTORTED>

Options:
      --reference <REFERENCE>
          Reference video file

      --distorted <DISTORTED>
          Re-encoded/distorted video file

      --vmaf <VMAF_ARGS>
          Additional vmaf arg(s). E.g. --vmaf n_threads=8 --vmaf n_subsample=4

          By default `n_threads` is set to available system threads.

          Also see https://ffmpeg.org/ffmpeg-filters.html#libvmaf.

      --vmaf-scale <VMAF_SCALE>
          Video resolution scale to use in VMAF analysis. If set, video streams will be bicupic scaled to this width
          during VMAF analysis. `auto` (default) automatically sets based on the model and input video resolution.
          `none` disables any scaling. `WxH` format may be used to specify custom scaling, e.g. `1920x1080`.

          auto behaviour: * 1k model (default for resolutions <= 2560x1440) if width and height are less than 1728 & 972
          respectively upscale to 1080p. Otherwise no scaling. * 4k model (default for resolutions > 2560x1440) if width
          and height are less than 3456 & 1944 respectively upscale to 4k. Otherwise no scaling.

          Scaling happens after any input/reference vfilters.

          [default: auto]
...

So in this case, being that the source is 4k (2160p) there should be two things happening, if I understand correctly. 1) ab-av1 should select the 4k model

The size the auto behaviour is referring to isn't 4k, it is 1080p. The docs only refer to a single size because there is currently only 1 size to consider and that is the size of the distorted video or the source after vfilters. The vmaf in ab-av1 is currently only designed to compare videos of the same dimensions just like regular vmaf. I'll add some text to clarify this.

One last thing, did I miss it or is there no way to get something like a debug output?

Yep, there is since v0.7.15, mentioned in the changelog. Use env var RUST_LOG=ab_av1=debug.
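E.g.:

RUST_LOG=ab_av1=debug ab-av1 crf-search --min-vmaf=89 -i sample.mkv
# debug logging should show what ab-av1 is doing, including the ffmpeg invocations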

WhitePeter commented 1 month ago

(BTW, where is that help text? I can find it by grepping the source but not in the output of ab-av1 vmaf --help?

That help command should work, e.g. 0.7.16 $ ab-av1 vmaf --help

Ahh, now I see: -h gives me the short help and --help the more elaborate help text. Well, that's new and, again, rather unexpected. I am guessing you used some framework for the argument parser and help generator? I am not really convinced that this is a good innovation, since people (I) tend to be lazy and avoid keystrokes where they (I) can, so calling prg -h and expecting the same output as --help seems not unreasonable. And the shell-completion defense won't stick either, since that's still more keystrokes than the two needed for -h. Other projects have similar options but they are more explicit, which I like more, e.g. x264 --fullhelp. That way at least I know there is more information somewhere; nothing indicates the difference between -h and --help. In short: where do I send my bug report? ;-)

So in this case, being that the source is 4k (2160p) there should be two things happening, if I understand correctly. 1) ab-av1 should select the 4k model

The size the auto behaviour is referring to isn't 4k, it is 1080p. The docs only refer to a single size because there is currently only 1 size to consider and that is the size of the distorted video or the source after vfilters. The vmaf in ab-av1 is currently only designed to compare videos of the same dimensions just like regular vmaf. I'll add some text to clarify this.

This seems to contradict the very help text you just quoted, color me confused. It says:

auto (default) automatically sets based on the model and input video resolution.

There are only three models I know of that come bundled with libvmaf: 1k, 4k and phone. The ones ab-av1 seems to consider are the first two:

auto behaviour: 1k model (default for resolutions <= 2560x1440) if width and height are less than 1728 & 972 respectively upscale to 1080p. Otherwise no scaling. 4k model (default for resolutions > 2560x1440) if width and height are less than 3456 & 1944 respectively upscale to 4k. Otherwise no scaling.

I read this as: ab-av1 automatically chooses the vmaf model based on reference resolution (before any filtering, since that seems easiest with ffprobe's help): 1k is default for resolution <= 2560x1440, 4k for >2560x1440, and that also sets the reference scale to 1k or 4k respectively.
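Detecting the reference resolution for that heuristic seems straightforward with ffprobe, e.g. (a rough sketch):

# probe the reference's width/height, prints e.g. "1920x1080"
ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 reference.mkv
# <= 2560x1440 -> 1k model, scale to 1920x1080
#  > 2560x1440 -> 4k model, scale to 3840x2160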

Also, about the size to consider after vfilters: why bother? How do you even know the size after vfilters without inspecting ffmpeg internals or re-implementing the filter chain? I think it is a lot simpler to just use scale2ref as outlined above (default: vmaf_scale={model_resolution}, as suggested by the help text) and be done with it (and maybe add ts_sync_mode=nearest to the libvmaf options, see the edited post above). You don't need to know anything about the input video size this way; that's the beauty of it, even if you decide to keep disagreeing with me ("resistance is futile" ;-) ) on the correct default vmaf_scale value (me: model resolution, you: reference video resolution / post --vfilter resolution). If a user wants to use a different reference resolution, say 1440p on the 4k model, that can be achieved manually by setting --vmaf-scale, which would also disable the auto-selection heuristic for the vmaf model, so the user should then also set --vmaf=model={m}[:<more options>].

One last thing, did I miss it or is there no way to get something like a debug output?

Yep, there is since v0.7.15, mentioned in the changelog. Use env var RUST_LOG=ab_av1=debug.

Thanks! May I kindly ask for a --debug or -v {debug,verbose,info,warning,error} nonetheless? Because, now that you pointed me to it, I do remember reading about it and using it, but over the weekend I must have forgotten all about it. Picture me as the scatter-brained professor, minus the professor. :-)