Closed kornelski closed 3 years ago
Commonly, video specifications do not mandate how operations such as chroma upsampling are performed. This is because such operations may be handled by the display pipeline, and different implementors may already have their own filter implementations, which may also be shared across multiple specifications. Making this normative could render those implementations non-conformant for relatively little benefit. Therefore, mandating such operations would be undesirable in this context.
However, it was agreed that a "recommended" filter provided as a hint (i.e. metadata) to the decoder could be useful and might achieve the desired intent, although only for decoders capable of recognizing and handling such a hint. Such hints would be completely optional, and a decoder would not need to act on them. The group will try to define such metadata either as a new OBU or using the T.35 mechanism. Similar metadata also exists in other video specifications (see the chroma resampling filter hint SEI message in the HEVC/H.265 specification).
The constraints and expectations for video are different from those for images:
For images it's much more important to have predictable output. Images are used at higher quality/bitrate than videos, and at high quality these unavoidable differences in the chroma upsampling algorithm are more destructive than the lossy compression itself.
All popular image formats display more or less the same on every device. JPEG had small variations in upsampling algorithms between libjpeg v6 and v9, but those were still minor compared to the jarring difference between nearest-neighbor and any form of smoothing that AVIF allows. The 1992 JPEG spec took the same ambivalent, video-like approach to chroma, but this was recognized as a shortcoming and fixed in the 2015 revision (ISO/IEC 18477).
Metadata in AVIF is already bloated compared to web-oriented formats. A bunch of MPEG boxes on multi-megabyte videos or high-res images from digital cameras may seem like nothing, but for Web use-cases images can be as small as 1–2 KB of payload, and 300+ bytes of AVIF metadata on such images is heavy. On the Web we micro-optimize for things like the initial TCP window size, so these bytes do add up. I would prefer AVIF to shrink its metadata (#95), not add more.
Images are decoded in software, even where decoding hardware is available. For example, web browsers intentionally avoid hardware acceleration for image decoding, because they value the ability to decode many images in parallel, in an incremental fashion (#102), and to have precise control over the result. Mobile browsers, for instance, keep subsampled YCbCr in GPU RAM and perform RGB conversion on the fly at compositing time, in order to save video RAM.
Images are not tied to any weird fixed display pipelines. There's no need for a continuous high-efficiency display channel. They won't be piped through analog video signals (at least not directly, at 1:1 resolution). They don't need things like the video decoding overlays used on early-2000s GPUs. They're decoded and converted once, and YCbCr-to-RGB conversion is basically free compared to the other costs of decoding AV1.
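To illustrate why the conversion step is cheap: it is a handful of multiply-adds per pixel. Below is a minimal Python sketch, assuming limited-range ("studio swing") 8-bit BT.709 coefficients; in practice the matrix must match the CICP values signalled in the file, so treat this as an illustration, not a reference implementation.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one limited-range 8-bit BT.709 YCbCr sample to 8-bit RGB.

    Coefficients are the standard BT.709 ones (assumption for this
    sketch); a real decoder picks the matrix from the signalled CICP.
    """
    yf = (y - 16) / 219.0        # normalize luma to 0..1
    cbf = (cb - 128) / 224.0     # normalize chroma to -0.5..0.5
    crf = (cr - 128) / 224.0
    r = yf + 1.5748 * crf
    g = yf - 0.1873 * cbf - 0.4681 * crf
    b = yf + 1.8556 * cbf
    clamp = lambda v: max(0, min(255, round(v * 255)))
    return clamp(r), clamp(g), clamp(b)
```

For example, limited-range white `(235, 128, 128)` maps to `(255, 255, 255)`, and black `(16, 128, 128)` maps to `(0, 0, 0)`.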
I expect subsampled chroma in AV1 to be used only by mistake, or for slightly faster, slightly less lossy conversion of subsampled JPEG or WebP to AVIF. Both JPEG and WebP use smoothing, so 1:1 reuse of their chroma would require AVIF to apply close-enough smoothing (even a mismatched chroma sample position doesn't make as much of a difference as nearest-neighbor used instead of bilinear upscaling).
My suggestion is to define a single recommended upsampling filter for chroma, and make it a SHOULD in the spec.
This would avoid extra metadata in AVIF. It would allow software decoders to conform by supporting just that one method instead of adding more code paths. The multitude of color spaces and subsampling variations allowed in AV1 is already painful to support, and forces software to create fast paths for the most common combinations of settings. Any new option risks doubling the amount of code in the fast paths, which is highly undesirable. Code size is currently a problem for AVIF adoption. Support for exotic, inflexible YUV video outputs isn't even on the radar of image-decoder implementers.
This was discussed during the last meeting, but we did not have time to fully come to a conclusion. We will discuss it again at the next meeting. Some comments that came up during the discussion, in no particular order:
This has been extensively discussed during the last couple of meetings with the conclusion that we think it is unlikely that implementations of AV1 decoders/renderers can converge on implementing a common upsampling chroma filter, and therefore recommending one is difficult. Additionally, we would have to recommend at least one filter for HDR and one for SDR.
Outside of AVIF, the following documents contain best practices for dealing with subsampled chroma:
SDR: https://www.itu.int/wftp3/av-arch/jvt-site/2003_09_SanDiego/JVT-I019r2.doc
HDR: https://www.itu.int/rec/T-REC-H.Sup15-201701-I
The conclusion was also that content creators should ideally avoid subsampled chroma for still images if at all possible. This is extra important for PQ HDR images.
Which implementation(s) of AVIF encoding have a mature YUV444 model (for example with quantization matrices fitted for YUV444) that one can use for quality evaluation of AVIF YUV444?
I can't find whether an upsampling filter is specified for either AV1 or AVIF. It seems they only specify the chroma sample position, but not how to interpolate the samples.
Differences in upsampling filters can cause significantly different visual artifacts, and make it impossible for encoders to optimize visual quality and aim for (nearly) pixel-perfect decoding.
I suggest following what JPEG did in the JPEG XT spec, and requiring a triangle filter when upscaling subsampled chroma channels.
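For illustration, a triangle-filter 2x horizontal upsampling of one chroma row can be sketched as below. This assumes chroma samples centered between luma pairs (the JPEG convention, which libjpeg implements as "fancy upsampling"); the exact weights depend on the signalled chroma sample position, so this is a sketch of the technique, not the normative filter from any spec.

```python
def upsample_row_2x(chroma):
    """2x horizontal chroma upsampling with a triangle (linear) filter.

    Assumes chroma centered between luma pairs: each output sample
    mixes the nearest chroma sample at weight 3/4 with the
    next-nearest at weight 1/4, with rounding. Edges replicate the
    border sample. Weights would differ for co-sited chroma.
    """
    n = len(chroma)
    out = []
    for i, c in enumerate(chroma):
        left = chroma[i - 1] if i > 0 else c        # replicate at edges
        right = chroma[i + 1] if i < n - 1 else c
        out.append((3 * c + left + 2) // 4)         # output left of c
        out.append((3 * c + right + 2) // 4)        # output right of c
    return out
```

For example, `upsample_row_2x([0, 100])` yields `[0, 25, 75, 100]`, a smooth ramp rather than the hard step `[0, 0, 100, 100]` that nearest-neighbor would produce.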