libavif cannot encode a grid if image dimensions are not multiples of cell size

y-guyon commented 2 years ago

Note: This is about AVIF grids/cells (used for incremental decoding), not about AV1 tiles.

Issue

libavif does not provide a way through avifenc or its API to encode an image as a grid of multiple cells if the image dimensions are not multiples of the cell dimensions.

For example avifenc --grid 4x2 on an image of 4002 by 1998 will fail.
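The failing case can be illustrated with a minimal sketch, assuming the only condition is that the width is a multiple of the column count and the height a multiple of the row count. The helper names `splits_evenly` and `cell_size` are hypothetical and are not part of libavif or avifenc, which may apply further checks.

```python
def splits_evenly(width, height, cols, rows):
    """Return True if the image can be cut into cols x rows equal cells."""
    return width % cols == 0 and height % rows == 0


def cell_size(width, height, cols, rows):
    """Return (cell_width, cell_height), or None when the split is uneven."""
    if not splits_evenly(width, height, cols, rows):
        return None
    return width // cols, height // rows


# --grid 4x2 means 4 columns and 2 rows.
print(cell_size(4002, 1998, 4, 2))  # None: 4002 is not a multiple of 4
print(cell_size(4000, 1998, 4, 2))  # (1000, 999)
```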

Specification analysis

HEIF (ISO 23008-12:2017) and HEIF (ISO 23008-12:2021) say the same:

6.6.2.3.1 Image grid derivation Definition

An item with an item_type value of 'grid' defines a derived image item whose reconstructed image is formed from one or more input images in a given grid order within a larger canvas.

The input images are inserted in row-major order, top-row first, left to right, in the order of SingleItemTypeReferenceBox of type 'dimg' for this derived image item within the ItemReferenceBox. In the SingleItemTypeReferenceBox of type 'dimg', the value of from_item_ID identifies the derived image item of type 'grid', the value of reference_count shall be equal to rows*columns, and the values of to_item_ID identify the input images. All input images shall have exactly the same width and height; call those tile_width and tile_height. The tiled input images shall completely “cover” the reconstructed image grid canvas, where tile_width*columns is greater than or equal to output_width and tile_height*rows is greater than or equal to output_height.

The reconstructed image is formed by tiling the input images into a grid with a column width (potentially excluding the right-most column) equal to tile_width and a row height (potentially excluding the bottom-most row) equal to tile_height, without gap or overlap, and then trimming on the right and the bottom to the indicated output_width and output_height.

NOTE 1 If the desired input images are not of a consistent size, then derived image items that scale or crop them, as needed to make them consistent, can be used; other specifications can, however, restrict whether derived image items are permissible as input to the image grid derived image item. This document specifies cropping in 6.5.8 and scaling in 6.5.13.

NOTE 2 File writers need to be careful when removing an item that is marked as an input image of an image grid item, as the content of the image grid item may need to be rewritten.
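Read literally, the "shall" constraints in the quoted text could be checked as in the sketch below. The function `check_grid` and its error strings are hypothetical, written only to illustrate the wording of the specification; this is not how libavif or the Compliance Warden implement their checks.

```python
def check_grid(tile_sizes, columns, rows, output_width, output_height):
    """Return a list of violated constraints (empty if none).

    tile_sizes is a non-empty list of (width, height) pairs in row-major order.
    """
    errors = []
    if len(tile_sizes) != rows * columns:
        errors.append("reference_count is not rows*columns")
    if len(set(tile_sizes)) > 1:
        errors.append("input images differ in size")
    # tile_width and tile_height are taken from the first input image.
    tile_width, tile_height = tile_sizes[0]
    if tile_width * columns < output_width:
        errors.append("tiles do not cover output_width")
    if tile_height * rows < output_height:
        errors.append("tiles do not cover output_height")
    return errors


# Four identical 100x100 tiles covering a 108x108 canvas: no violation.
print(check_grid([(100, 100)] * 4, 2, 2, 108, 108))
# Smaller right-most and bottom-most tiles violate the "same size" rule.
print(check_grid([(100, 100), (8, 100), (100, 8), (8, 8)], 2, 2, 108, 108))
```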

My interpretation of the highlighted sentence (the one beginning "The reconstructed image is formed by tiling the input images into a grid"):

The input images in the right-most column potentially have a different width than tile_width.
The input images in the bottom-most row potentially have a different height than tile_height.
No matter if they do or do not, after reconstructing the grid, trim it to output_width and output_height.

This contradicts the following:

All input images shall have exactly the same width and height

Example

An encoded AVIF bitstream with the following AV1 frames and the grid properties output_width and output_height set to 108:

[100x100] [8x100]
[100x  8] [8x  8]

Can only be decoded by libavif if the ispe properties are all the same, so for example:

[100x100] [100x100]
[100x100] [100x100]

libavif will then:

Decode each AV1 frame
Resize each AV1 frame to the item's ispe dimensions (hence 100x100), if not already matching that size
Combine all AVIF tiles into the reconstructed grid image (here 200x200)
Trim the reconstructed grid image to output_width x output_height (here 108x108)

The file itself is valid but the output is not what was intended (the right and bottom parts are stretched and cut). Removing the ispe constraint would make that possible.
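The effect of these steps on the example can be modeled with dimensions only. This is a rough sketch of the behavior described in this issue, not libavif code; `model_grid_decode` is a hypothetical name.

```python
def model_grid_decode(frame_sizes, ispe, columns, rows, output_width, output_height):
    """Model the resize, combine and trim steps using sizes only.

    Returns (scales, canvas, trimmed), where scales holds the horizontal and
    vertical stretch applied to each frame to reach the ispe dimensions.
    """
    ispe_w, ispe_h = ispe
    # Resize: every decoded frame is scaled to the shared ispe size.
    scales = [(ispe_w / w, ispe_h / h) for (w, h) in frame_sizes]
    # Combine: the canvas is made of equally sized cells.
    canvas = (ispe_w * columns, ispe_h * rows)
    # Trim: keep only the top-left output_width x output_height area.
    trimmed = (min(canvas[0], output_width), min(canvas[1], output_height))
    return scales, canvas, trimmed


frames = [(100, 100), (8, 100), (100, 8), (8, 8)]
scales, canvas, trimmed = model_grid_decode(frames, (100, 100), 2, 2, 108, 108)
print(scales)   # the 8-pixel-wide or 8-pixel-high frames are stretched 12.5 times
print(canvas)   # (200, 200)
print(trimmed)  # (108, 108)
```

In this model the 8 output columns kept from the right-hand cells come from a frame stretched by a factor of 12.5, which is why the result looks stretched and cut.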

Option 1: allow different dimensions for right-most and bottom-most cells

The implementation of https://github.com/AOMediaCodec/libavif/pull/1140 matches this option.

Issues:

The libavif encoding and decoding sides are more permissive.
The specification will likely require a clarifying amendment that it is allowed.
The specification will likely require a clarifying amendment about whether right-most and bottom-most cells can be bigger than tile_width/tile_height, or just smaller.

AVIF files encoded with this modified avifenc do not pass the Compliance Warden (640x481 jpg image encoded with avifenc --yuv 444 --grid 2x2):

+--------------------------------------+
| heif validation |
+--------------------------------------+

Specification description: HEIF - ISO/IEC 23008-12 - 2nd Edition N18310

[heif][Rule #7] Error: Tiles [ItemId]: all input images shall have exactly the same width and height
but found 1x320 for itemID=256 in 'ispe'
[heif][Rule #7] Error: Tiles [ItemId]: all input images shall have exactly the same width and height
but found 1x320 for itemID=256 in 'ispe'
[heif][Rule #8] Error: grid (itemID=1) height(4) not covered by tile (ItemId=481) height(225)*numRows(2)=450
[heif][Rule #8] Error: grid (itemID=1) height(5) not covered by tile (ItemId=481) height(225)*numRows(2)=450

Note: The last two errors look suspicious because the image was correctly decoded with avifdec.

However, even with libavif at head cd0bb358f83d01867f0fa53079470043618c9af5, encoding a 640x480 png with avifenc --grid 2x1 --yuv 444 did not pass the Compliance Warden either ([miaf][Rule #5] Error: construction_method=-1 on a derived image item (ID=1)).

Option 2: only trim

If we still want to allow encoding images with dimensions that cannot be a convenient multiple of tile_width and tile_height in libavif, there are two ways, both based on enforcing all cells to share the same dimensions, and then cropping to output_width and output_height.

Issues:

Privacy concern: we will encode a part of the image that will not be visible by default. It will raise the same questions as the clap property did.
Encoded files will be bigger for the same content.
The specification will likely require a clarifying amendment that it is forbidden (remove the "potentially excluding..." parts).

Option 2.1: add/modify avifEncoderAddImageGrid() API

This will put the burden of generating cells of the same size on the user. Also there is currently no way to pass the desired output_width and output_height to the avifEncoderAddImageGrid() function or as a flag to avifenc.

Issues:

Breaking API change.

Alternatives for avoiding a breaking API change:

Add a new function next to avifEncoderAddImageGrid() with the above behavior. Note: avifEncoderAddImageGrid() is already not that convenient. avifenc has some dedicated code to slice an input image into cells, but currently it only does so if the input image has dimensions that are multiples of tile_width and tile_height. It might be helpful to improve that code and move it into a new function available in avif.h.
Add output_width and output_height to avifEncoder. Could default to 0. The advantage of this solution is that it could apply to other scenarios than grids only (provides an API to rescale images at decoding).

Option 2.2: keep the same API but fix the issue internally

Basically add a step between avifEncoderAddImageGrid() and avifEncoderAddImageInternal() to convert the "imperfect" grid into a "perfect" grid.

Issues:

Right-most and bottom-most cells will need to be copied into bigger cells (an extra copy of the input image is fine).
The extra padding will need to be set to some value. Choosing that value is not trivial if we take encoding efficiency into account (not a big issue but requires thought and implementation).
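As a rough illustration of the conversion from an "imperfect" to a "perfect" grid, the sketch below picks the smallest equal cell size that covers the image and reports how much padding that implies. The name `perfect_grid` is hypothetical, and the sketch ignores any additional constraint on cell dimensions (for example for subsampled chroma).

```python
def perfect_grid(width, height, cols, rows):
    """Return cell size, padded canvas size and padding for an equal-cell grid."""
    cell_w = -(-width // cols)   # ceiling division
    cell_h = -(-height // rows)
    canvas_w, canvas_h = cell_w * cols, cell_h * rows
    return {
        "cell": (cell_w, cell_h),
        "canvas": (canvas_w, canvas_h),
        "pad_right": canvas_w - width,
        "pad_bottom": canvas_h - height,
        # The grid's output_width and output_height stay at the original
        # size, so the padding is trimmed away at decoding.
        "output": (width, height),
    }


print(perfect_grid(4002, 1998, 4, 2))
# cell (1001, 999), canvas (4004, 1998), pad_right 2, pad_bottom 0
```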

joedrago commented 2 years ago

I feel like this doesn't require any changes in the library, but just a feature in avifenc itself. Grid cells are supposed to be the same size, so I think the library is behaving properly, and I think they expect you to use a crop rect (clap) to dial in the correct size.

This avifenc feature would simply round up the image's size to the next grid cell multiple, and then clap the resultant encoding back to the original dimensions. I don't believe this is one of the listed Options.

wantehchang commented 2 years ago

Yannis,

Thank you for the analysis of the spec. I agree there is a contradiction. I think the intention is "All input images shall have exactly the same width and height." Given this assumption, we can fix the contradiction by removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".

This assumption implies Option 2. We can try Option 2.2 first. For padding values it is common to pad with border pixels.

If I understand it correctly, Option 2.1 and Option 2.2 are not mutually exclusive, so we can still do Option 2.1 in the future.
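Padding with border pixels can be sketched on a single plane stored as a list of rows. This is an illustrative assumption about what such padding could look like, not the code of any libavif change; `pad_with_border` is a hypothetical helper and expects a non-empty plane and a target size at least as large as the input.

```python
def pad_with_border(plane, new_width, new_height):
    """Extend a plane to new_width x new_height by repeating edge samples."""
    # Repeat the last sample of each row to the right.
    padded = [row + [row[-1]] * (new_width - len(row)) for row in plane]
    # Repeat the last (already widened) row downwards.
    padded += [list(padded[-1]) for _ in range(new_height - len(plane))]
    return padded


plane = [[1, 2],
         [3, 4]]
for row in pad_with_border(plane, 4, 3):
    print(row)
# [1, 2, 2, 2]
# [3, 4, 4, 4]
# [3, 4, 4, 4]
```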

y-guyon commented 2 years ago

> I feel like this doesn't require any changes in the library, but just a feature in avifenc itself.

If users of the libavif API would like to generate incrementally decodable images of any size, it would be convenient for them to have a function doing the dirty work of splitting and padding. If we do this work in avifenc, we might as well make it accessible in avif.h "for free".

> Grid cells are supposed to be the same size, so I think the library is behaving properly

It matches one interpretation of the specification, yes.

When you store (100+8)x(100+8) AV1 samples, you can only end up with a (100+100)x(100+100) grid item output, where some tiles were deformed by scaling up. The grid item is then cropped to 108x108 (with output_width/height or clap) so some scaled AV1 samples are discarded.
And on the other hand, if you want the correct output for a 108x108 input, it must be padded to 200x200 before encoding, and cropped at decoding to 108x108.
I was mainly pointing out the oddness of being able to store exactly 108x108 AV1 samples in a valid grid AVIF without being able to correctly decode them into a 108x108 image.

> I think they expect you to use a crop rect (clap) to dial in the correct size.
>
> This avifenc feature would simply round up the image's size to the next grid cell multiple, and then clap the resultant encoding back to the original dimensions. I don't believe this is one of the listed Options.

Using the clap feature is unnecessary. As mentioned in section 6.6.2.3.1, "the reconstructed image is formed by [...] trimming on the right and the bottom to the indicated output_width and output_height". Since output_width and output_height are encoded with every grid box, there is no need for an additional clap box which would have the same effect. The clap box has the same odd dimensions constraints on subsampled chroma samples as grid if I remember correctly.

> We can try Option 2.2 first. For padding values it is common to pad with border pixels.

The only remaining question is the privacy one. If I remember correctly, the clap property is ignored in Chrome for this reason (meaning, always display all pixels). How would it be different for grid? Is it related to the use of "should" wording for the former and "shall" for the latter?

> If I understand it correctly, Option 2.1 and Option 2.2 are not mutually exclusive, so we can still do Option 2.1 in the future.

Sure, if you are talking about the second alternative of option 2.1 (add output_width and output_height to avifEncoder).

wantehchang commented 2 years ago

The cropping done by a grid image is limited to trimming on the right and the bottom. This is why the privacy concern is less serious than with clap. But the cropping of a grid image was also brought up in the discussions of the privacy issues of clap.

y-guyon commented 2 years ago

> We can try Option 2.2 first. For padding values it is common to pad with border pixels.

See https://github.com/AOMediaCodec/libavif/pull/1143.
