AOMediaCodec / libavif

libavif - Library for encoding and decoding .avif files

libavif cannot encode a grid if image dimensions are not multiples of cell size #1141

Closed y-guyon closed 1 year ago

y-guyon commented 2 years ago

Note: This is about AVIF grids/cells (used for incremental decoding), not about AV1 tiles.

Issue

libavif does not provide a way through avifenc or its API to encode an image as a grid of multiple cells if the image dimensions are not multiples of the cell dimensions.

For example avifenc --grid 4x2 on an image of 4002 by 1998 will fail.
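A minimal sketch of the kind of divisibility check that makes this example fail. The helper name `grid_splits_evenly` is hypothetical and is not a libavif function. It assumes `--grid 4x2` means 4 columns by 2 rows, although the example fails under either reading since 4002 is not a multiple of 4 and 1998 is not a multiple of 4 either.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper (not part of libavif): returns true when the image can be
// split into cols x rows cells that all have exactly the same dimensions.
static bool grid_splits_evenly(uint32_t width, uint32_t height, uint32_t cols, uint32_t rows)
{
    assert(cols > 0 && rows > 0);
    return (width % cols == 0) && (height % rows == 0);
}
```

With these assumptions, `grid_splits_evenly(4002, 1998, 4, 2)` is false because 4002 % 4 is 2.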

Specification analysis

HEIF (ISO 23008-12:2017) and HEIF (ISO 23008-12:2021) say the same:

6.6.2.3.1 Image grid derivation Definition

An item with an item_type value of 'grid' defines a derived image item whose reconstructed image is formed from one or more input images in a given grid order within a larger canvas.

The input images are inserted in row-major order, top-row first, left to right, in the order of SingleItemTypeReferenceBox of type 'dimg' for this derived image item within the ItemReferenceBox. In the SingleItemTypeReferenceBox of type 'dimg', the value of from_item_ID identifies the derived image item of type 'grid', the value of reference_count shall be equal to rows*columns, and the values of to_item_ID identify the input images. All input images shall have exactly the same width and height; call those tile_width and tile_height. The tiled input images shall completely “cover” the reconstructed image grid canvas, where tile_width*columns is greater than or equal to output_width and tile_height*rows is greater than or equal to output_height.

The reconstructed image is formed by tiling the input images into a grid with a column width (potentially excluding the right-most column) equal to tile_width and a row height (potentially excluding the bottom-most row) equal to tile_height, without gap or overlap, and then trimming on the right and the bottom to the indicated output_width and output_height.

NOTE 1 If the desired input images are not of a consistent size, then derived image items that scale or crop them, as needed to make them consistent, can be used; other specifications can, however, restrict whether derived image items are permissible as input to the image grid derived image item. This document specifies cropping in 6.5.8 and scaling in 6.5.13.

NOTE 2 File writers need to be careful when removing an item that is marked as an input image of an image grid item, as the content of the image grid item may need to be rewritten.
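The "cover" requirement quoted above can be written as a small check. This is only a sketch: the `GridInfo` struct and the function name are hypothetical, with field names borrowed from the spec text rather than from avif.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical struct for illustration; field names follow the spec wording.
typedef struct
{
    uint32_t rows, columns;
    uint32_t tile_width, tile_height;
    uint32_t output_width, output_height;
} GridInfo;

// True when tile_width*columns >= output_width and tile_height*rows >= output_height,
// i.e. the tiled input images completely cover the reconstructed canvas.
static bool grid_covers_output(const GridInfo * g)
{
    assert(g != NULL && g->rows > 0 && g->columns > 0);
    // 64-bit products avoid overflow for large 32-bit dimensions.
    const uint64_t canvas_width = (uint64_t)g->tile_width * g->columns;
    const uint64_t canvas_height = (uint64_t)g->tile_height * g->rows;
    return canvas_width >= g->output_width && canvas_height >= g->output_height;
}
```

Note that this check only captures the "greater than or equal" condition. It says nothing about whether the right-most column or bottom-most row may have a different size, which is the ambiguity discussed here.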

My interpretation of the highlighted sentence:

This contradicts the following:

All input images shall have exactly the same width and height

Example

An encoded AVIF bitstream with the following AV1 frames and the grid properties output_width and output_height set to 108:

[100x100] [8x100]
[100x  8] [8x  8]

Can only be decoded by libavif if the ispe properties are all the same, so for example:

[100x100] [100x100]
[100x100] [100x100]

libavif will then:

  1. Decode each AV1 frame
  2. Resize each AV1 frame to the item's ispe dimensions (hence 100x100), if not already matching that size
  3. Combine all AVIF tiles into the reconstructed grid image (here 200x200)
  4. Trim the reconstructed grid image to output_width x output_height (here 108x108)

The file itself is valid but the output is not what was intended (the right and bottom parts are stretched and cut). Removing the ispe constraint would make the intended output possible.
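To make the arithmetic of the example concrete, here is a sketch that computes how many pixels of a given cell survive the final trim along one axis. The function name `visible_extent` is hypothetical and not a libavif API; it assumes all cells share the same size, as libavif requires.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: number of pixels of the cell at position `index`
// (column or row) that remain after trimming the canvas to `output_size`.
static uint32_t visible_extent(uint32_t tile_size, uint32_t index, uint32_t output_size)
{
    assert(tile_size > 0);
    const uint64_t start = (uint64_t)tile_size * index; // offset of the cell in the canvas
    if (start >= output_size) {
        return 0; // the cell lies entirely in the trimmed area
    }
    const uint64_t remaining = output_size - start;
    return remaining < tile_size ? (uint32_t)remaining : tile_size;
}
```

For 100x100 cells and an output size of 108, the first column keeps all 100 pixels and the second keeps only 8. Since the 8-pixel-wide frame was first scaled up to 100 pixels, those 8 visible pixels show only a stretched fraction of it.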

Option 1: allow different dimensions for right-most and bottom-most cells

The implementation of https://github.com/AOMediaCodec/libavif/pull/1140 matches this option.

Issues:

Option 2: only trim

If we still want to allow encoding images with dimensions that cannot be a convenient multiple of tile_width and tile_height in libavif, there are two ways, both based on enforcing all cells to share the same dimensions, and then cropping to output_width and output_height.

Issues:

Option 2.1: add/modify avifEncoderAddImageGrid() API

This will put the burden of generating cells of the same size on the user. Also there is currently no way to pass the desired output_width and output_height to the avifEncoderAddImageGrid() function or as a flag to avifenc.

Issues:

Alternatives for avoiding a breaking API change:

Option 2.2: keep the same API but fix the issue internally

Basically add a step between avifEncoderAddImageGrid() and avifEncoderAddImageInternal() to convert the "imperfect" grid into a "perfect" grid.

Issues:


Somewhat related: https://github.com/MPEGGroup/MIAF/issues/11

joedrago commented 2 years ago

I feel like this doesn't require any changes in the library, but just a feature in avifenc itself. Grid cells are supposed to be the same size, so I think the library is behaving properly, and I think they expect you to use a crop rect (clap) to dial in the correct size.

This avifenc feature would simply round up the image's size to the next grid cell multiple, and then clap the resultant encoding back to the original dimensions. I don't believe this is one of the listed Options.
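A rough sketch of the rounding involved, assuming a hypothetical helper `cell_size_for` (not an avifenc or libavif function). The `align` parameter is also an assumption: it lets the caller force even cell dimensions, which may be needed for subsampled chroma.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: smallest cell size, rounded up to a multiple of `align`,
// such that `cell_count` cells cover `image_size` pixels along one axis.
static uint32_t cell_size_for(uint32_t image_size, uint32_t cell_count, uint32_t align)
{
    assert(image_size > 0 && cell_count > 0 && align > 0);
    uint64_t cell = ((uint64_t)image_size + cell_count - 1) / cell_count; // ceiling division
    cell = (cell + align - 1) / align * align;                            // round up to alignment
    assert(cell <= UINT32_MAX);
    return (uint32_t)cell;
}
```

For a 4002-pixel width split into 4 columns this gives 1001-pixel cells (1002 with `align` set to 2), so the padded canvas is 4004 pixels wide and has to be cropped back to 4002.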

wantehchang commented 2 years ago

Yannis,

Thank you for the analysis of the spec. I agree there is a contradiction. I think the intention is "All input images shall have exactly the same width and height." Given this assumption, we can fix the contradiction by removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".

This assumption implies Option 2. We can try Option 2.2 first. For padding values it is common to pad with border pixels.
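Padding with border pixels could look like the following sketch for a single 8-bit plane. The function `pad_plane_replicate` is hypothetical and is not how libavif necessarily implements it; high bit depths and chroma planes are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: copies an 8-bit plane into a larger destination plane,
// filling the extra area on the right and bottom by replicating the last
// column and the last row of the source.
static void pad_plane_replicate(const uint8_t * src, uint32_t src_w, uint32_t src_h, size_t src_stride,
                                uint8_t * dst, uint32_t dst_w, uint32_t dst_h, size_t dst_stride)
{
    assert(src_w > 0 && src_h > 0 && dst_w >= src_w && dst_h >= src_h);
    for (uint32_t y = 0; y < dst_h; ++y) {
        // Rows below the source reuse its last row.
        const uint32_t sy = (y < src_h) ? y : (src_h - 1);
        const uint8_t * src_row = src + (size_t)sy * src_stride;
        uint8_t * dst_row = dst + (size_t)y * dst_stride;
        memcpy(dst_row, src_row, src_w);
        // Columns to the right of the source reuse its last pixel in this row.
        memset(dst_row + src_w, src_row[src_w - 1], dst_w - src_w);
    }
}
```

Replicating edge pixels avoids introducing a sharp edge at the image border, which would otherwise cost bits and could bleed into the visible area after lossy compression.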

If I understand it correctly, Option 2.1 and Option 2.2 are not mutually exclusive, so we can still do Option 2.1 in the future.

y-guyon commented 2 years ago

I feel like this doesn't require any changes in the library, but just a feature in avifenc itself.

If users of the libavif API would like to generate incrementally decodable images of any size, it would be convenient for them to have a function doing the dirty work of splitting and padding. If we do this work in avifenc, we might as well make it accessible in avif.h "for free".

Grid cells are supposed to be the same size, so I think the library is behaving properly

It matches one interpretation of the specification, yes.

When you store (100+8)x(100+8) AV1 samples, you can only end up with a (100+100)x(100+100) grid item output, where some tiles were deformed by scaling up. The grid item is then cropped to 108x108 (with output_width/height or clap) so some scaled AV1 samples are discarded.

And on the other hand, if you want the correct output for a 108x108 input, it must be padded to 200x200 before encoding, and cropped at decoding to 108x108.

I was mainly pointing out the oddness of being able to store exactly 108x108 AV1 samples in a valid grid AVIF without being able to correctly decode them into a 108x108 image.

I think they expect you to use a crop rect (clap) to dial in the correct size.

This avifenc feature would simply round up the image's size to the next grid cell multiple, and then clap the resultant encoding back to the original dimensions. I don't believe this is one of the listed Options.

Using the clap feature is unnecessary. As mentioned in section 6.6.2.3.1, the reconstructed image is formed by [...] trimming on the right and the bottom to the indicated output_width and output_height. Since output_width and output_height are encoded with every grid box, there is no need for an additional clap box which would have the same effect. The clap box has the same odd-dimension constraints on subsampled chroma samples as grid, if I remember correctly.

We can try Option 2.2 first. For padding values it is common to pad with border pixels.

The only remaining question is the privacy one. If I remember correctly, the clap property is ignored in Chrome for this reason (meaning, always display all pixels). How would it be different for grid? Is it related to the use of should wording on the former and shall on the latter?

If I understand it correctly, Option 2.1 and Option 2.2 are not mutually exclusive, so we can still do Option 2.1 in the future.

Sure, if you are talking about the second alternative of option 2.1 (add output_width and output_height to avifEncoder).

wantehchang commented 2 years ago

The cropping done by a grid image is limited to trimming on the right and the bottom. This is why the privacy concern is less serious than clap. But the cropping of a grid image was also brought up in the discussions of the privacy issues of clap.

y-guyon commented 2 years ago

We can try Option 2.2 first. For padding values it is common to pad with border pixels.

See https://github.com/AOMediaCodec/libavif/pull/1143.