HEIF section 6.6.2.3.1 Image grid derivation Definition is contradictory

y-guyon commented 1 year ago

HEIF (ISO 23008-12) section 6.6.2.3.1 "Image grid derivation Definition" contains the following sentence:

All input images shall have exactly the same width and height; call those tile_width and tile_height.

but also:

The reconstructed image is formed by tiling the input images into a grid with a column width (potentially excluding the right-most column) equal to tile_width and a row height (potentially excluding the bottom-most row) equal to tile_height, without gap or overlap, and then trimming on the right and the bottom to the indicated output_width and output_height.

These seem contradictory to me. Either not all input images have the same dimensions (for example the right-most column is narrower) and the first sentence is false, or they do all have the exact same dimensions and there is nothing to potentially exclude. Trimming is done after all that anyway (indicated by the "then"), so there is no need to allow for a different right-most column width or bottom-most row height before trimming.
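To make the "tile first, then trim" reading concrete, here is a minimal sketch of the reconstruction as the two quoted sentences describe it. It is not taken from any decoder: `reconstruct_grid` is a hypothetical helper, images are modelled as plain lists of pixel rows, and the row-major placement of the inputs is an assumption stated in the comments.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; a stand-in for decoded samples


def reconstruct_grid(tiles: List[Image], rows: int, columns: int,
                     output_width: int, output_height: int) -> Image:
    """Tile equally sized inputs (assumed row-major), then trim right and bottom."""
    if len(tiles) != rows * columns:
        raise ValueError("expected rows * columns input images")
    tile_height = len(tiles[0])
    tile_width = len(tiles[0][0])
    # First quoted sentence: every input image has the same dimensions.
    for tile in tiles:
        if len(tile) != tile_height or any(len(r) != tile_width for r in tile):
            raise ValueError("all input images shall have the same width and height")
    if tile_width * columns < output_width or tile_height * rows < output_height:
        raise ValueError("canvas is smaller than the requested output")

    # Step 1: build the full canvas, without gap or overlap.
    canvas: Image = []
    for grid_row in range(rows):
        row_tiles = tiles[grid_row * columns:(grid_row + 1) * columns]
        for y in range(tile_height):
            line: List[int] = []
            for tile in row_tiles:
                line.extend(tile[y])
            canvas.append(line)

    # Step 2: only now trim on the right and the bottom.
    return [line[:output_width] for line in canvas[:output_height]]
```

In this model every column is `tile_width` wide and every row is `tile_height` high until the final slice, so there is never a narrower right-most column to "exclude".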

I suggest one of the following:

- Removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".
- Rewording the "then", to indicate that some cells are potentially excluded because they are trimmed, not before.
- Weakening the "all input images shall have exactly the same width and height" constraint. This is a breaking change. The advantage of this solution is that you could have a HEIF grid with output_width and output_height that are not multiples of tile_width and tile_height, without storing more pixels than necessary. See https://github.com/AOMediaCodec/libavif/issues/1141 for more information.

leo-barnes commented 1 year ago

I would vote for Removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".

The last option is too much of a breaking change and doesn't really serve a purpose. If you want tiles that have different sizes, you're really meant to use an overlay.

y-guyon commented 1 year ago

I would vote for Removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".

Fine by me.

The last option is too much of a breaking change

I agree.

and doesn't really serve a purpose.

You can definitely do without it, but as I said in the other issue linked above, I find it weird to be able to store exactly WxH AV1 samples in a valid AVIF grid without being able to decode it as a WxH image made of these same pixels, WxH being 100x100 pixels in a 2x2 grid for example.

If you want tiles that have different sizes, you're really meant to use an overlay.

An overlay requires a background image of the dimensions of the decoded image, right? If so, it has the same issue of having to encode more pixels than the decoded dimensions. Does libavif even support overlays as of today?

wantehchang commented 1 year ago

I would also vote for Removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)".

I suspect "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)" are intended to indicate that only the right-most column and the bottom-most row may be trimmed.

Yannis: libavif doesn't support overlays.

leo-barnes commented 1 year ago

You can definitely do without it, but as I said in the other issue linked above, I find it weird to be able to store exactly WxH AV1 samples in a valid AVIF grid without being able to decode it as a WxH image made of these same pixels, WxH being 100x100 pixels in a 2x2 grid for example.

If you want tiles that have different sizes, you're really meant to use an overlay.

An overlay requires a background image of the dimensions of the decoded image, right? If so, it has the same issue of having to encode more pixels than the decoded dimensions. Does libavif even support overlays as of today?

Still not sure I understand what you're saying here. The whole point of the grid is to tile an image into tiles of the same size. When encoding, if the image dimensions are not multiples of the tile dimensions, you need to pad the image edges so that they are (you're not meant to stretch them). All iPhone images are grid images with dimensions that are not a multiple of the tile size. We use an 8x6 grid with tile size 512x512. This means we have a canvas of 4096x3072, but the actual image is only 4032x3024. The rightmost and bottommost tiles are padded so that they are also 512x512.
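The arithmetic behind those numbers can be checked with a few lines. This is only an illustration of the padding rule described above; `grid_layout` and its field names are hypothetical and not part of any library.

```python
from typing import NamedTuple


class GridLayout(NamedTuple):
    columns: int
    rows: int
    canvas_width: int
    canvas_height: int
    pad_right: int   # padded pixel columns in the right-most tiles
    pad_bottom: int  # padded pixel rows in the bottom-most tiles


def grid_layout(image_width: int, image_height: int,
                tile_width: int, tile_height: int) -> GridLayout:
    """Smallest grid of equally sized tiles that covers the image."""
    columns = -(-image_width // tile_width)   # ceiling division
    rows = -(-image_height // tile_height)
    canvas_width = columns * tile_width
    canvas_height = rows * tile_height
    return GridLayout(columns, rows, canvas_width, canvas_height,
                      canvas_width - image_width, canvas_height - image_height)


layout = grid_layout(4032, 3024, 512, 512)
print(layout.columns, layout.rows)                # 8 6
print(layout.canvas_width, layout.canvas_height)  # 4096 3072
print(layout.pad_right, layout.pad_bottom)        # 64 48
```

So in this example the right-most tiles carry 64 padded columns and the bottom-most tiles 48 padded rows, which the trimming step removes after reconstruction.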

If libavif doesn't support grids that are not an exact multiple of the tile size, that's a limitation by libavif, not by the spec.

The reasoning behind this is to make sure that grid decoding and encoding can be done efficiently with HW. Since all the tiles have the same dimensions and codec config, they can use the same HW session. This means you don't need to make context switches in the HW and the HW can do optimal pipelining.

If you want tiles of varying sizes you're meant to use an overlay. But you then lose the efficiency of being guaranteed that all tiles have the same size. Overlays don't require a background image. You just need to specify a background color that is shown for any pixels of the overlay not covered by an actual layer.
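As a rough illustration of that behaviour, the sketch below models an overlay as a canvas pre-filled with a background value, onto which layers of arbitrary size are pasted at given offsets. It is a simplified model, not the HEIF overlay syntax: `compose_overlay`, the single-value `fill_value`, and the `(image, x_offset, y_offset)` layer tuples are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values; a stand-in for decoded samples
Layer = Tuple[Image, int, int]  # (image, x_offset, y_offset); hypothetical shape


def compose_overlay(output_width: int, output_height: int,
                    fill_value: int, layers: List[Layer]) -> Image:
    """Fill the canvas with a background value, then paste layers in order."""
    canvas = [[fill_value] * output_width for _ in range(output_height)]
    for image, x_offset, y_offset in layers:
        for y, row in enumerate(image):
            cy = y + y_offset
            if not 0 <= cy < output_height:
                continue  # this layer row falls outside the canvas
            for x, value in enumerate(row):
                cx = x + x_offset
                if 0 <= cx < output_width:
                    canvas[cy][cx] = value  # later layers overwrite earlier ones
    return canvas
```

Unlike the grid sketch, nothing here requires the layers to share a size, which is exactly the guarantee that is given up.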

y-guyon commented 1 year ago

The whole point of the grid is to tile an image into tiles of the same size. When encoding, if the image dimensions are not multiples of the tile dimensions, you need to pad the image edges so that they are (you're not meant to stretch them). All iPhone images are grid images with dimensions that are not a multiple of the tile size. We use an 8x6 grid with tile size 512x512. This means we have a canvas of 4096x3072, but the actual image is only 4032x3024. The rightmost and bottommost tiles are padded so that they are also 512x512.

Thank you for the explanation.

If libavif doesn't support grids that are not an exact multiple of the tile size, that's a limitation by libavif, not by the spec.

At encoding, not yet. https://github.com/AOMediaCodec/libavif/pull/1143 was meant to fix that.

The reasoning behind this is to make sure that grid decoding and encoding can be done efficiently with HW. Since all the tiles have the same dimensions and codec config, they can use the same HW session. This means you don't need to make context switches in the HW and the HW can do optimal pipelining.

I can imagine HW needs such constraints for efficiency. But are you talking about codec-level "pixel-decoding" HW, container-level "tile-rendering" HW, or both? Because for the former, the current HEIF specification does not enforce any codec-level dimensions in each cell of a grid. On the contrary, it even allows stretching, which is not a common step in optimal pipelining.

Still not sure I understand what you're saying here.

So my question was: why is smaller AV1 image + stretching allowed in an AVIF grid cell, but smaller AV1 image without stretching forbidden in a grid cell?

Anyway it is not that important so let's not go too deep into the discussion.

If you want tiles of varying sizes you're meant to use an overlay. But you then lose the efficiency of being guaranteed that all tiles have the same size. Overlays don't require a background image. You just need to specify a background color that is shown for any pixels of the overlay not covered by an actual layer.

I was not aware of that HEIF feature. It could have been a potential tool for incremental decoding, but implementing it efficiently and securely in libavif is probably involved.

denoualf commented 1 year ago

On the contradiction between:

 All input images shall have exactly the same width and height; call those tile_width and tile_height.

and

The reconstructed image is formed by tiling the input images into a grid with a column width (potentially excluding the right-most column) equal to tile_width and a row height (potentially excluding the bottom-most row) equal to tile_height, without gap or overlap, and then trimming on the right and the bottom to the indicated output_width and output_height.

I would like to recall that this was the intended design in HEIF:

- the 1st statement avoids indicating the size of the cells (they correspond to the size of input images); instead a number of cells per line or column is indicated (rows_minus_one and columns_minus_one)
- the 2nd statement (trimming right and bottom) is to make sure that the resulting image from the grid fulfills the output_width and output_height indicated in the grid payload.
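For readers who have not looked at the grid payload, the following sketch shows how those four fields could be read and checked against a tile size. The byte layout (one version byte, one flags byte whose lowest bit selects 16-bit or 32-bit size fields, then rows_minus_one, columns_minus_one, output_width and output_height, all big-endian) is written from memory of the ImageGrid syntax and should be treated as an assumption to verify against ISO/IEC 23008-12. `parse_image_grid` and `tiles_cover_output` are hypothetical helper names.

```python
import struct
from dataclasses import dataclass


@dataclass
class ImageGrid:
    rows: int
    columns: int
    output_width: int
    output_height: int


def parse_image_grid(payload: bytes) -> ImageGrid:
    """Parse a grid payload under the layout assumed in the lead-in."""
    version, flags, rows_minus_one, columns_minus_one = struct.unpack_from(
        ">BBBB", payload, 0)
    if version != 0:
        raise ValueError("unsupported ImageGrid version")
    # Assumed: (flags & 1) selects 32-bit instead of 16-bit size fields.
    size_format = ">II" if flags & 1 else ">HH"
    output_width, output_height = struct.unpack_from(size_format, payload, 4)
    return ImageGrid(rows_minus_one + 1, columns_minus_one + 1,
                     output_width, output_height)


def tiles_cover_output(grid: ImageGrid, tile_width: int, tile_height: int) -> bool:
    """True if the tiled canvas is at least as large as the declared output."""
    return (tile_width * grid.columns >= grid.output_width
            and tile_height * grid.rows >= grid.output_height)
```

The cell size never appears in the payload; it is inferred from the input images, which is why the "same width and height" constraint matters.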

On fitting the "tiles" of different sizes into the grid, please look at the NOTE 1 in section 6.6.2.3.1.

The grid construction is something that is now widely used and we would recommend to not change its definition.

y-guyon commented 1 year ago

the 2nd statement (trimming right and bottom) is to make sure that the resulting image from the grid fulfills the output_width and output_height indicated in the grid payload.

Trimming the reconstructed grid is fine if all cells share the same dimensions. It just does not make sense to crop the reconstructed grid if the bottom-most row and right-most column did not even respect tile_width and tile_height *before the cropping*.

The grid construction is something that is now widely used and we would recommend to not change its definition.

"Not change its definition" as in "not clarify the wording" or "not change the behavior"?

Multiple people voted for removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)". How about we just do that?

leo-barnes commented 1 year ago

@denoualf

the 2nd statement (trimming right and bottom) is to make sure that the resulting image from the grid fulfills the output_width and output_height indicated in the grid payload.

Sure. But the wording is slightly contradictory. All the grid cells are the same size. Even the edge cells. Trimming is done afterwards. So the text that says (potentially excluding the **) is just confusing. If trimming is done after compositing, all the grid columns and rows are the same size and the text in the parentheses is redundant (and confusing).

On fitting the "tiles" of different sizes into the grid, please look at the NOTE 1 in section 6.6.2.3.1.

Right. Derivations can be used to make input images actually have the correct size (or you could use transform properties to scale/crop/rotate). But there is no automatic scaling, cropping or similar. The size of the input images has to match the tile size.

leo-barnes commented 1 year ago

The agreement when discussed in this MPEG meeting was that I would read through the discussion and write up a proposal for the next meeting on how the text should be changed.

Removing "(potentially excluding the right-most column)" and "(potentially excluding the bottom-most row)" to me makes the text clearer without changing how it works, so that will be my suggestion for the next meeting.

y-guyon commented 1 year ago

Side note: We could encode exactly output_width×output_height pixels in a grid even if those dimensions are not multiples of tile_width and tile_height, by using derived image items for the bottom-most and right-most cells, with a transformative item property that adds padding. But I think there is no padding property.

leo-barnes commented 1 year ago

MPEG 143: Accepted into Potential improvements.

leo-barnes commented 1 year ago

@cconcolato @y-guyon I don't have the ability to close this issue it seems.

MPEGGroup / FileFormat

HEIF section 6.6.2.3.1 Image grid derivation Definition is contradictory #66