Reducing metadata overhead, especially with tiled images

MPEGGroup / FileFormat

MPEG file format discussions

23 stars 0 forks source link

Reducing metadata overhead, especially with tiled images #106

Open farindk opened 1 month ago

farindk commented 1 month ago

HEIF images that consist of many tiles contain a lot of redundant information:

each tile has an infe item entry with usually identical content,
the ipma property associations for each tile are usually referring to the same properties,
the iref of type dimg, defining the grid image tiles, usually consists of a long list of consecutive item IDs (1,2,3,4,5,6, ...).

On images with many tiles (#105), this can lead to significant overhead. E.g. 21 bytes for each infe, 2 or 4 bytes for each iref reference, and usually >=6 bytes for each ipma entry. For 256x256 tiles, this sums to at least 2 MB of unnecessary overhead.

In order to reduce this amount of data, I propose the following changes:

Extend infe to hold not only an item_ID, but a range of item_IDs. The content of a single infe will then define all items in the given range. This could be implemented with a new infe box version or flag that enables storing an item_ID_count. The range would then be defined by item_ID to item_ID + item_ID_count.
Similarly, a new ipma entry could store a range of item_IDs instead of a single item_ID to which it relates in order to eliminate the unnecessary copies.

iref could be extended with a version=2, operating as follows:


class ItemReferenceBox extends FullBox(‘iref’, version, 0) {
...
if (version==2) {
SingleItemTypeReferenceBoxRanges references[];
}
}

class SingleItemTypeReferenceBoxRanges(referenceType) extends Box(referenceType) { unsigned int(32) from_item_ID; unsigned int(32) reference_count; for (j=0; j<reference_count; ) { unsigned int(32) to_item_ID; unsigned int(32) count_minus_one;

j = j + count_minus_one + 1;

} }


This defines a simple run-length encoding, but still allows to assign non-consecutive itemIDs.
For example, the `iref` list `[1,2,3,4,5,6,7,8,9,10]` would be encoded simply as `{1,9}` (`to_itemID=1`, `count_minus_one=9`), and `[1,2,3,4,5, 12, 7,8,9,10]` as `[{1,4}, {12,0}, {7,3}]`.
Other, improved coding schemes are also possible, of course.

leo-barnes commented 1 month ago

This seems like a reasonable thing to add.

It kind of relates to the exploration on compressed meta boxes that was discussed at the last meeting. That proposal was dropped in favor of the current dedicated low-overhead approach since that gives better results for simple files, but the conclusion of the discussion was that it would help a lot with large and complicated files.

Running deflate on the meta box would give you even better compression than the syntax above, but you would still have to parse all of the entries. So maybe we the best approach is to have both.

y-guyon commented 1 month ago

For 256x256 tiles, this sums to at least 2 MB of unnecessary overhead.

Could you share a range of the expected file size of a file representative of the use case you mentioned?

HEIF images that consist of many tiles

Is this use case sharing other complex ISOBMFF/HEIF features?

If so, maybe these 2MB are insignificant compared to other bloated meta components.
Otherwise, should we consider adding container tiles to SlimHEIF (w23988)?
I would be against such a change for the sake of simplicity and because codec tiles can be used instead of container grids.

farindk commented 1 month ago

Running deflate on the meta box would give you even better compression than the syntax above, but you would still have to parse all of the entries. So maybe we the best approach is to have both.

A simple encoding that matches the typical data is usually better than deflate:

I tried to compress the byte sequence consisting of 0x00 to 0xff (0 1 2 3 4 ... 255). With deflate, those 256 bytes "compressed" to 261 bytes.

The 512 bytes sequence 0x0000 to 0x00ff (as they appear in the iref) only compressed to 347 bytes. The simple scheme like proposed above would only use 4 bytes for the same.

For the extreme case of 256x256 tiles, we have the iref sequence 0x0000 to 0xffff. These are 2*65536=131072 bytes (ignoring the fact that the grid image itself will need another ID, so that we would have to switch to 4 bytes, doubling the memory size). With deflate, this compresses to 118960 bytes, which is only 10% smaller. With the simple scheme above, still only 4 bytes.

leo-barnes commented 1 month ago

Running deflate on the meta box would give you even better compression than the syntax above, but you would still have to parse all of the entries. So maybe we the best approach is to have both.

A simple encoding that matches the typical data is usually better than deflate:

I tried to compress the byte sequence consisting of 0x00 to 0xff (0 1 2 3 4 ... 255). With deflate, those 256 bytes "compressed" to 261 bytes.

The 512 bytes sequence 0x0000 to 0x00ff (as they appear in the iref) only compressed to 347 bytes. The simple scheme like proposed above would only use 4 bytes for the same.

For the extreme case of 256x256 tiles, we have the iref sequence 0x0000 to 0xffff. These are 2*65536=131072 bytes (ignoring the fact that the grid image itself will need another ID, so that we would have to switch to 4 bytes, doubling the memory size). With deflate, this compresses to 118960 bytes, which is only 10% smaller. With the simple scheme above, still only 4 bytes.

Fair point. I would have expected deflate to be able to do better here, but it's been a long time since I looked into exactly how it codes stuff. I thought it could handle incrementing sequences, but I must have misremembered.

bradh commented 1 month ago

For 256x256 tiles, this sums to at least 2 MB of unnecessary overhead.

Could you share a range of the expected file size of a file representative of the use case you mentioned?

There are a couple of different aspects to this - "storage size on disk" and "actually transferred bytes" might be ways to describe it.

Reducing the meta size on a small image makes a proportionally larger difference to storage size for object storage, but may not make any difference for block storage (because its 4K blocks, typically).

On a huge image, reducing the meta size makes almost no difference proportionally to the overall file size. However if that is on an object store (like S3 bucket or other cloud equivalent), there is a lot to be gained by making the meta smaller. Consider that the user may not be viewing the whole image at full resolution. Instead, they are probably interested in a smaller region of interest. This is one of the cases for Cloud Optimised GeoTIFF (aka COG).

If the meta was smaller such that the initial HTTP GET operation (maybe for a byte range of the first 128K, 4M or something like that) got all of the information required to determine how to decode the structure, and ideally provide an overview image (top of the pymd), then we can provide a good user experience.

If meta is large, such that you miss part of it in the first GET, you have another round trip before you can do anything.

As the user zooms in, we can select which parts of the image we want using Range header byte ranges to only download the parts of the file we will show to the user. Actual transferred bytes could be (and often will be, in this use case) much less than the total size.

So while the total image size is still important, it may not be that useful a metric here.

y-guyon commented 1 month ago

Got it, thank you for the detailed explanation.

What would be the reduced meta size with this proposal, when applied to the 2MB example above?
I would imagine boxes such as iloc cannot be modified in the same run-length fashion as iref and others.
So maybe the whole meta size will remain in the same magnitude range; and deflate-like schemes could result in smaller files when applied to the whole meta box instead of applied to specific sequences as tested in https://github.com/MPEGGroup/FileFormat/issues/106#issuecomment-2228211076.

leo-barnes commented 1 month ago

In order to put some numbers on @y-guyon comment above, I decided to take a look at the largest HEIF file I have created. It's a 93356x32872 HEIF file using 10974 512x512 tiles, with each tile having ispe and hvcC. So not close to the limit of 64k tiles, but much larger than normal use-cases.

('meta' "Meta Box", size = 483819)
('iinf' "Item Information Box", size = 230552)
('iref' "Item Reference Box", size = 22014)
('iprp' "Item Properties Box", size = 55473)
    ('ipco' "Item Property Container Box", size = 560)
    ('ipma' "Item Property Association", size = 54905)
('iloc' "Item Location Box", size = 175664)

Size-wise, the iinf and iloc are the boxes to care mostly about, followed by ipma and iref.

Assuming 4 byte item IDs.
For iloc, assuming no base offset and offset and length sizes of 4.

iinf: 8 + 4 + 4 + 2 + 4 + 1 = 23 * numTiles
iref: 4 + 4 + 2 + numTiles 4 = 10 + numTiles 4
ipma: ((4 + 1) + 1 numProperties) numTiles
iloc: (4 + 2 + 2 + 2 + 4 + 4) numTiles = 18 numTiles

iinf, iref and ipma could all get dedicated run-length coding as proposed above, but that leaves us with iloc.

Taking the meta box from the file above, dumping it and running tar with various algorithms on it gives us the following sizes as reference:

no compression: 483819
gzip: 170099
bzip2: 143964
lzma: 73238

Taking the iloc box from the file above and doing the same exercise gives us:

no compression: 175664
gzip: 95520
bzip2: 87451
lzma: 50244

So deflate definitely helps. If used on the full meta box it gives us roughly the same size as you would get by implementing the dedicated run-length encoding schemes on iinf, iref and ipma. But the optimal approach would be to do all of the dedicated schemes and use something like lzma.

bradh commented 1 month ago

For the iloc entries in that specific, is it all single extent per item and all in mdat?

Of those extents, how many follow-on such that the first byte of item n is immediately after the last byte of item n-1?

My interest is in whether offset + extent for the previous item can determine the offset of this item.

So you could get an effect of "items 1 to 10974 have extents [ ], first item at (base) offset x".

leo-barnes commented 1 month ago

For the iloc entries in that specific, is it all single extent per item and all in mdat?

Yes

Of those extents, how many follow-on such that the first byte of item n is immediately after the last byte of item n-1?

All of them

My interest is in whether offset + extent for the previous item can determine the offset of this item.

So you could get an effect of "items 1 to 10974 have extents [ ], first item at (base) offset x".

Yes, that's definitely something that could be done.

The file basically has something like this:

...
      Item ID: 8
        Construction method: 0, Data reference index: 0
        Extent 1/1: Offset 1652670, Length 3092
      Item ID: 9
        Construction method: 0, Data reference index: 0
        Extent 1/1: Offset 1655762, Length 3118
      Item ID: 10
        Construction method: 0, Data reference index: 0
        Extent 1/1: Offset 1658880, Length 3088
      Item ID: 11
        Construction method: 0, Data reference index: 0
        Extent 1/1: Offset 1661968, Length 3219
      Item ID: 12
        Construction method: 0, Data reference index: 0
        Extent 1/1: Offset 1665187, Length 3206
...

For all of these the offset is the offset+length of the previous item. So the only thing that would really be needed would be to store the starting offset, the range of item IDs and the lengths for each item. In this case we could get away with using a 2-byte length for each entry which would save a lot of space.

If you want to get fancy you could also experiment with trying to predict the length from the previous length, but that could break down fast depending on your content and won't really save you that much unless we start doing variable length integers.

farindk commented 1 month ago

Here is an example image with 256x255 tiles: large tiled images (the image file is quite large as it uses JPEG compression). The meta box total size is 3,329,568 bytes. This includes:

iinf: 1,370,907 bytes
ipma: 391,700 bytes
iref: 130,584 bytes
iloc: 1,436,192 bytes

So deflate definitely helps. If used on the full meta box it gives us roughly the same size as you would get by implementing the dedicated run-length encoding schemes on iinf, iref and ipma.

~~I have to disagree.~~ With the syntax proposed above for iinf, iref, ipma, each of them would have about the size of a single item + 4 or 8 bytes. Or, if I counted correctly:

iinf : 38 bytes (iinf header + 1 infe entry for all tiles)
ipma: 25 bytes (ipma header + 1 item entry for all tiles)
iref: 30 bytes (iref header + 1 range for all tiles)

The iloc box could also use a much more compact syntax. The idea would be that for a range of items (all the tiles), we assume that they are stored in the mdat one after another and then only have to provide pointers to the start of each tile. This is similar to how h265 does this in the slice_segment_header:

h265 uses an adjustable number of bits for the size of each tile, but to simplify parsing and be in line with how other boxes work, we could use either 4 or 8 bytes for each tile. This would reduce the size of iloc to something like:

16 bytes for the box header
4 bytes for the size of each tile Total iloc size: 16 + 4 256 255 = 261,136 bytes.

If we would further allow tile sizes to be stored at lengths other than 4 or 8 bytes, this could be reduced even further to maybe at most half of that.

I can write a detailed proposal for an iloc syntax if that helps.

leo-barnes commented 1 month ago

@farindk

I have to disagree. With the syntax proposed above for iinf, iref, ipma, each of them would have about the size of a single item + 4 or 8 bytes.

You can't really disagree that it helps, I gave you actual numbers for an actual file. Just using deflate on the full meta is equivalent with completely removing iinf, ipma and iref. So deflate ~= implementing efficient schemes for iinf, iref and ipma.

That doesn't mean you can't get even better results by implementing the range schemes for iinf, iref, ipma and iloc. But the most optimal solution is very likely all of those schemes plus a lossless compression algorithm.

farindk commented 1 month ago

@leo-barnes Ah, sorry, I misunderstood your experiment. I tought you were applying deflate only on iinf,ipma, and iref and comparing that to the range-based syntax. But yes, if you also include iloc and don't apply an improved syntax for iloc, you will get those numbers.