SixLabors / ImageSharp

:camera: A modern, cross-platform, 2D Graphics library for .NET
https://sixlabors.com/products/imagesharp/

Quantization: reduce memory footprint without compromising speed and quality #1350

Closed antonfirsov closed 3 years ago

antonfirsov commented 4 years ago

All our current quantizers rely on a very heavy cache in order to produce acceptable quality: https://github.com/SixLabors/ImageSharp/blob/b06cb32b7114961fd5473f7645d38f8fee04ec64/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap%7BTPixel%7D.cs#L21

With large images this could consume up to several megabytes of memory every time a larger GIF (or palettized PNG) gets saved. We need to reduce memory usage without compromising speed and quality.

We ran a couple of experiments and exchanged many ideas a few months ago, but unfortunately those discussions got lost in the noise of the Gitter chatroom. I suggest we recollect and rediscuss those ideas here.

Improvements we may consider (according to what I remember from the gitter chat):

  1. Improving the RGB Octree implementation to make it very fast by default so we hopefully don't need any cache. Most important: flattening the tree nodes into array(s) of structs (instead of heap objects).
  2. Extend the octree to RGBA
  3. Replace the dictionary with this cache. It produced promising results in my experiments.
  4. New idea from @JimBobSquarePants: replace the dictionary with an LRU cache implementation like the one in BitFaster.Caching.
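Improvement 1 (flattening tree nodes into arrays of structs) can be sketched independently of C#. A minimal Python model of the idea (names are mine, purely illustrative): each "node" is just 8 consecutive child slots in one preallocated flat array, so growing the tree never allocates a per-node heap object.

```python
from array import array

class FlatOctree:
    """Sketch of 'flattened' tree nodes: children live in one flat int
    array (8 slots per node, 0 = empty), allocated once up front, so
    adding a node is just bumping a counter into the slab."""
    def __init__(self, max_nodes=4096):
        self.children = array('i', bytes(4 * 8 * max_nodes))  # zero-initialized slab
        self.count = 1                                        # slot 0 is the root

    def get_or_add_child(self, node, slot):
        i = node * 8 + slot
        if self.children[i] == 0:        # "allocate" the child from the slab
            self.children[i] = self.count
            self.count += 1
        return self.children[i]
```

In C# the same shape would be a `struct` node array rented from `ArrayPool<T>`, which is exactly what makes it cache-friendly and GC-free.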

@saucecontrol since I remember you had really valuable ideas here, I would really appreciate your feedback on all 4 options, and apologies if I'm making you repeat yourself because of my terrible memory :smile: If I remember correctly, you also had some promising results in your repos.

Bravely assigning milestone 1.1 for now.

saucecontrol commented 4 years ago

apologies if I'm making you repeat yourself because of my terrible memory

I'm afraid I don't recall much detail of those conversations myself 😄

I did your option 1 for mine, using an octree for the histogram, palette generation, and palette lookup. Unfortunately, due to C#'s lack of first-class support for discriminated unions and the JIT's insistence on inserting null checks when dereferencing a member from a struct pointer, the code ended up being quite ugly for perf reasons. But it gives really nice visual results with only about 280KiB of rented pool memory, and in my limited testing was on average an order of magnitude faster than the current ImageSharp octree implementation.

You can find my implementation here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/ColorQuantizer.cs

Mine differs quite a bit from any other octree implementation I've seen, which makes it a little difficult to follow, but there's method to any madness you see in there, and I'm happy to answer any questions. Unfortunately I haven't the drawing skills to do good diagrams of how it works, so the code will have to speak for itself for now.

The major architectural differences:

  1. I made it GC allocation free by making the octree node a struct and renting a (256KiB) slab of memory from the ArrayPool to hold all the nodes I'll ever have. Once I fill that slab, I reduce the node set by cutting the depth of the tree by one level. Because the tree depth is variable, it has perfect fidelity for low-color images, and it gets faster per-color the more colors it encounters. And when reducing the palette from the possible 8K nodes in the histogram, I weight them based on population count and distance to the parent node color to prioritize the most important colors. This gives better results than simply merging the first available reducible node, and it allows the palette to have the max possible entries rather than coming up short when all children of a reducible parent are merged.
  2. I use a set of LUTs to quickly build the entire octree index for any given input color by interleaving the channel bits, so navigating the tree can be done by simply shifting off 3 bits of that pre-calculated index at a time (I reverse the bit order so the next level's index is the low 3 bits). In histogram mode, the tree keeps color values and pixel counts at the node level. In lookup mode, it keeps the palette index and value. I build the octree index in GRB order so that the index roughly approximates luma. When looking for the nearest color within the tree, the nearest populated index at a given level is also going to be the nearest color, provided you search them in the right order. If a node doesn't exist at the minimum leaf level I've defined, I do a full search through the palette and cache the result by creating the missing node.
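The bit-interleaving scheme described above can be sketched in a few lines. This is a Python model of the concept, not the MagicScaler code; function names are mine. In the real implementation the per-pixel loop would be replaced by per-channel 256-entry LUTs of pre-spread bits OR'd together.

```python
def octree_index(r, g, b):
    """Pack the channel bits in G,R,B order (so the index roughly tracks
    luma), with the bit order reversed so that the level-0 child slot sits
    in the low 3 bits and navigation is just `idx & 7` then `idx >>= 3`."""
    idx = 0
    for level in range(8):
        bit = 7 - level                  # level 0 keys off the high channel bits
        group = (((g >> bit) & 1) << 2) | (((r >> bit) & 1) << 1) | ((b >> bit) & 1)
        idx |= group << (3 * level)      # reversed: level 0 in the low bits
    return idx

def walk(idx, depth=8):
    """Child slots visited from the root down, shifting off 3 bits at a time."""
    slots = []
    for _ in range(depth):
        slots.append(idx & 7)
        idx >>= 3
    return slots
```

With the pre-computed index in hand, descending the tree is branch-free arithmetic per level, which is what makes the traversal fast enough to double as the cache.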

The LinearBoxesMap idea is clever, but its weakness is of course that it has a pre-defined accuracy limit. If you're trying to fit a 300-color image into a 256-color palette, you should be able to keep perfect matches for the majority of the colors. An octree is great for that because it's a sparse structure, so as long as you make the lookups/navigation fast enough, it does a great job.

The same ideas could be applied to support RGBA in a 'hexadecitree' with 4 bits per node index instead of 3. The alpha channel could be weighted less or quantized in advance. I was only going for GIF support initially, so I haven't yet done any experiments to figure out the best number of nodes, etc.

antonfirsov commented 4 years ago

@saucecontrol thanks for the very detailed reply! What is the pixel format of Span<byte> image? (I haven't had a chance to dig into your code yet.)

Our big problem is that currently we need the following cache to produce good enough quality when images are transparent: https://github.com/SixLabors/ImageSharp/blob/b06cb32b7114961fd5473f7645d38f8fee04ec64/src/ImageSharp/Processing/Processors/Quantization/OctreeQuantizer%7BTPixel%7D.cs#L114-L120

@JimBobSquarePants in the case of pixels with no alpha channel, we likely don't need this path. For users who care about perf we can then suggest working with Image<Rgb24> all the way. When an alpha channel is present: what if we premultiply with alpha before/while saving the image, instead of using the Euclidean map?

JimBobSquarePants commented 4 years ago

@antonfirsov I found that the difference in quality was noticeable with or without transparency when I experimented with removing the color map entirely.

Looking back at EuclideanPixelMap<TPixel> though I don't know why we use Vector4. Both Octree and Wu convert to Rgba32 when building histograms so we don't need the extra precision. Perhaps we can do something faster there?

saucecontrol commented 4 years ago

What is the pixel format of Span<byte> image?

My implementation works only on BGR24 input. In the final palette derivation step, when merging nodes, I convert the colors to 32bpc linear for better blending/averaging results.

I also have only a single dithering algorithm (a modified Floyd-Steinberg) integrated directly into the palette matching for perf reasons. I didn't really investigate how the speed difference between mine and the current ImageSharp implementation splits across the histogram, lookup, and dithering parts of the code, but if your main concern is reducing memory, I do believe you can eliminate the cache without a negative perf impact as long as your octree traversal is fast. And of course a faster octree speeds up your histogram step as well, so it may be a win-win.

The key to keeping memory usage under control while keeping accuracy is to allow the lookup octree to reach full 8 bit depth for the nodes that contain your palette entries but then restrict the depth for parts of the gamut not covered by the palette. My implementation caps that at 4 bits (a max of 4096 nodes) so I can fit everything in my 256KiB slab. You give up some accuracy on those, but since they represent colors not in the original image anyway (they're used for colors created by dithering error propagation), it's not a big deal.
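The node-count trade-off above can be made concrete with a tiny model. This is my own back-of-the-envelope accounting, not code from MagicScaler; the per-palette-entry path length is a worst-case assumption:

```python
def node_budget(palette_size, cache_bits=4):
    """Rough upper bound on lookup-tree nodes: each palette color may
    need a full-depth chain (at most 8 nodes along one path, worst case),
    while colors outside the palette share a region capped at
    `cache_bits` of per-channel accuracy."""
    full_depth_nodes = palette_size * 8          # assumption: one 8-level path per entry
    cache_nodes = (1 << cache_bits) ** 3         # 16^3 = 4096 nodes at 4 bits
    return full_depth_nodes + cache_nodes
```

At 4 bits the capped region tops out at 4096 nodes, matching the figure above, which is what lets everything fit in a fixed-size slab.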

Again, my use case was animated GIF, so it was very important to me to get it completely optimized for processing hundreds of RGB frames at a go. I don't know if you'd want to generalize your implementation more to allow the same code to handle RGB and RGBA, but if it were me, I'd have separate implementations for those. The RGBA path would naturally require more memory, but my guess would be that around 2MiB would do it if you do some psychovisual trickery with the alpha channel ;)

antonfirsov commented 4 years ago

Looking back at EuclideanPixelMap though I don't know why we use Vector4. Both Octree and Wu convert to Rgba32 when building histograms so we don't need the extra precision. Perhaps we can do something faster there?

https://github.com/SixLabors/ImageSharp/blob/b06cb32b7114961fd5473f7645d38f8fee04ec64/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap%7BTPixel%7D.cs#L20-L21

@JimBobSquarePants Vector4 is not the problem here since paletteCache has only 256 elements. It's the dictionary that grows out of control. I agree with @saucecontrol that the fast octree itself should play the role of the cache, and ideally we should eliminate the dictionary or any similarly heavy lookup. Based on https://github.com/SixLabors/ImageSharp/issues/1350#issuecomment-692979241, I think MagicScaler's octree should provide better results in terms of quality than ours. To me this sounds like something worth a POC as a next step. @JimBobSquarePants are you still skeptical? Do you see any issues I don't? (Especially re dithering.)

Note: converting pixel spans to Bgr24 should be cheap after implementing #1354

JimBobSquarePants commented 4 years ago

@antonfirsov I only mention the Vector4 usage because converting to/from RGB compatible types is much faster.

My biggest concern here is that we are focusing too much on Octree. All of the quantizers use the pixel map to improve output quality (due to naïve color distance formula) and two of the quantizers depend on it entirely for palette application.

https://github.com/SixLabors/ImageSharp/blob/b06cb32b7114961fd5473f7645d38f8fee04ec64/src/ImageSharp/Processing/Processors/Quantization/PaletteQuantizer%7BTPixel%7D.cs#L63-L64

So... perhaps we can do something to allow fast linear color space conversion using LUTs, since we're working with pixel formats with byte accuracy in the quantizers (@saucecontrol I know you do this, but I can never read your code -- I'm convinced you've been sent here from the future). That could improve the accuracy of the built-in palette matching mechanisms. Maybe if the quality is improved we can drop the pixel map entirely.

Then we can figure out what to do for the other quantizers that depend on it.

The pixel map was always a crutch. That's why I buried it far away from the API surface.

saucecontrol commented 4 years ago

@JimBobSquarePants I think I might be confused about how your implementation works now. I thought the purpose of the pixel map was to improve speed by eliminating the need for a full search through the palette to find the nearest palette color for each input pixel color. You're saying it has something to do with improving quality? How does that work?

You can actually use an octree for palette lookup regardless of how the palette was originally obtained, so the idea isn't necessarily that you go all octree all the time. If you needed to map an image to e.g. a standard web palette, you would build out an octree with the values from the palette to start. As you map the input image to the palette, any time you navigate the octree and the node for your pixel color doesn't exist, you do the linear search through the palette and then cache the winner by creating that node so that next time you see that color or one like it, it's already there. Essentially, the octree works as a sparse cache in that mode. The real win is in the fact that it can group similar colors instead of having a key per distinct input value, because you pick the max tree depth for those cache nodes. It's actually very similar to @antonfirsov's LinearBoxesMap concept except that the accuracy is variable (from 4 to 8 bits in my implementation) instead of being fixed at e.g. 5 bits, and any unused part of the color space never gets allocated any memory space.
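The sparse-cache behavior described above can be modeled in a few lines. This is a hedged Python sketch, not the actual octree: a dict keyed on colors truncated to `bits` per channel stands in for "a node at that tree depth", and a miss does the linear Euclidean search once before caching the winner.

```python
def dist2(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class SparsePaletteCache:
    """Model of the octree-as-cache idea: only touched regions of the
    color space ever get an entry, and similar colors share one node."""
    def __init__(self, palette, bits=5):
        self.palette = palette
        self.shift = 8 - bits
        self.nodes = {}                  # sparse: unused gamut costs nothing

    def nearest_index(self, color):
        key = tuple(c >> self.shift for c in color)
        if key not in self.nodes:        # miss: brute-force search, then "create the node"
            self.nodes[key] = min(range(len(self.palette)),
                                  key=lambda i: dist2(self.palette[i], color))
        return self.nodes[key]
```

The real octree refines this further by letting the depth (and therefore the sharing granularity) vary per region instead of being fixed.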

JimBobSquarePants commented 4 years ago

@saucecontrol Using the pixel map actually results in much slower quantization than not using one.

The quality improvement is achieved by finding the closest color in the resolved palette to the given pixel via Euclidean distance. I originally started using that approach because a bug in my dithering caused exceptions when trying to match the dithered result against the palette. I stick with it now that the issue is gone because I still see quite dramatic differences in quality when compared against both the Wu and Octree quantizers.

The dictionary caches results since the distance calculation is horribly slow.

https://github.com/SixLabors/ImageSharp/blob/b06cb32b7114961fd5473f7645d38f8fee04ec64/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap%7BTPixel%7D.cs#L72-L103

I'm intrigued by the variable bit depth approach. From our previous conversations, the color output of your quantizer was amazing (though you were having dithering issues).

antonfirsov commented 4 years ago

All of the quantizers use the pixel map to improve output quality

This is the problem. We need a significantly more memory-efficient lookup mechanism.

Octree comes in the picture as probably the best candidate for such a lookup in case of 3-component images, assuming that @saucecontrol 's implementation is producing decent quality, and it's possible to integrate it with our generic dithering logic.

saucecontrol commented 4 years ago

Ah, yeah, we're talking about the same thing. I think @antonfirsov and I were phrasing it poorly by talking about eliminating the cache. The map is a necessity as long as you're doing dithering, because even if you had a perfect histogram to start (which you can't always), error diffusion might still introduce colors that fall outside the input image's gamut. And you'll want that map to cache any new palette entries it has resolved.

The real issue is that a dictionary isn't a good map/cache because its keys are opaque integers and its allocations are GC-heavy. Using an octree as the map/cache can potentially solve both of those.

I don't know if there's any part of my code you can lift directly, just because it's so tuned for speed specific to my use case. Reviewing it now, there's some ugly stuff in there. Conceptually, though, I think you could model after it, because I haven't seen anything that balances quality, speed, and memory as well as it does.

I did end up resolving my dithering issues we chatted about on Gitter, by the way. My final implementation just ended up being Floyd-Steinberg with the error weighted to 7/8. I'd say the quality is generally equal to or better than what ImageSharp does today, although there's not a ton of improvement to be had there -- your implementation is already very good quality-wise.

JimBobSquarePants commented 4 years ago

The real issue is that a dictionary isn't a good map/cache because its keys are opaque integers and its allocations are GC-heavy. Using an octree as the map/cache can potentially solve both of those.

Yep, that's why I was going to just insert an LRU cache in there for now but if there's a better cache then I'm happy for anyone to have a go at replacing it.

If there was a cheaper way to match the dithered result to the palette and retain (or improve) output quality then I'm all ears. It'll need someone else to experiment/implement though.

antonfirsov commented 4 years ago

I don't think a dictionary-based LRU cache is a good candidate: it's still very heavy, and at the same time likely much slower than a plain dictionary. In other words, it seems like low-hanging fruit, but I'd bet with 95% confidence that it just won't work well enough, so I would discourage investing in it.

If we want a "best effort" map as a quick workaround, let's use LinearBoxesMap. It can cache up to 4 colors per 5-bit box for an exact match, just like a dictionary. I made some experiments to investigate the likelihood of colors overflowing boxes, and for most palettes it was very unlikely. In those rare cases when a color does not fit, the map returns a "miss" and the linear Euclidean search path activates (just as it would with an LRU). For images with a less uniform palette distribution (e.g. monochrome or monocolor images), the result will be much worse of course.
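As I understand the description above, LinearBoxesMap behaves roughly like this (an illustrative Python sketch, not the actual experiment code; names are approximate):

```python
class LinearBoxesMap:
    """Boxes at 5-bit per-channel accuracy, each holding up to 4 exact
    (color -> palette index) entries. An absent color, or a color whose
    box is already full, reports a miss and the caller falls back to the
    linear Euclidean search."""
    CAPACITY = 4

    def __init__(self, bits=5):
        self.shift = 8 - bits
        self.boxes = {}

    def _key(self, color):
        return tuple(c >> self.shift for c in color)

    def try_get(self, color):
        for stored, idx in self.boxes.get(self._key(color), ()):
            if stored == color:
                return idx
        return None                      # miss: caller does the slow search

    def add(self, color, idx):
        box = self.boxes.setdefault(self._key(color), [])
        if len(box) < self.CAPACITY:     # overflow: silently drop; future lookups miss
            box.append((color, idx))
```

The fixed capacity is the pre-defined accuracy limit mentioned earlier in the thread: unlike a sparse octree, a crowded box can't be refined further.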

antonfirsov commented 4 years ago

But in general, I would suggest investing into @saucecontrol 's octree instead of trying half-solutions.

I don't know if there's any part of my code you can lift directly, just because it's so tuned for speed specific to my use case.

What bits are specific to your use case? Is there anything else than the hard-coded dithering and pixel format?

if you had a perfect histogram to start (which you can't always), error diffusion might still introduce colors that fall outside the input image's gamut. And you'll want that map to cache any new palette entries it has resolved.

Without doing too much reading, I would assume it works as follows in your code:

GetPaletteIndex(color):
    ditheredColor = dither(color)
    return octree.FindIndex(ditheredColor) # can this add a new palette entry?

So the difference between the MagicScaler and ImageSharp implementations is that the magic octree integrates the logic for dither(color) right into the quantizer code, while we do it outside (GetPaletteIndex is already querying for a dithered color).

If my assumptions are correct, adapting MagicScalers octree should be easy. All we need to change is:

@saucecontrol @JimBobSquarePants anything I'm missing?

JimBobSquarePants commented 4 years ago

@antonfirsov What about Rgba32? The Wu Quantizer supports an alpha component. As do the palette quantizers.

antonfirsov commented 4 years ago

I would focus on the GIF encoding use-case (= the current OctreeQuantizer) first, as it seems to be the most painful. The remaining use-cases can survive with EuclideanPixelMap until we find something better. (The best solution candidate is probably an extended RGBA "octree".)

As such: when alpha is present in the encoded image, we can blend with a pre-defined background color before / during quantization, and convert the resulting pixel data to Bgr24 (+ probably have a hard-coded path mapping fully transparent pixels to black).
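The blending step could look like this. A minimal straight-alpha sketch in Python; the threshold constant and default background color are assumptions for illustration, not values from ImageSharp:

```python
TRANSPARENT_THRESHOLD = 1   # hypothetical cutoff for the hard-coded black path

def matte(rgba, background=(255, 255, 255)):
    """Blend a straight-alpha RGBA pixel against an opaque background,
    producing a Bgr24-ready color; fully transparent maps to black."""
    r, g, b, a = rgba
    if a < TRANSPARENT_THRESHOLD:
        return (0, 0, 0)
    # integer blend with rounding: c*a + bc*(255-a), normalized by 255
    blend = lambda c, bc: (c * a + bc * (255 - a) + 127) // 255
    return tuple(blend(c, bc) for c, bc in zip((r, g, b), background))
```

Note this blends in gamma space for simplicity; blending in linear light (as discussed later in the thread) gives perceptually better results.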

JimBobSquarePants commented 4 years ago

Bear in mind that the quantizer can be set to any IQuantizer value. Octree is only the default because it supports exactly one transparent color.

antonfirsov commented 4 years ago

Output is always BGR with GIF, so in the GIF case we can use it as a color map with other quantizers if we want.

But I believe most users go with defaults. Having performant defaults for the most common use-cases is what we should focus on first. This will buy us time, and it will also be easier to improve the general cases based on the learnings we acquire by solving the common/default ones.

saucecontrol commented 4 years ago

@antonfirsov I think there are a few other things that would have to be changed to fit my implementation to your use case, but it's all manageable. There's already a non-dithering mapping (remap method) in place, but it has some behaviors that probably won't fit your needs. Off the top of my head:

  1. It assumes the palette always has 256 entries and that the last entry is transparent. I always add a transparent palette entry even if the original image didn't have transparency because for animated GIF my encoder does subframe encoding and keyframing with transparent pass-through to the previous frame.
    • It does thresholding for the GIF transparency with a fixed threshold. You mentioned background blending those in https://github.com/SixLabors/ImageSharp/issues/1350#issuecomment-694279342, which is something I've thought about but haven't implemented.
    • It uses the palette index 255 to mark non-leaf nodes since the transparent handling is done outside the octree. This is part of the dynamic tree depth logic since a node can be a leaf at any level. It either has a real palette index, or it has 255 which means the node has children. That's an optimization that allows me to quickly determine when I've hit the leaf level.
    • That magic number palette entry logic is part of an overall theme of packing fields to make sure I can fit an octree node into 32 bytes for space and SIMD optimization. That's the only hack I can think of for the mapping octree -- it gets uglier in the histogram octree.
  2. It uses a free list to keep track of the next available slot for allocating a new node in the rented memory slab. This would not be necessary if the octree is being built up only during the mapping stage, since it can never exceed the slab's size. I use the same logic for both the histogram and the map, so I build out a dummy sequential free list for the map.
  3. Since it always has a histogram octree to start with, it uses that to build out the mapping octree with the leaf nodes already at the optimal level. If starting from scratch (from a palette), you wouldn't know how many bits were used when generating a given palette entry, so all palette entry nodes would have to go at the lowest octree level.

That's all I can think of for now, but if you do go down that path and find anything that doesn't make sense, I can probably shed some light.

rickbrew commented 3 years ago

I have similar code in Paint.NET for image quantization when saving indexed images (8-bit, 4-bit, 2-bit, 1-bit). After palette generation, a class called PaletteTable holds the (up to) 256 color palette and has a GetClosestColor() method that does a linear search w/ Euclidean distance metric. It also uses a Dictionary for caching the results, and it does help a lot. But, memory usage is exactly what you expect: proportional to the total # of unique colors in the image. Usually it's fine and the speedup is worth it for this case, although it's still nowhere near as fast as I'd like. It still feels slow, in other words.

I researched how to better implement PaletteTable and GetClosestColor and came across something called a k-d tree: https://en.wikipedia.org/wiki/K-d_tree . It's like a binary tree that pivots on successive axes at each level of the tree (e.g. R, then G, then B). I believe this is the best data structure / algorithm for this, but haven't had the time or justification to dive into it and prove that. It should only need O(n) memory where n <= 256 (the # of colors, in other words). GetClosestColor should then be able to run in O(log2 n) time for each call.
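For reference, the k-d tree idea sketches out like this (an illustrative Python nearest-color search, not the Paint.NET code):

```python
def build_kdtree(colors, depth=0):
    """Recursively build a k-d tree over (r, g, b) tuples, pivoting on
    axis R, G, B, R, ... at successive levels."""
    if not colors:
        return None
    axis = depth % 3
    colors = sorted(colors, key=lambda c: c[axis])
    mid = len(colors) // 2
    return (colors[mid], axis,
            build_kdtree(colors[:mid], depth + 1),
            build_kdtree(colors[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (color, squared_distance) of the nearest tree entry."""
    if node is None:
        return best
    point, axis, left, right = node
    d2 = sum((p - t) ** 2 for p, t in zip(point, target))
    if best is None or d2 < best[1]:
        best = (point, d2)
    near, far = (left, right) if target[axis] <= point[axis] else (right, left)
    best = nearest(near, target, best)
    # only descend the far side if the splitting plane is closer than the best hit
    if (target[axis] - point[axis]) ** 2 < best[1]:
        best = nearest(far, target, best)
    return best
```

As noted below in the thread, the asymptotic win doesn't necessarily translate to wall-clock speed: each traversal step involves branches and scattered memory reads, which is exactly where a flat octree or brute-force SIMD search can beat it.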

antonfirsov commented 3 years ago

@rickbrew I made experiments with two KD tree implementations:

  1. One "traditional" with heap-allocated nodes, (C) @brianpopow
  2. Another one using a binary heap data structure and flat buffers in the hope of improving memory access patterns.

Unfortunately, even with 2, this turned out to be almost as slow as the linear search with Euclidean distances. O(log2 n) doesn't really help if the traversal steps are too complex (-> branch mispredictions, cache misses).

rickbrew commented 3 years ago

Thanks @antonfirsov. I suspected that the cost of the data structure and lookup wouldn't deliver the theoretical speedup (that is, O(256) vs O(log2 256) = O(8)). Good to know!

Maybe I'll play around with your implementation at some point to see if I can squeeze some better performance out of it (I only glanced through it -- it may already be maxed out!). If I do I'll post back here with my improvements.

JimBobSquarePants commented 3 years ago

This could be of interest. Appears to be faster than a KD Tree

https://github.com/spotify/annoy

https://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup

https://stackoverflow.com/questions/37105782/performance-of-annoy-method-vs-kd-tree

saucecontrol commented 3 years ago

@JimBobSquarePants following our chat a couple weeks back on Discord, I did a bit more experimentation with our quantizers, and I discovered a couple of interesting things. 1) MagicScaler was doing significantly better at generating a palette for a given image but 2) ImageSharp was doing significantly better at quantizing an image to a given palette. The net result was that MagicScaler tended to edge out ImageSharp in quality because dithering hides a great many faults, and a palette with a better gamut match will do better in either case. But in running both without dithering, and especially in trying ImageSharp's quantization with a MagicScaler-created palette, I could see the shortcomings in my mapping quality.

With that, I set about overhauling my implementation, and I ended up with something that improves the accuracy of the mappings while being just a bit faster than it was before (and in half the memory). I've also scrapped the code that preserved the histogram tree to use as the basis of the mapping tree, meaning it now builds out the mapping tree straight off the palette, which would make that code easier to adapt to your needs. The code is hopefully much easier to follow now as well, because I've split the octree nodes used for the histogram phase and the nodes for the mapping phase into separate structs. The new version is here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/OctreeQuantizer.cs

rickbrew commented 3 years ago

@JimBobSquarePants @saucecontrol I've also spent a lot of time in the Paint.NET quantization code making fixes and improvements. I've talked with Jim about this a little on Twitter, and have made my code available over at https://github.com/rickbrew/PaintDotNet.Quantization

The most important parts of my findings are in sections 1 and 7 of the README over at that repo. Section 1 is just a simple fix for the octree binning (ye ol' MSDN article bugs). It will result in more memory usage and CPU time, but should improve the quality/correctness of the palette.

Section 7 doesn't yet have a writeup as I have some other demands on my time right now. The tl;dr is that I've found a better way to map colors to palette entries that is faster and uses less memory. The code is at https://github.com/rickbrew/PaintDotNet.Quantization/blob/main/PaintDotNet/Imaging/Quantization/ProximityPaletteMap.cs and a discussion is at https://twitter.com/rickbrewPDN/status/1379238853832155136 . Pictures are really needed to properly explain the algorithm and hopefully I'll get to all that once I'm done with taxes and some other things. It should be easy to integrate into ImageSharp and/or PhotoSauce.

Benchmark results, in the PDN codebase at least, are very strongly in favor of the new "proximity map" code. Memory use is about 1MB max, as compared to basically unbounded for the linear search + Dictionary cache approach. The code isn't even that complicated, and further experimentation -- in search of an even better result -- should be straightforward.

https://twitter.com/rickbrewPDN/status/1379235212366737410

(image: benchmark results)

saucecontrol commented 3 years ago

Wow, great to see so much work going on in this area! I'll give your project a look for sure. When I was chatting with James on Discord, he shared the tweet where you pointed out the shift math error in that MSDN code. That's certainly contributing to the ImageSharp palette quality today.

I've got my quantizer down to a max of 87KB of memory now, and I built an AVX2 implementation of nearest-distance matching against the palette, to make the brute-force search through the whole palette fast when filling in missing nodes in my octree. I subsample large images to a max of 4 megapixels when building the histogram, so the numbers aren't directly comparable, but here's what mine look like on the worst-case image (all 24-bit colors), mapped to 5 bits of accuracy on my 4-year-old laptop 😄 :

(image: benchmark results)
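The brute-force fallback vectorizes well because the distance to each palette entry is independent. Here's a scalar Python model of what the AVX2 kernel computes (the function name and lane structure are mine, for illustration only):

```python
def nearest_index_lanes(palette, color):
    """Squared distance to every palette entry, reduced to the argmin.
    The inner list comprehension models one 8-wide vector iteration;
    AVX2 would compute those 8 distances with a few instructions."""
    r, g, b = color
    best = (1 << 30, 0)                               # (distance, index)
    for base in range(0, len(palette), 8):            # one "vector" per iteration
        lane = palette[base:base + 8]
        dists = [(pr - r) ** 2 + (pg - g) ** 2 + (pb - b) ** 2
                 for pr, pg, pb in lane]
        best = min(best, min((d, base + i) for i, d in enumerate(dists)))
    return best[1]
```

With at most 256 entries, that's only 32 vector iterations per miss, which is why the brute-force path stops being the bottleneck once the octree caches the result.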

saucecontrol commented 3 years ago

BTW, @rickbrew, since you're in the high-perf C# clan, I'd suggest you join us over in the #lowlevel channel on the C# Discord. Tons of good chat and info on there.

JimBobSquarePants commented 3 years ago

Pop quiz. @saucecontrol What does your Octree output for this image look like?

(image: blur)

I notice that @rickbrew encoding this image with Paint.NET leads to blending with a white background. (image: blur-pdn)

My local Octree quantizer currently produces output which simply drops the alpha component. (image: QuantizeImageShouldPreserveMaximumColorPrecision_Rgba32_blur_Octree)

While Wu can deliver far better results, since it can handle multiple transparent colors (which PNG can support). (image: QuantizeImageShouldPreserveMaximumColorPrecision_Rgba32_blur_Wu)

saucecontrol commented 3 years ago

Mine currently outputs this, because it has a hard-coded alpha threshold of 33.33% (85/255 to be precise). (image: magicscaler)

With a lower threshold, it would look more like your octree output. I currently don't have a way to control that because I didn't build a proper configuration API for codecs -- you can really only configure JPEG in the current API. I'm presently working on fixing that, at which point I'll also allow matting against a background color, along the lines of what PDN is doing there (with the color configurable, of course). Those two configs combined are the best solution for GIF, same as Photoshop et al. give you.

I thought a bit more about using the octree for variable transparency like PNG supports. My original thought was to extend the octree to a hexadecitree, but that really expands the memory requirements, and I think it might be better to handle alpha completely independently so that you can limit the number of partially transparent colors when they appear only on edges. That's on my list to explore when I get more time.

JimBobSquarePants commented 3 years ago

Thanks @saucecontrol

I could, fairly easily, build background blending into the current architecture but I think I'll leave that for the future. I'm going to be pushing a PR very soon that improves our memory usage though.

saucecontrol commented 3 years ago

Cool, looking forward to seeing what you've come up with!

For reference, here's what my quantizer does with a 1/255 alpha threshold:

(image: lowthresh)

And matted against white. It looks lighter than the browser-blended and PDN output above because I do linear-light blending with the background when matting. Not so great in this case, but it avoids the red+green=brown problem.

(image: whitebg)
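Linear-light matting works roughly like this (a sketch using the standard sRGB transfer functions, not the MagicScaler code): convert both foreground and background to linear light, blend by alpha, then convert back.

```python
def srgb_to_linear(c):
    """8-bit sRGB component -> linear light in [0, 1] (IEC 61966-2-1)."""
    c /= 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    """Linear light in [0, 1] -> 8-bit sRGB component."""
    c = 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
    return round(c * 255.0)

def matte_linear(rgba, background):
    """Matte a straight-alpha RGBA pixel against an opaque background,
    blending in linear light to avoid gamma-space artifacts."""
    *rgb, a = rgba
    t = a / 255.0
    return tuple(linear_to_srgb(srgb_to_linear(c) * t + srgb_to_linear(bc) * (1 - t))
                 for c, bc in zip(rgb, background))
```

Blending in linear light pushes midtone results brighter than the naive gamma-space average, which is exactly why the matted output above looks lighter than the browser-blended version.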

Just need to get that config API built... :)

(Edit: updated first image to match ImageSharp's threshold value)