DB optimized for a bunch of PNG images.
The idea is to split each PNG image into many blocks and store each block in a DB. If several blocks are equal, that block is only stored once. A hash table makes the lookup of such blocks fast.
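Conceptually, the deduplication works like a content-addressed store: a block is keyed by a hash of its content, so an equal block is only stored once. Below is a minimal C++ sketch of that idea, not the actual implementation (the class and function names are made up; a real DB also has to deal with hash collisions, e.g. by using a strong hash or comparing block contents):

    // Sketch only: content-addressed block storage with deduplication.
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct BlockStore {  // hypothetical name, for illustration
        // hash of block content -> block data (stored once)
        std::unordered_map<uint64_t, std::vector<uint8_t>> blocks;

        // Store a block (no-op if an equal block already exists) and
        // return the key under which it can be looked up.
        uint64_t put(const std::vector<uint8_t>& data) {
            uint64_t key = std::hash<std::string>{}(
                std::string(data.begin(), data.end()));
            // Note: a real implementation must handle hash collisions.
            blocks.emplace(key, data);
            return key;
        }

        const std::vector<uint8_t>& get(uint64_t key) const {
            return blocks.at(key);
        }
    };

An image is then stored as a list of such block keys plus the metadata needed to reassemble it.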
I am collecting screenshots (for several reasons; one is to play around with machine learning / computer vision; one example is here: https://github.com/albertz/screenshooting). A lot of them. :)
Right now, I have about 88k screenshots, totaling about 77GB. And as many of them have a lot of repetitive areas (on some days, I was taking a screenshot every 10 seconds, even when not using the computer at all, so the only changing part was the time display), I didn't want to waste so much space on so much repetitive data.
With this PNG DB, I get a compression ratio of about 4-5x (measured on the first 1k screenshots or so; the ratio will probably be even higher over all of them).
This screenshot example is probably an extreme case (where this approach works extremely well). But I guess in many other cases where you are collecting a huge number of PNG images (with computer-generated content; real-world photos would not work that well), you can save some space with it.
And if this gets optimized as far as possible, it may be even faster than normal filesystem access (because of less disk IO).
To make things easier on the PNG side, it just parses the PNG down until it gets the scanline serialization. Multiple directly following parts of scanline serializations (each of the same width) build up a block, so a block really corresponds to a rectangular block in the actual picture. But I don't do any of the PNG filtering. PNG spec: http://www.w3.org/TR/PNG/
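To illustrate the block splitting, here is a minimal C++ sketch under assumed block dimensions (the block size and the function name are not from the actual implementation): it takes the raw, unfiltered scanlines and cuts out rectangular blocks of BLOCK_W x BLOCK_H pixels.

    // Sketch only: cut raw scanlines into rectangular pixel blocks.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static const size_t BLOCK_W = 64;  // block width in pixels (assumed)
    static const size_t BLOCK_H = 64;  // block height in scanlines (assumed)

    // scanlines: raw (unfiltered) rows, each width*bytesPerPixel bytes long.
    std::vector<std::vector<uint8_t>> cutIntoBlocks(
            const std::vector<std::vector<uint8_t>>& scanlines,
            size_t width, size_t bytesPerPixel) {
        std::vector<std::vector<uint8_t>> blocks;
        for (size_t y0 = 0; y0 < scanlines.size(); y0 += BLOCK_H) {
            for (size_t x0 = 0; x0 < width; x0 += BLOCK_W) {
                std::vector<uint8_t> block;
                size_t yEnd = std::min(y0 + BLOCK_H, scanlines.size());
                size_t xEnd = std::min(x0 + BLOCK_W, width);
                for (size_t y = y0; y < yEnd; ++y) {
                    const uint8_t* row = scanlines[y].data();
                    // Append the horizontal slice [x0, xEnd) of this row.
                    block.insert(block.end(),
                                 row + x0 * bytesPerPixel,
                                 row + xEnd * bytesPerPixel);
                }
                blocks.push_back(std::move(block));
            }
        }
        return blocks;
    }

Each resulting block can then be stored (deduplicated) in the DB as described above.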
The general DB layout is as follows:
Such a data value (uncompressed) starts with a data-type byte. There are currently only 3 types:
There are multiple DB backend implementations:
In the beginning, I thought about using some generic image library to be able to handle any image type and then operate just on the raw data. This would even give me a slightly better compression rate because right now, I am operating on PNG scanline serializations, and there are 5 different ways (filters) in PNG to represent a scanline.
However, because I am storing all the data as raw PNG data in the DB, the reconstruction of the PNG should be much faster. In the more generic case, I would have to recompress/reencode the whole PNG. Now I only have to (roughly) collect and glue the parts together and run the PNG zlib compression over them.
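As a rough sketch of that reconstruction step (assuming zlib; chunk headers and CRCs are omitted, and the function name is made up): the raw scanlines collected from the DB are concatenated, each prefixed with filter type 0 ("None"), and compressed with zlib to form the IDAT payload.

    // Sketch only: glue raw scanlines together and recompress with zlib.
    #include <cstdint>
    #include <vector>
    #include <zlib.h>

    std::vector<uint8_t> buildIdatPayload(
            const std::vector<std::vector<uint8_t>>& scanlines) {
        // Serialize: filter byte 0 + raw scanline bytes, for every row.
        std::vector<uint8_t> raw;
        for (const auto& line : scanlines) {
            raw.push_back(0);  // PNG filter type "None"
            raw.insert(raw.end(), line.begin(), line.end());
        }
        // Compress with zlib; this becomes the IDAT chunk data.
        uLongf destLen = compressBound(raw.size());
        std::vector<uint8_t> compressed(destLen);
        compress(compressed.data(), &destLen, raw.data(), raw.size());
        compressed.resize(destLen);
        return compressed;
    }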
Using a general deduplicator / compressor on the raw data (uncompressed PNG, TGA or BMP): such a tool works on connected chunks of data, which in an image means one or more consecutive full scanlines. What I am doing is based on rectangular blocks in the image instead, so I can capture much bigger chunks of repetitive data. For example, if only a small clock display changes between two screenshots, every scanline crossing it becomes new data for a linear deduplicator, while block-based splitting only invalidates the few blocks covering that area.
Something like what is done in video compression methods like H.264: this might actually be a very good idea. And it should be possible to add it on top of my current method.
It comes with several tools. Some of them:
Just run ./compile.sh
For Mac: If you don't have MacFUSE installed, install it from here: http://code.google.com/p/macfuse/
-Albert Zeyer, http://www.az2000.de