dropbox / lepton

Lepton is a tool and file format for losslessly compressing JPEGs by an average of 22%.
https://blogs.dropbox.com/tech/2016/07/lepton-image-compression-saving-22-losslessly-from-images-at-15mbs/
Apache License 2.0

Compress without decompressing the whole JPEG #84

Open StephanBusch opened 7 years ago

StephanBusch commented 7 years ago

There must be some workaround for Lepton's currently high memory requirements. I guess solutions like PAQ and WinZip can compress without first decompressing the whole JPEG into memory. Their memory requirement is pretty much the same no matter how big the input file is.

In PAQ7 source, Matt wrote: "Files are further compressed by partially uncompressing back to the DCT coefficients to provide context for the next Huffman code."

Does anyone have an idea how this partial decompression could be implemented in Lepton?

danielrh commented 7 years ago

Hi Stephan... actually, it may work with some tiny tweaks. Right now you can already pass the command-line flags -startbyte=X -trunc=Y, and it should only allocate the memory needed to store the data between those offsets (unless there's a bug). So if you were willing to split up your JPEG into N pieces, with a Lepton file for each chunk, it may work as is. Of course, ideally we could simply teach Lepton to decode those pieces without re-invoking it on each chunk, but you could theoretically tar it up at that point.
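A minimal sketch of the per-chunk invocation described above, written in Python. It only builds the command lines and does not run them. Assumptions not confirmed in this thread: that `-trunc` takes an absolute end offset (inferred from "between those offsets"), that the binary is called `lepton` and takes `input output` as positional arguments, and the hypothetical `.NNNN.lep` output naming.

```python
from typing import List, Tuple


def chunk_ranges(file_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into consecutive (start, end) byte ranges."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]


def lepton_commands(
    jpeg_path: str, file_size: int, chunk_size: int, lepton_bin: str = "lepton"
) -> List[List[str]]:
    """Build one argv list per chunk; each would produce one Lepton file."""
    commands = []
    for index, (start, end) in enumerate(chunk_ranges(file_size, chunk_size)):
        # Hypothetical output name: one .lep file per chunk, ordered by index.
        out_path = f"{jpeg_path}.{index:04d}.lep"
        commands.append(
            [
                lepton_bin,
                f"-startbyte={start}",
                # Assumption: -trunc is an absolute end offset, not a length.
                f"-trunc={end}",
                jpeg_path,
                out_path,
            ]
        )
    return commands
```

Each argv list could then be handed to something like `subprocess.run`, one invocation per chunk. If `-trunc` turns out to be a length rather than an offset, only the one marked line needs to change.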

The missing gaps here are: a) Ideally it would spit out the whole JPEG at once. Right now the algorithm does O(N^2) I/O for a file of N chunks, since you need to feed the data into N compression operations, one per chunk. Though it simply ignores the data outside of the range, it does need that data to track the bit offsets and pixel locations within the JPEG. Of course, JPEGs with restart markers would make it possible to do this rather quickly.

b) Reassembling the file from the multiple pieces: we'd need some sort of meta-archive format to contain each piece of the file so it could be reassembled into a whole.
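The two gaps above can be sketched roughly as follows. `total_bytes_read` makes the O(N^2) I/O estimate concrete under the assumption that every one of the N invocations scans the whole file. `pack_pieces` and `reassemble` use a plain tar archive as a stand-in for the meta-archive format. The member naming scheme and the `decode` callback are hypothetical, and the sketch assumes that decoded pieces concatenate in order back into the original JPEG.

```python
import io
import tarfile
from typing import Callable, List


def total_bytes_read(n_chunks: int, chunk_size: int) -> int:
    """Upper bound on input bytes scanned: N invocations over an N-chunk file."""
    file_size = n_chunks * chunk_size
    return n_chunks * file_size  # grows as N^2 for a fixed chunk size


def pack_pieces(pieces: List[bytes]) -> bytes:
    """Store each compressed piece as an ordered member of an in-memory tar."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for index, payload in enumerate(pieces):
            # Zero-padded index so that sorting by name restores chunk order.
            info = tarfile.TarInfo(name=f"piece-{index:06d}.lep")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def reassemble(
    archive: bytes, decode: Callable[[bytes], bytes] = lambda data: data
) -> bytes:
    """Decode every piece in name order and concatenate the results."""
    output = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            piece = tar.extractfile(member)
            if piece is None:  # skip anything that is not a regular file
                continue
            output.write(decode(piece.read()))
    return output.getvalue()
```

In practice `decode` would wrap a Lepton decompression call per piece. The identity default is only there so the container logic can be exercised on its own.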

Dropbox doesn't need this technology because right now we always store data in chunks of no more than 4 MiB. This means that no individual JPEG piece ever exceeds 4 MiB.