support content-type specific compression/decompression, like jpeg xl?

borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.

https://www.borgbackup.org/


support content-type specific compression/decompression, like jpeg xl? #8092

Open ThomasWaldmann opened 9 months ago

ThomasWaldmann commented 9 months ago

Let's discuss here whether and how borg could support this, assuming there is a jpeg xl library (with a python / cython binding) that supports bit-identical compression (transformation to jpeg xl format) and decompression (transformation back to the original file).

Notable:

borg usually works on CHUNKS (pieces of files, as output by the borg buzhash or fixed chunker): file data -> chunk -> compress -> encrypt/auth -> store
borg usually compresses all chunks in the same way, using the same compression algorithm, e.g. zstd or lz4
in the past we already tried once to implement file-type specific compression, but we abandoned that because of the configuration hassle and went with an "auto" compressor that is simpler to use and does not need configuration per file-type.
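To make the chunk-level pipeline above concrete, here is a minimal sketch in Python. It is not borg's code: the chunk size, the one-byte tags, the `store_file` / `load_file` names and the dict used as a repository are all made up for illustration, zlib stands in for zstd/lz4, the "keep it only if it got smaller" rule is a simplified stand-in for an "auto" style decision, and encryption is left out entirely (only an HMAC is computed).

```python
import hashlib
import hmac
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk size, not a borg default


def fixed_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size pieces of the file data."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def compress_auto(chunk: bytes) -> bytes:
    """Keep the compressed form only if it is actually smaller."""
    packed = zlib.compress(chunk)
    if len(packed) < len(chunk):
        return b"Z" + packed  # tag byte: zlib-compressed
    return b"N" + chunk       # tag byte: stored uncompressed


def chunk_id(key: bytes, chunk: bytes) -> bytes:
    """Keyed hash of the plaintext chunk, used as its id and for verification."""
    return hmac.new(key, chunk, hashlib.sha256).digest()


def store_file(data: bytes, key: bytes, repo: dict) -> list:
    """file data -> chunk -> compress -> authenticate -> store."""
    ids = []
    for chunk in fixed_chunks(data):
        cid = chunk_id(key, chunk)
        if cid not in repo:  # deduplication: each distinct chunk is stored once
            repo[cid] = compress_auto(chunk)
        ids.append(cid)
    return ids


def load_file(ids: list, key: bytes, repo: dict) -> bytes:
    """Reverse the pipeline and verify every chunk against its id."""
    parts = []
    for cid in ids:
        blob = repo[cid]
        chunk = zlib.decompress(blob[1:]) if blob[:1] == b"Z" else blob[1:]
        if not hmac.compare_digest(chunk_id(key, chunk), cid):
            raise ValueError("chunk failed authentication")
        parts.append(chunk)
    return b"".join(parts)
```

The point for this discussion is that `compress_auto` only ever sees an anonymous piece of bytes. It has no idea which file the chunk came from or where in that file it sits.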

knutov commented 9 months ago

It's a good possibility to improve compression in some cases, but:

it requires a lot of CPU, so it should definitely be optional and disabled by default
there seem to be a lot of tasks with higher priority, like a stable v2 release

alexandervlpl commented 9 months ago

I didn't realize it was file data -> chunk -> compress, that rules out any simple implementation. JXL is not a compression algorithm like lz4 that takes any bytes you throw at it. If you're starting with a JPEG it needs a complete file with a header and all the pixels.
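A small Python illustration of that mismatch, under loud assumptions: `fake_jpeg` is just the JPEG start-of-image and end-of-image markers around filler bytes (not a decodable image), the chunk size is arbitrary, and `looks_like_complete_jpeg` is a hypothetical, very rough check rather than a real parser.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker that opens a JPEG file
JPEG_EOI = b"\xff\xd9"  # end-of-image marker that closes it


def looks_like_complete_jpeg(blob: bytes) -> bool:
    """Very rough check: the blob starts with SOI and ends with EOI."""
    return blob.startswith(JPEG_SOI) and blob.endswith(JPEG_EOI)


def split_fixed(data: bytes, size: int) -> list:
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Stand-in for a JPEG file: markers plus filler, not a real image.
fake_jpeg = JPEG_SOI + b"\x00" * 1000 + JPEG_EOI

chunks = split_fixed(fake_jpeg, 256)
usable = [looks_like_complete_jpeg(c) for c in chunks]
```

The whole file passes the check, but none of the individual chunks does: the first one has the header without the end, the last one has the end without the header, and the ones in between have neither. A per-chunk compressor would be handed exactly such fragments.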

I could probably write a separate "chunker" not just for JPEG, but for all image formats supported by Pillow. Split the image into raw tiles (chunks) of the size you need and then compress each chunk as a separate, lossless JXL image. There's a Pillow JXL plugin with lossless support. Additionally, to achieve a bit-identical reversal of the entire process, the original image header (EXIF metadata, etc.) will need to be stored in a separate chunk and reconstituted.
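Roughly, the tiling part could look like the sketch below. To keep it runnable without extra packages it works on a raw, row-major pixel buffer instead of a Pillow image, and zlib is only a placeholder for "encode this tile as a lossless JXL image". The function names and the tuple layout are hypothetical, and the header/metadata chunk is not handled at all.

```python
import zlib


def split_tiles(pixels: bytes, width: int, height: int, tile: int, bpp: int = 3) -> list:
    """Cut a row-major pixel buffer into tiles; edge tiles may be smaller."""
    tiles = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            w = min(tile, width - left)
            h = min(tile, height - top)
            rows = []
            for y in range(h):
                start = ((top + y) * width + left) * bpp
                rows.append(pixels[start:start + w * bpp])
            # zlib is a stand-in for a lossless per-tile JXL encode
            tiles.append((left, top, w, h, zlib.compress(b"".join(rows))))
    return tiles


def join_tiles(tiles: list, width: int, height: int, bpp: int = 3) -> bytes:
    """Decode every tile and copy its rows back into place."""
    out = bytearray(width * height * bpp)
    for left, top, w, h, packed in tiles:
        raw = zlib.decompress(packed)
        for y in range(h):
            start = ((top + y) * width + left) * bpp
            out[start:start + w * bpp] = raw[y * w * bpp:(y + 1) * w * bpp]
    return bytes(out)
```

Note that this only round-trips the pixel data. Getting the original file back byte for byte is a separate problem on top of it, which is where the stored header chunk comes in.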

That's a lot of extra complexity.
@knutov, in addition to CPU usage, this will be very memory-intensive for large images.

Seems like it's not worth it?

FabioPedretti commented 9 months ago

This discussion reminds me of some of the arguments detailed here: https://www.nongnu.org/lzip/xz_inadequate.html (I read it years ago and don't remember the details, but the point was to keep archive formats simple in order to minimize issues.)

ThomasWaldmann commented 9 months ago

If we don't come up with a good/easy solution, an alternate way to use jpeg xl is of course that the users convert their photos to that format at the primary storage location.

If there is an easy transformation back to the original format, that seems the better idea anyway because then it also uses less storage at the primary location. Only issue could be that the tools preferred by the users do not (yet) read/display that format.

alexandervlpl commented 9 months ago

> an alternate way to use jpeg xl is of course that the users convert their photos to that format at the primary storage location.

That's what I do, I use the official CLI tools to encode/decode as needed before/after running borg.
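For anyone who wants to script that, a rough Python wrapper might look like this. It assumes the `cjxl` and `djxl` tools from libjxl are on the PATH and that `cjxl` transcodes JPEG input losslessly (as far as I know that is the default for JPEG input in recent releases, but check your installed version). The helper names are made up for this example.

```python
import hashlib
import subprocess
from pathlib import Path


def encode_cmd(jpeg: Path, jxl: Path) -> list:
    """Command line for transcoding a JPEG to JXL with cjxl."""
    return ["cjxl", str(jpeg), str(jxl)]


def decode_cmd(jxl: Path, jpeg: Path) -> list:
    """Command line for reconstructing the JPEG with djxl."""
    return ["djxl", str(jxl), str(jpeg)]


def sha256(path: Path) -> str:
    """Hex digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def roundtrip_is_bit_identical(jpeg: Path, workdir: Path) -> bool:
    """Encode, decode again, and compare hashes with the original file."""
    jxl = workdir / (jpeg.stem + ".jxl")
    restored = workdir / (jpeg.stem + ".restored.jpg")
    subprocess.run(encode_cmd(jpeg, jxl), check=True)
    subprocess.run(decode_cmd(jxl, restored), check=True)
    return sha256(jpeg) == sha256(restored)
```

Running `roundtrip_is_bit_identical` on each photo before deleting the original JPEG is a cheap way to confirm the reversal really is bit-identical for your files and tool version.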

> Only issue could be that the tools preferred by the users do not (yet) read/display that format.

This is the real problem. Adoption has stalled. Currently, to browse thumbnails and open the images, you pretty much need to be on Linux and to compile something like gThumb yourself. 0.000001% of users will do this, and it looks like that won't change.

So I was hoping JXL could at least have a future as an archive format used internally by tools like borg. In my case it already saves me 50+ GB of space and bandwidth, so it would be very useful to make that available to everyone.