evaluate article about xz/lzma in context of borg backup

ThomasWaldmann commented 6 years ago

http://www.nongnu.org/lzip/xz_inadequate.html

ThomasWaldmann commented 6 years ago

borg uses lzma as offered by Python's lzma standard library module IF you use -C lzma.

https://docs.python.org/3/library/lzma.html

https://github.com/borgbackup/borg/blob/1.1.5/src/borg/compress.pyx#L190 (note: FORMAT_XZ (default) and CHECK_NONE)

All comments in the article about bad/missing error (or tampering) detection by xz/lzma are largely irrelevant for borg, because borg usually authenticates the data first, then decrypts, then decompresses. Any random or malicious data modification would make authentication fail, and borg would not even try to decrypt or decompress in that case. This is the reason why we give CHECK_NONE to lzma.

The only exception is when you use borg without authentication, but even then we still have the content hash and a crc32 on the storage layer.
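For illustration, here is a minimal sketch of the kind of call described above, using only Python's standard lzma module. The helper names, the sample payload and the preset value are made up for this example and are not borg's actual code.

```python
import lzma


def compress_xz_no_check(data: bytes, level: int = 6) -> bytes:
    """Compress into an .xz container that carries no integrity check field."""
    return lzma.compress(data, format=lzma.FORMAT_XZ,
                         check=lzma.CHECK_NONE, preset=level)


def decompress_xz(blob: bytes) -> bytes:
    # FORMAT_AUTO (the default) recognises the .xz container.
    return lzma.decompress(blob)


# Hypothetical payload standing in for a borg chunk.
sample = b"hypothetical chunk payload " * 1000
without_check = compress_xz_no_check(sample)
with_crc64 = lzma.compress(sample, format=lzma.FORMAT_XZ,
                           check=lzma.CHECK_CRC64, preset=6)

print(len(with_crc64) - len(without_check), "bytes saved by omitting the check")
```

The saving per chunk is small, but it is paid for every chunk in a repository, and the check would duplicate what authentication already provides.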

ThomasWaldmann commented 6 years ago

(some of the points raised in that article, e.g. about error checking, seem to be interesting in case we ever redesign the borg repo data structures, thus I am labelling this for repository and breaking)

ThomasWaldmann commented 2 years ago

It looks like FORMAT_RAW + CHECK_NONE would be adequate for borg. Not sure about the filters / other params.

Considering we already have --compression=lzma for FORMAT_XZ, I guess we would call the new variant --compression=lzma1.

Needs some experimenting whether the saved overhead is worth it.
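A rough way to measure the saved overhead, assuming an LZMA2 filter chain at preset 6 (the filter choice is exactly the open question here, so treat `RAW_FILTERS` and the helper names as placeholders):

```python
import lzma

# Assumed filter chain for the experiment; not a decision made in this thread.
RAW_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]


def compress_raw(data: bytes) -> bytes:
    """Compress without any container; the filters must be known to the reader."""
    return lzma.compress(data, format=lzma.FORMAT_RAW, filters=RAW_FILTERS)


def decompress_raw(blob: bytes) -> bytes:
    # FORMAT_RAW streams carry no header, so the same filters are passed again.
    return lzma.decompress(blob, format=lzma.FORMAT_RAW, filters=RAW_FILTERS)


def container_overhead(data: bytes) -> int:
    """Bytes that the .xz container (without check) adds over the raw stream."""
    xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ,
                            check=lzma.CHECK_NONE, preset=6)
    return len(xz_blob) - len(compress_raw(data))


sample = b"hypothetical chunk payload " * 1000
print("overhead per chunk:", container_overhead(sample), "bytes")
```

Running `container_overhead` over real chunks of different sizes would show how much the container costs relative to the typical chunk size.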

We need to keep the --compression=lzma so that already compressed data still works (and does not need to get recompressed).

@elho any opinion about this?

ThomasWaldmann commented 2 years ago

https://stackoverflow.com/questions/3057171/lzma-compression-settings-details

https://sevenzip.osdn.jp/chm/cmdline/switches/method.htm

elho commented 2 years ago

Uh, I vaguely remember having read that article on the inadequacies of xz at some point, unrelated to borg; my takeaway was that lzma is the better choice and that I should stay away from xz on the command line.

Python's lzma module, with the synopsis "Compression using the LZMA algorithm", in reality wrapping data into xz containers by default still comes as a bad surprise.

> It looks like FORMAT_RAW + CHECK_NONE would be adequate for borg. Not sure about the filters / other params.

Yes, the former seems to be the way to go. The parameters allow for a lot of fiddling; given the upper bound on chunk sizes, there may even be room for optimization. Either some experimental code in a separate branch to play with those, or even something along the lines of pngcrush -brute run over some small (enough to be feasible) repos, could give an idea of what could work.
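A brute-force pass in that spirit could look roughly like the sketch below. It only varies the lc/lp/pb options of an LZMA2 filter on top of a preset; the function name and the choice of options to search are assumptions for illustration, and a real experiment would feed it actual chunks from a repo.

```python
import itertools
import lzma


def brute_force_lzma2(sample: bytes, preset: int = 6):
    """Return (size, lc, lp, pb) of the smallest raw LZMA2 encoding found."""
    best = None
    for lc, lp, pb in itertools.product(range(5), range(5), range(5)):
        if lc + lp > 4:  # the lzma docs require lc + lp to be at most 4
            continue
        filters = [{"id": lzma.FILTER_LZMA2, "preset": preset,
                    "lc": lc, "lp": lp, "pb": pb}]
        blob = lzma.compress(sample, format=lzma.FORMAT_RAW, filters=filters)
        # Make sure every candidate setting still round-trips.
        restored = lzma.decompress(blob, format=lzma.FORMAT_RAW, filters=filters)
        if restored != sample:
            raise RuntimeError("round trip failed for %r" % (filters,))
        candidate = (len(blob), lc, lp, pb)
        if best is None or candidate < best:
            best = candidate
    return best


sample = b"hypothetical chunk payload 0123456789 " * 500
print("best (size, lc, lp, pb):", brute_force_lzma2(sample))
```

Note that any non-default parameters found this way would have to be stored or hard-coded, because a RAW stream does not record them.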

As for the name to use with the --compression parameter, I would probably prefer rlzma or lzmar to reflect the RAW format part, so that users do not end up confusing the old lzma with LZMA2 given that the other is called lzma1. I guess a borg 3.0 at the latest could treat lzma as the new variant again. :slightly_smiling_face:

elho commented 2 years ago

The above approach with a "new" compressor allows recreate --recompress to work without any modification to upgrade a repository from XZ to RAW format LZMA. That however also means fully decompressing and then recompressing each chunk.

> We need to keep the --compression=lzma so that already compressed data still works (and does not need to get recompressed).

With above approach yes, ~but in general no, not necessarily~.

Edit: Scratch the latter. Even though lzma.decompress defaults to FORMAT_AUTO, that mode unfortunately cannot handle the RAW format according to the documentation. So the hope of existing borg 1.x being able to handle a simple switch in format within the same lzma compressor dies at that point. :slightly_frowning_face:
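The limitation is easy to reproduce with the standard library alone. In this sketch the filter chain and payload are assumptions picked for the demonstration:

```python
import lzma

# Assumed filter chain for the raw stream.
FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]


def auto_can_decode(blob: bytes) -> bool:
    """Report whether lzma.decompress() with its default format accepts blob."""
    try:
        lzma.decompress(blob)  # format defaults to FORMAT_AUTO
    except lzma.LZMAError:
        return False
    return True


payload = b"hypothetical chunk payload " * 500
xz_blob = lzma.compress(payload, format=lzma.FORMAT_XZ, check=lzma.CHECK_NONE)
raw_blob = lzma.compress(payload, format=lzma.FORMAT_RAW, filters=FILTERS)

print("xz via FORMAT_AUTO: ", auto_can_decode(xz_blob))
print("raw via FORMAT_AUTO:", auto_can_decode(raw_blob))
```

The raw stream only decodes when format=lzma.FORMAT_RAW and the matching filters are passed explicitly, which an unmodified decompressor would not do.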

Still, even with needing two different borg compressors for the formats, a recreate --recompress that switches any XZ LZMA data to RAW without touching the actual compression, by just unwrapping the XZ container around the RAW data, would be conceivable.
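To make the unwrapping idea concrete, here is a sketch that extracts the LZMA2 payload from an .xz stream by following the field layout of the .xz file format. It assumes a single stream with exactly one block and LZMA2 as the only filter (what lzma.compress with a preset is expected to produce), and it does not verify any of the CRC32 fields. All names are hypothetical; this is not borg code.

```python
import lzma
import struct

XZ_MAGIC = b"\xfd7zXZ\x00"
CHECK_SIZES = {0: 0, 1: 4, 4: 8, 10: 32}  # none, crc32, crc64, sha256


def _varint(buf: bytes, pos: int):
    """Decode one .xz multibyte integer, return (value, next_position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def unwrap_single_block_xz(blob: bytes):
    """Return (raw_lzma2_payload, dict_size) without recompressing anything."""
    if blob[:6] != XZ_MAGIC or blob[-2:] != b"YZ":
        raise ValueError("not an .xz stream")
    check_size = CHECK_SIZES[blob[7] & 0x0F]
    # Stream footer: crc32, backward size, stream flags, magic (12 bytes).
    stored = struct.unpack_from("<I", blob, len(blob) - 8)[0]
    index_start = len(blob) - 12 - (stored + 1) * 4
    if blob[index_start] != 0x00:
        raise ValueError("index not found")
    records, pos = _varint(blob, index_start + 1)
    if records != 1:
        raise ValueError("only single-block streams are handled")
    unpadded_size, pos = _varint(blob, pos)  # header + data + check
    # Block header starts right after the 12-byte stream header.
    header_size = (blob[12] + 1) * 4
    flags = blob[13]
    if flags & 0x03:
        raise ValueError("more than one filter")
    pos = 14
    if flags & 0x40:  # optional compressed size field
        _, pos = _varint(blob, pos)
    if flags & 0x80:  # optional uncompressed size field
        _, pos = _varint(blob, pos)
    filter_id, pos = _varint(blob, pos)
    props_size, pos = _varint(blob, pos)
    if filter_id != lzma.FILTER_LZMA2 or props_size != 1:
        raise ValueError("expected a single LZMA2 filter")
    bits = blob[pos] & 0x3F
    dict_size = 0xFFFFFFFF if bits == 40 else (2 | (bits & 1)) << (bits // 2 + 11)
    payload = blob[12 + header_size:12 + unpadded_size - check_size]
    return payload, dict_size


def decompress_unwrapped(payload: bytes, dict_size: int) -> bytes:
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return lzma.decompress(payload, format=lzma.FORMAT_RAW, filters=filters)


data = b"hypothetical chunk payload " * 1000
xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_NONE)
raw_payload, dict_size = unwrap_single_block_xz(xz_blob)
print(len(xz_blob), "->", len(raw_payload), "bytes, dict_size", dict_size)
```

The dictionary size recovered from the block header would still need to be stored somewhere or fixed by convention, since the RAW stream itself does not carry it.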

ThomasWaldmann commented 2 years ago

Note: Cheaterman on IRC noted that lzma -1 (== lzma, level 1) is quite nice.

Cheaterman commented 2 years ago

The reason I noted that BTW: https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

Feel free to draw your own conclusions of course! Mine was simply that lzma -1 seems to take impressively little time compared to the pretty nice compression ratio it yields. :-)
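For anyone who wants to repeat the comparison on their own data, a small timing sketch follows. It uses the presets of Python's lzma module rather than the lzma command line tool from the benchmark, so the numbers are only indicative; the function name and sample data are made up.

```python
import lzma
import time


def measure_presets(data: bytes, presets=(1, 6, 9)):
    """Return {preset: (seconds, compressed_size)} for .xz output without check."""
    results = {}
    for preset in presets:
        start = time.perf_counter()
        blob = lzma.compress(data, format=lzma.FORMAT_XZ,
                             check=lzma.CHECK_NONE, preset=preset)
        results[preset] = (time.perf_counter() - start, len(blob))
    return results


# Hypothetical sample; real chunk data will behave differently.
sample = b"hypothetical chunk payload 0123456789 " * 20000
for preset, (seconds, size) in measure_presets(sample).items():
    print("preset %d: %.3fs, %d -> %d bytes" % (preset, seconds, len(sample), size))
```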

ThomasWaldmann commented 1 year ago

Note: in case we want to change anything about how we deal with lzma compression, it has to be finished by 2.0rc1 at the latest.

borg2 is breaking repo compatibility, so archives need to be transferred from old repos to new repos. That is the only opportunity we have to change compression-related details.

otherwise: just close this at 2.0rc1.

borgbackup / borg

evaluate article about xz/lzma in context of borg backup #3813