fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Feature request: 2-pass/block-based archive mode. #119

Closed. ghost closed this issue 3 months ago

ghost commented 3 months ago

Long-time ZPAQ user here.

Since zpaq already splits the data into blocks and compresses them separately, it would be good to keep the blocks in separate files. It would be similar to borgbackup, but without dedicated repo folders that hold multiple archives.

Details:

Compression pass 1: de-duplicate the data from the source and sort it into uncompressed blocks (based on the -mxx block size specified; it does not have to be an exact size as in #75), each stored in a separate file, plus an index recording which files are in which blocks. The output would be a single ZPAQ index file (output.zpaq.index) and a directory holding the blocks (like output.zpaq.d/12345.block).

Compression pass 2: compress each block in the directory in streamed mode, in parallel. There would also be a subcommand to compress those block files directly, regardless of whether they are part of an archive (which would double as a traditional compression tool like gzip).

Decompression: look up in the index which blocks are required for the files to be extracted. If every block is there, decompress in parallel; if some blocks are missing, print the missing ones and exit.
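To make the decompression check concrete, here is a rough Python sketch (illustration only: the JSON index layout, the file names, and the missing_blocks helper are all hypothetical, not something zpaq or zpaqfranz defines):

```python
# Hypothetical sketch of the proposed "which blocks are missing?" check.
# Assumes a made-up companion index (JSON) that maps each archived file
# name to the IDs of the blocks holding its deduplicated data; nothing
# here is real zpaq code.
import json
from pathlib import Path

def missing_blocks(index_path: Path, block_dir: Path, wanted: list[str]) -> set[str]:
    """Return the block IDs needed for `wanted` that are not present on disk."""
    index = json.loads(index_path.read_text())        # {"file.dng": ["12345", "12346"], ...}
    needed = {blk for f in wanted for blk in index.get(f, [])}
    present = {p.stem for p in block_dir.glob("*.block")}
    return needed - present

if __name__ == "__main__":
    gone = missing_blocks(Path("output.zpaq.index.json"), Path("output.zpaq.d"),
                          ["photos/IMG_0001.dng"])
    if gone:
        print("missing blocks:", ", ".join(sorted(gone)))
        raise SystemExit(1)
    print("all blocks present, extraction can proceed in parallel")
```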

Justification: I make ZPAQ archives ranging from ~50GB to ~2TB per archive. Compression takes days or even weeks on my work-horse machine, but I only need to compress once. Sometimes I extract one or two files from the archive. The benefit would be that after performing pass 1, the archive is usable right away. Since each block is compressed separately, I can compress only some of the block files at a time, in case I need to reboot the computer. Also, I can copy the files to different computers and compress them in parallel (across computers). Another benefit is that ZPAQ can dynamically adjust the threads used based on the available RAM (if it runs out of memory, only one block is aborted instead of the whole archive). A third benefit is that it is easier to write them to BluRay discs/LTO tapes, as I do not have to store everything in one place, or use double the space to store a manually split version.

I posted the idea on the encode forums and it was rejected there. It would be kind of you to consider the idea again.

fcorbelli commented 3 months ago

It means breaking backward compatibility with zpaq, which I'm not going to do (I'd have 100 other things to change that I need a lot more, before something like that). I also oppose, when not strictly necessary, the use of archives split over multiple files. They are fragile; there is nothing to be done about it. I wrote the backup and testbackup command pair to mitigate the problem.

ghost commented 3 months ago

breaking backward compatibility with zpaq

I agree. However, a ZPAQ archive is essentially four kinds of blocks (c, d, h, and i) concatenated together: read the blocks in sequence and you get the ZPAQ archive back. It is like a fixed-size split archive, except that in my case each file is one block long, the i blocks are stored in one file (maybe I could give up on that), and the d blocks are initially compressed with -m0. Compressing them "recompresses" them to -m5; the uncompressed block size stays the same. The responsibility for ensuring the blocks match rests with the operator.
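To illustrate the "read the blocks in sequence" point: recombining separately stored blocks into a conventional single-file archive would be plain byte concatenation. A minimal sketch, assuming zero-padded sequence numbers in the block file names (my assumption, not anything the spec mandates):

```python
# Hypothetical sketch: rebuild a conventional archive by concatenating
# block files in their original order. The "NNNNN.block" naming scheme
# is assumed only so that lexicographic sort equals sequence order.
import shutil
from pathlib import Path

def join_blocks(block_dir: str, out_path: str) -> None:
    blocks = sorted(Path(block_dir).glob("*.block"))   # zero-padded names sort in sequence order
    with open(out_path, "wb") as out:
        for blk in blocks:
            with open(blk, "rb") as src:
                shutil.copyfileobj(src, out)           # byte-for-byte concatenation

join_blocks("output.zpaq.d", "output.zpaq")
```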

They are fragile, there is nothing to be done about it.

For smaller archives, maybe, but in my experience a single multi-terabyte file is more fragile, especially when the power is cut while compressing or updating it in one attempt (this has happened to me many times, with the filesystem truncating the file). So I would argue that in my case storing the blocks separately is necessary to minimize damage, since a partial archive is still useful for recovering some of the files.

fcorbelli commented 3 months ago

For smaller archives, maybe, but in my experience a single multi-terabyte file is more fragile, especially when the power is cut while compressing or updating it in one attempt (this has happened to me many times, with the filesystem truncating the file). So I would argue that in my case storing the blocks separately is necessary to minimize damage, since a partial archive is still useful for recovering some of the files.

Well, no. An incomplete transaction will be discarded on the very next update.

If you want to freeze and resume a zpaq operation, you can always use a virtual machine and suspend it.

ghost commented 3 months ago

An incomplete transaction will be discarded on the very next update

Yes, but with the "transaction" taking multiple weeks (for the initial archive) or multiple days (for updates to it), you lose everything instead of just one (or a few) blocks. (It is the same with xz, but I use ZPAQ specifically for the ability to extract files without extracting the entire archive, plus dedup.) I have a UPS, but it only lasts a few minutes: enough to cancel and do an fsync+unmount of the drive (I put it in a script). In one case that was not enough and I lost the entire archive file, as it was truncated to 0 bytes for some reason.

Use a virtual machine and suspend it.

I tried with QEMU. On top of the overhead, it did not work very well. I also tried criu, and whether it works depends on how lucky you get with the PID numbers. In fact this is my main complaint with ZPAQ: to achieve the maximum space saving, I need to deduplicate and compress as much as possible in one go.

ghost commented 3 months ago

Some clarification: the split files are single blocks (all four types) as described in section 8 of the spec. Pass 1 creates a ZPAQ archive with -m0xx, but with the blocks stored as separate files. Pass 2 compresses the blocks, replacing the -m0xx d blocks with -m5xx ones. You can get the "traditional" ZPAQ file back by concatenating the blocks into one large file. The "index" file I mentioned in the first post is an additional copy of the c/h/i blocks in one file (as mentioned at the end of section 8). Since the c, h, and i blocks are already in the output folder, the index is optional; it is for the case where only part of the archive is available (e.g. archives spanning multiple disks or computers, with a complete index on one, or all, of the disks, so you know which disk to fetch based on the "missing blocks" output when you try extracting).
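To show the per-block independence that pass 2 relies on, here is a minimal sketch; Python's lzma merely stands in for the real -m5xx recompression (the actual step would rewrite ZPAQ d blocks, which nothing below does), and the directory layout is the hypothetical one from the earlier sketches:

```python
# Hypothetical "pass 2" driver: each block is an independent unit of
# work, so an interrupted run only loses the blocks in flight, never
# the whole archive. lzma is only a stand-in for the real recompression.
import lzma
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def recompress_one(block: Path) -> Path:
    out = block.with_name(block.name + ".xz")          # e.g. 12345.block -> 12345.block.xz
    out.write_bytes(lzma.compress(block.read_bytes(), preset=9))
    return out

def pass2(block_dir: str, workers: int = 4) -> None:
    blocks = sorted(Path(block_dir).glob("*.block"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for done in pool.map(recompress_one, blocks):
            print("recompressed", done)

if __name__ == "__main__":
    pass2("output.zpaq.d")
```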

fcorbelli commented 3 months ago

This seems like overkill to gain some bytes with placebo-level compression. Use a more reliable hypervisor (VMware, for example) if you really want to suspend and restart, and that's it.

ghost commented 3 months ago

gain some bytes with the placebo-level compression

For ~500 GB of uncompressed raw DNG photos (compressible, but not dedupe-able), ZPAQ with -m59 comes out around 6% smaller than xz -k --lzma2=dict=1610612736,mf=bt4,mode=normal,nice=273,depth=4294967295 (the maximum that xz accepts), so I do not consider that "placebo-level". With dedupe-able files the advantage would be even higher, and the larger the source, the more ZPAQ saves.
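For concreteness, the saving works out roughly as follows (back-of-the-envelope only; the assumption that the xz output stays close to the ~500 GB input is mine, based on raw DNGs compressing poorly):

```python
# Rough arithmetic only; both numbers below are approximations.
xz_size_gb = 500      # assumed size of the xz output, close to the raw input
zpaq_gain = 0.06      # "around 6% smaller" with -m59
print(f"saving ≈ {xz_size_gb * zpaq_gain:.0f} GB")   # ≈ 30 GB
```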

Even leaving the time-versus-size argument aside, the "split blocks" format allows more flexibility, as I mentioned above (partial archives, easy to spread across disks, no need to deal with multi-terabyte files, etc.).

I have used ZPAQ for years to archive data, and I believe that my suggestion fixes all the "pain points" I have encountered.

VMware for example

I do not feel lucky enough for dkms.

fcorbelli commented 3 months ago

6% does not seem a big gain. I am quite confident that the cost in time, electricity and heating of saving 30GB is not exactly worth it. About 5 euro.

However, I really don't think I will do such work. It's difficult, time consuming, and would be used by a single user in the world. But not me 😄

ghost commented 3 months ago

That is fine. I am closing the issue.

fcorbelli commented 3 months ago

Your request is legit, but way too complex. Sorry