fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Linear writing of the extracted files. #135

Closed mirogeorg closed 1 month ago

mirogeorg commented 1 month ago

Franco, when extracting with ZPAQFRANZ, it reads the .ZPAQ archive sequentially, but writes and updates the extracted files non-sequentially. This results in a lot of read and write operations, making the disk the bottleneck.

Is it possible to make it write the extracted files sequentially while reading the .ZPAQ archive non-sequentially? It's clear that in this case, the same data might be read and extracted multiple times. This is how VEEAM extracts files.

This problem mainly concerns the extraction of large files, where it is particularly pronounced.

fcorbelli commented 1 month ago

The short answer is OF COURSE NOT for zpaq. Due to the multiple versions you can't (spiegone already posted on encode.ru)

With zpaqfranz YES, if you have enough free RAM (aka: bigger than the biggest file to extract), or YES if you almost turn off the deduplicator (the -stdout switch)

Do you really want the spiegones? 😄

fcorbelli commented 1 month ago

PS: no, the same data is not read more than once, because there is no duplicate data in zpaq archives. LSS, zpaq's extraction is more SSD-friendly, and even ZFS-friendly, because it skips over zeros. Because even zero matters
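
Roughly, this is why zero matters: an all-zero fragment can be skipped with a seek instead of a write, so a sparse-aware filesystem such as ZFS can keep a hole there. A minimal sketch of the idea, not the actual zpaqfranz code; all_zeros and put_fragment are invented names.

    // Minimal sketch of zero-skipping on extraction (illustrative, not zpaqfranz code).
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // True if every byte of the fragment is zero.
    static bool all_zeros(const std::vector<unsigned char>& frag) {
        return std::all_of(frag.begin(), frag.end(),
                           [](unsigned char b) { return b == 0; });
    }

    // Write one fragment at the given offset of the (pre-sized) output file;
    // all-zero fragments are skipped, leaving a hole for sparse-aware filesystems.
    static void put_fragment(std::FILE* out, long offset,
                             const std::vector<unsigned char>& frag) {
        if (all_zeros(frag))
            return;                               // nothing to write, nothing to read back
        std::fseek(out, offset, SEEK_SET);
        std::fwrite(frag.data(), 1, frag.size(), out);
    }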

mirogeorg commented 1 month ago

I understand. What is LSS extraction?

fcorbelli commented 1 month ago

Long Story Short. You can use -ramdisk (on zpaqfranz) if you have enough RAM. Just now I am refining the output of this switch

fcorbelli commented 1 month ago

Please try 60.8a (via an upgrade) with the -ramdisk switch. If you have enough RAM, the output will be sequential.
Incidentally, it also checks the file hashes, making additional tests unnecessary

It is a switch that already existed but showed little detailed information.
It needs to be refined, especially moving the warning about the availability of "true" RAM. Let me know what you think (obviously it is slower than a normal extraction to a solid-state drive)

fcorbelli commented 1 month ago

PS this is just a pre-release under development...

mirogeorg commented 1 month ago

Franco, what was the -ramdisk option originally designed for? In my case, it won't be suitable because the VM disks are huge.

In general, it's hard to determine before the actual extraction whether the memory will be enough, and that's why I'm curious about what this option was originally designed for.

fcorbelli commented 1 month ago

"It's used for chunk extraction on a RAMdisk. The w command. I never finished it for the 'normal' x command (extraction), maybe I'll work on it a bit. If you have a machine with a large enough amount of RAM and not huge files, you can perform an extraction with sequential writing, directly checking the hashes. This is something that's not possible to do, except with the p command, which is a merge of unzpaq206. However, it’s single-threaded and therefore incredibly slow. If the file size is larger than the available RAM, there’s nothing you can do. Mount solid-state drives or wait for it to finish.

"Obviously, by RAMdisk I simply mean RAM allocated in the computer, so it also technically includes the swapfile. I had to make a considerable effort to understand what is 'real' RAM (as opposed to virtual memory). Translation: if you want to compare the HASH (not the CRC-32) of the files inside ZPAQ, you have no choice but to extract them. After extracting them, you can calculate the HASH and compare it to the stored one. If you don't want to extract them (and you can do it), you use the 'ramdisk'. The files are extracted into RAM, and from there, you calculate the HASH (and maybe in the future, you write them to disk sequentially)."

fcorbelli commented 1 month ago

[image: hdd]

The new -hdd switch (available with upgrade -force) will use (if possible) the available RAM for sequential output. In this example, using a very old external HDD, a small vmdk is extracted much faster

BTW this will recalculate "on the fly" the hash of the output files

mirogeorg commented 1 month ago

This option is good. However, I couldn't test it because the file sizes I work with are large. But overall, it's an interesting option. If it's a new option, the name doesn't seem very intuitive to me. Maybe something like -UseRAM, -UseRAMBuffer, or -TryExtractToRAMFirst would be better.

As for the wiki: Due to deduplication, ZPAQ extracts files non-linearly, which leads to a large number of random read and write requests. This results in higher disk load. If sufficient RAM is available, the extraction will take place in a memory buffer, and only then will the files be written sequentially to the disk at maximum speed. If such memory is not available, the extraction will automatically be performed without acceleration.

fcorbelli commented 1 month ago

I can use a local SSD (if any) as a "swapfile". Aka: extraction to SSD, then copy to HDD. This should be useful only for mixed servers, like OVH's. Not sure this makes sense.

On Windows the virtual RAM is usually quite big, up to the free boot-drive space (by default). Then on an NVMe C: you can get 200 or more GB of pagefile.sys, even on a consumer configuration. Ironically, on *nix the swap partition is usually small, like 16GB.

If you put a swapfile on a big SSD you can extract to "fake" RAM even very big vmdks.

Maybe I'll do some tests

mirogeorg commented 1 month ago

Was your previous test of the -hdd option on an HDD? I work and test entirely on SSD.

An interesting solution to the problem would be the ability to decompress only a specific part of a file, for example a specific range, say from offset 0 to 1 MB, reading only the pieces that the algorithm needs to decompress that fragment of the file.

Currently, as far as I understand, the logic is the opposite — the archive is read, and the information is distributed to where it belongs.

fcorbelli commented 1 month ago

well... do you really, really want the "spiegone"? 😄

fcorbelli commented 1 month ago

None of the options you mentioned are feasible. In broad terms, the extraction procedure involves reading the archive in a first pass, focusing in particular on the index blocks (i), which are essentially the file names with their list of fragments. During this initial phase the data blocks are skipped with a SEEK, using the C blocks. After this first "scan" of the archive, the files to be restored are marked, creating a vector of blocks, of type Block.

At this point the main thread creates N decompression threads, which run in parallel (until a few versions ago N defaulted to the number of HT cores; in recent versions it should be the number of physical cores). Each decompression thread opens the zpaq archive itself, which is typically cached by the internal buffering system (!), or not, depending. The decompression thread unpacks a block and then, for each file that points to that block, seeks to the right position and writes it to disk. In reality, it does a thousand other things. Additionally, ordered fragments are merged together for a single write to disk, although this doesn't happen if they are composed of zeros. This is, broadly speaking, the description of the decompression procedure. I can do like ChatGPT: do you want a more detailed explanation?

Beyond the details, the key point is that extracting a file (let's assume we have a single file for simplicity) starts with writing a file to disk made entirely of zeros (as mentioned above, the zeros are NOT physically written, nor stored). This phase can take quite a long time (imagine creating a 100GB file full of zeros). At that point it will begin to decode the blocks containing the (non-zero) data, decompress them, SEEK within the zero-filled file, and write the blocks (there's an internal caching mechanism, but never mind). If you're wondering how it manages to write while operating with multiple threads, it's simple: inside a higher-level structure, the extraction job, there's a mutex (write_mutex) that ensures the threads write one at a time.

In short, during the extraction "all sorts of things happen." The matter is even more complex because there's also the handling of files in streamed format, the primitive one used by early zpaq versions. Single-threaded extraction is much simpler (like unzpaq.cpp does, if you want to study it) BUT it has the "tiny" problem that files are extracted into RAM before being written. Essentially it reads all the data fragments and decompresses them sequentially, which means (for example, for the p command) that it consumes as much RAM as the decompressed files are large.

Is that enough, or should I go on?
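
If it helps, here is a bare-bones sketch of that thread layout, based only on the description above (not on the real source; Frag, Block, worker and extract are invented names): the blocks are decoded in parallel, and only the seek+write into the output files is serialized through write_mutex.

    // Bare-bones sketch of the multi-threaded extraction layout described above.
    // Illustrative only: decoding is elided, names are invented.
    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Frag  { std::FILE* file; long offset; std::vector<unsigned char> data; };
    struct Block { std::vector<Frag> frags; };    // one decodable unit of the archive

    std::mutex write_mutex;                       // same role as the mutex in the text

    static void worker(const std::vector<Block>& blocks, size_t begin, size_t step) {
        for (size_t i = begin; i < blocks.size(); i += step) {
            // ...decompress blocks[i] here (CPU-bound, fully parallel)...
            std::lock_guard<std::mutex> lk(write_mutex);       // writes go one at a time
            for (const Frag& f : blocks[i].frags) {
                std::fseek(f.file, f.offset, SEEK_SET);        // random position in the output
                std::fwrite(f.data.data(), 1, f.data.size(), f.file);
            }
        }
    }

    static void extract(const std::vector<Block>& blocks, unsigned n_threads) {
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < n_threads; ++t)
            pool.emplace_back(worker, std::cref(blocks), t, n_threads);
        for (std::thread& th : pool) th.join();
    }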

fcorbelli commented 1 month ago

A bit more here https://encode.su/threads/456-zpaq-updates?p=81360&viewfull=1#post81360

fcorbelli commented 1 month ago

Anyway, let me explain it in a hopefully more understandable way with an example. Let's suppose we store a file called pippo.vmdk in version 1. This will be written into the zpaq archive, and we won't worry about how for now. Let's assume its last byte is 150.

Now we make a small change right at the end of pippo.vmdk, changing the last byte from 150 to 27, and we store it again in version 2. The .zpaq file will have an initial part where the last byte of pippo.vmdk is 150, and then another fragment where the last byte is 27.

Now let's change the FIRST byte of pippo.vmdk and once again update the .zpaq archive with a third version. When I need to extract the third version of pippo.vmdk, I'll have to perform writes with SEEKs: first sequentially (that is, from the beginning to the end), then, once I get to the third version, I'll need to SEEK back to the start of the file to change the first byte.

This is why, in general, WRITING involves SEEKs. The zpaq archive is written sequentially, from the first version to the next ones, BUT a single file can be composed of fragments that are NOT in sequence. You can use the dump -all command on small files.

This is a real-world example of a file (va.cpp) with UNORDERED fragments

2024-02-25 19:32:08 va.cpp  |XXHASH64|  FRAGS (#54):  110 4 73 74 7-11 69 13-16 84 85 19-23 111 25 112 27-34 87 36-38 113 114 41-46 115 48-52 116 54 117 109

To extract it, you will need to write, in sequence, fragment 110, then 4, then 73, 74, from 7 to 11, and so on. However, these fragments are (in general) all in different positions within the output file, requiring a SEEK for each fragment (or almost). In some cases, there are only a few SEEKs, while in others, there are many. There is no way to know the list of fragments before reading the corresponding i block of the version (typically the latest one) from which you want to extract the data.

Zpaq, as mentioned, does NOT work this way. It does NOT perform multiple SEEKs INSIDE the .zpaq file to read the fragments in sequence and then write them one by one into the files. It works the other way around: it reads the fragments in sequence, 1, 2, 3, 4... When it reads fragment 4, it "understands" that it needs to write it inside va.cpp at a specific place (it's the second fragment). Initially, I remind you, va.cpp is entirely made of zeros. Then it continues reading 5... 6... 7. After reading fragment 7, it "understands" that this one also goes inside va.cpp, and so do 8 up to 11. Then it reads fragment 12, but does nothing with it. Then 13, and again it "understands" that it needs to write it inside va.cpp, up to 16. Then 17 and 18 (does nothing), and it arrives at 19, which needs to be written. And so on.
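
In code terms, the direction of that loop looks roughly like this. Illustrative only, not zpaqfranz's source; Target, FragmentMap and sequential_scan are made-up names. The archive's fragments are walked in order, and each needed fragment is written to every output position that references it, which is exactly where the SEEKs on the output side come from.

    // Sketch of the "read the archive in order, write the outputs with SEEKs" loop.
    // Illustrative only: fragment decoding and the real data structures are elided.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Target { std::FILE* file; int64_t offset; };        // where a fragment belongs

    // fragment id -> all the places that need it (built from the 'i' index blocks)
    using FragmentMap = std::map<uint32_t, std::vector<Target>>;

    static void sequential_scan(const std::vector<std::vector<unsigned char>>& archive_frags,
                                const FragmentMap& wanted) {
        for (uint32_t id = 0; id < archive_frags.size(); ++id) {   // fragments in archive order
            auto it = wanted.find(id);
            if (it == wanted.end())
                continue;                                          // fragment not needed: skip it
            for (const Target& t : it->second) {                   // usually one SEEK per fragment
                std::fseek(t.file, static_cast<long>(t.offset), SEEK_SET);
                std::fwrite(archive_frags[id].data(), 1, archive_frags[id].size(), t.file);
            }
        }
    }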

fcorbelli commented 1 month ago

I hope you're starting to understand why it’s not so simple to describe in detail what happens, even using ChatGPT, and how much time and attention it requires. And this is the SIMPLIFIED version of what "actually" happens 😄

fcorbelli commented 1 month ago

Additional bonus: zpaqfranz has a specific mode to avoid all of this, activated with the -stdout switch during add (the a command); it doesn't exist in zpaq.

C:\zpaqfranz>zpaqfranz a z:\ordered *.txt -stdout -nodedup
C:\zpaqfranz>zpaqfranz a z:\unordered *.txt 

Then look at the ugo.txt

C:\zpaqfranz>zpaqfranz dump z:\unordered.zpaq -stdout -all |grep ugo.txt
2024-04-19 13:27:48 ugo.txt  |XXHASH64B|  FRAGS (#54):  63-88 34 89 36-38 90-96 46 47 97 49-51 98-107

C:\zpaqfranz>zpaqfranz dump z:\ordered.zpaq -stdout -all |grep ugo.txt
2024-04-19 13:27:48 ugo.txt  |XXHASH64B|  FRAGS (#54):  63-116

BUT... this will DISABLE the deduplicator!

fcorbelli commented 1 month ago

short version: try -stdout -nodedup

This should speed up (a lot) the extraction (if you extract very often), but with a larger output archive; it makes sense only if a single version is used