The short answer is OF COURSE NOT for zpaq. Due to the multiple versions you can't (the "spiegone", i.e. the long explanation, is already on encode.ru).
With zpaqfranz YES, if you have enough free RAM (aka: bigger than the biggest file to extract), or YES if you almost turn off the deduplicator (the -stdout switch).
Do you really want the spiegones? 😄
PS no, the same data is not read more than once, because there is no duplicate data in zpaq archives. LSS: zpaq's extraction is more SSD friendly, even zfs friendly, because it skips over zeros. Because even zero matters.
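Just to illustrate the zero-skipping point, a minimal C++ sketch (not the zpaqfranz code; it assumes the output file has already been created at its final size, so untouched regions can stay sparse on zfs):

// Illustrative only: an all-zero fragment is simply not written, so the
// corresponding region of the pre-sized output file stays sparse.
#include <cstdio>
#include <vector>

bool all_zeros(const std::vector<char>& buf) {
    for (char c : buf)
        if (c != 0) return false;
    return true;
}

void write_fragment(std::FILE* out, long offset, const std::vector<char>& frag) {
    if (all_zeros(frag)) return;                          // even zero matters: no write at all
    std::fseek(out, offset, SEEK_SET);                    // seek only where there is real data
    std::fwrite(frag.data(), 1, frag.size(), out);
}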
I understand. What is LSS extraction?
Long Story Short. You can use -ramdisk (on zpaqfranz) if you have enough RAM. Just now I am refining the output of this switch.
Please try 60.8a (via an upgrade) with the -ramdisk switch
If you have enough RAM the output will be sequential.
Incidentally, it also checks the file hashes, making additional tests unnecessary.
It is a switch that already existed but showed little detailed information.
It needs to be refined, especially moving the warning about the availability of "true" RAM.
Let me know what you think (obviously it is slower than a normal extraction to a solid-state drive).
PS this is just a pre-release under development...
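For reference, a possible invocation (only a sketch: the paths are placeholders and the usual x ... -to output syntax is assumed):

C:\zpaqfranz>zpaqfranz upgrade
C:\zpaqfranz>zpaqfranz x z:\backup.zpaq -to z:\restored -ramdisk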
Franco, what was the -ramdisk option originally designed for? In my case, it won't be suitable because the VM disks are huge.
In general, it's hard to determine before the actual extraction whether the memory will be enough, and that's why I'm curious about what this option was originally designed for.
"It's used for chunk extraction on a RAMdisk. The w
command. I never finished it for the 'normal' x
command (extraction), maybe I'll work on it a bit. If you have a machine with a large enough amount of RAM and not huge files, you can perform an extraction with sequential writing, directly checking the hashes. This is something that's not possible to do, except with the p
command, which is a merge of unzpaq206. However, it’s single-threaded and therefore incredibly slow. If the file size is larger than the available RAM, there’s nothing you can do. Mount solid-state drives or wait for it to finish.
"Obviously, by RAMdisk I simply mean RAM allocated in the computer, so it also technically includes the swapfile. I had to make a considerable effort to understand what is 'real' RAM (as opposed to virtual memory). Translation: if you want to compare the HASH (not the CRC-32) of the files inside ZPAQ, you have no choice but to extract them. After extracting them, you can calculate the HASH and compare it to the stored one. If you don't want to extract them (and you can do it), you use the 'ramdisk'. The files are extracted into RAM, and from there, you calculate the HASH (and maybe in the future, you write them to disk sequentially)."
The new -hdd switch (available with upgrade -force) will use (if possible) the available RAM for a sequential output. In this example, using a very old external HDD, a small vmdk is extracted much faster.
BTW this will recalculate "on the fly" the hash of the output files.
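A possible invocation, again only a sketch with placeholder paths, assuming -hdd applies to the normal x (extract) command:

C:\zpaqfranz>zpaqfranz upgrade -force
C:\zpaqfranz>zpaqfranz x d:\backup.zpaq -to e:\restored -hdd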
This option is good. However, I couldn't test it because the files I work with are large. But overall, it's an interesting option. If it's a new option, the name doesn't seem very intuitive to me. Maybe something like -UseRAM, -UseRAMBuffer, or -TryExtractToRAMFirst would be better.
As for the wiki - Due to deduplication, ZPAQ extracts files non-linearly, which leads to a large number of random read and write requests. This results in higher disk load. If sufficient RAM is available, the extraction will take place in a memory buffer, and only then will the files be written sequentially to the disk at maximum speed. If such memory is not available, the extraction will automatically be performed without acceleration.
I can use a local SSD as a "swapfile", if one is available. Aka: extraction to SSD, then copy to HDD. This should be useful only for mixed servers, like OVH's. Not sure this makes sense.
On Windows the virtual RAM is usually quite big, up to the free boot-drive space (by default). Then on an NVMe C: you can get 200 or more GB of pagefile.sys, even on a consumer configuration. Ironically, on *nix the swap partition is usually small, like 16GB.
If you put a swapfile on a big SSD you can extract even a very big vmdk into "fake" RAM.
Maybe I'll do some tests.
Was your previous test of the -hdd option on an HDD? I work and test entirely on SSD.
An interesting solution to the problem would be the ability to decompress only a specific part of a file. For example, a specific file offset from 0 to 1 MB, and accordingly, read all the pieces that the algorithm needs to decompress only this fragment of the file.
Currently, as far as I understand, the logic is the opposite — the archive is read, and the information is distributed to where it belongs.
well... do you really, really want the "spiegone"? 😄
None of the options you mentioned are feasible.

In broad terms, the extraction procedure involves reading the archive in a first pass, particularly focusing on the index blocks (i), which essentially means the file names with the list of fragments. During this initial phase, the data blocks are skipped with a SEEK using the C blocks. After this first "scan" of the archive, the files to be restored are marked, creating a vector of blocks, of type Block.

At this point, the main thread creates N decompression threads, which are launched in parallel (until a few versions ago, N was by default the number of HT cores; in recent versions it should be the number of physical cores). The decompression threads, in turn, open the zpaq archive, which is typically cached by the internal buffering system (!), or not, depending. The decompression thread unpacks the block and then, for each file that points to that block, positions itself on it and writes it to the disk. In reality, it does a thousand other things. Additionally, in the case of ordered fragments, they are merged together for a single write to disk, although this doesn't happen if they are composed of zeros.

This is, broadly speaking, the description of the decompression procedure. I can do like ChatGPT: do you want a more detailed explanation?

Beyond the details, the key point is that extracting a file (let's assume we have a single file for simplicity) starts with writing a file to disk made entirely of zeros (as mentioned above, the zeros are NOT written, nor stored). This phase can take quite a long time (imagine creating a 100GB file full of zeros). At that point, it will begin to decode the blocks containing the data (non-zero ones), decompress them, SEEK within the zero-filled file, and write the blocks (there's an internal caching mechanism, but never mind).

If you're wondering how it manages to write while operating with multiple threads, it's simple. Inside a higher-level structure, the extraction job, there's a mutex (write_mutex) that ensures the threads write one at a time. In short, during the extraction, "all sorts of things happen." The matter is even more complex because there's also the management of file extraction in streamed format, the primitive one used by early zpaq versions.

Single-threaded extraction is much simpler (like unzpaq.cpp does, if you want to study it) BUT it has the "tiny" problem that files are extracted into RAM before being written. Essentially, it reads all the data fragments and decompresses them sequentially. This means (for example, for the p command) that it consumes as much RAM as the decompressed files are large.

Is that enough, or should I go on?
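To fix the idea, here is a heavily simplified C++ sketch of the second phase just described. It is not the real libzpaq/zpaqfranz code (only Block and write_mutex come from the description above; everything else is invented for illustration): N threads take blocks, and the writes are serialized by a mutex as SEEK + write inside the pre-zeroed output file.

// Heavily simplified illustration of the flow described above; not the real code.
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Fragment { long offset; std::vector<char> data; };   // where the piece goes in the output file
struct Block    { std::vector<Fragment> fragments; };       // one block, already decoded here

std::mutex write_mutex;                                      // the threads write one at a time

void extract_blocks(std::FILE* out, const std::vector<Block>& blocks) {
    // Phase 1 (not shown): scan the index (i) blocks, skip data blocks with SEEK, build `blocks`.
    // Phase 2: N threads; each one takes some blocks and writes their fragments.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t b = t; b < blocks.size(); b += n) {    // naive round-robin split
                const Block& blk = blocks[b];
                std::lock_guard<std::mutex> lock(write_mutex);      // serialize the writes
                for (const Fragment& f : blk.fragments) {
                    std::fseek(out, f.offset, SEEK_SET);            // SEEK inside the pre-zeroed file
                    std::fwrite(f.data.data(), 1, f.data.size(), out);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}

The real code does far more (fragment merging, internal caching, streamed format, error handling); the sketch only shows where the SEEKs on the output side come from.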
Anyway, let me explain it in a hopefully more understandable way with an example. Let's suppose we store a file called pippo.vmdk in version 1. This will be written into the zpaq archive, and we won't worry about how for now. Let's assume the last byte is 150.

Now, we make a small change right at the end of the pippo.vmdk file, changing the last byte from 150 to 27, and we store it again in version 2. The .zpaq file will have an initial part where the last byte of the pippo.vmdk file is 150, and then another fragment where the last byte is 27.

Now, let's change the FIRST byte of pippo.vmdk and once again update the .zpaq archive with the third version. When I need to extract the third version of pippo.vmdk, I'll have to perform writes with SEEKs. First, sequentially (that is, from the beginning to the end); then, once I get to the third version, I'll need to SEEK back to the start of the file to change the first byte.

This is why, in general, WRITING involves SEEKs. The zpaq archive is written sequentially, from the first version to the next ones, BUT a single file can be composed of fragments that are NOT in sequence. You can use the dump -all command on small files.
This is a real-world example of a file (va.cpp) with UNORDERED fragments
2024-02-25 19:32:08 va.cpp |XXHASH64| FRAGS (#54): 110 4 73 74 7-11 69 13-16 84 85 19-23 111 25 112 27-34 87 36-38 113 114 41-46 115 48-52 116 54 117 109
To extract it, you will need to write, in sequence, fragment 110, then 4, then 73, 74, from 7 to 11, and so on. However, these fragments are (in general) all in different positions within the output file, requiring a SEEK for each fragment (or almost). In some cases, there are only a few SEEKs, while in others, there are many. There is no way to know the list of fragments before reading the corresponding i block of the version (typically the latest one) from which you want to extract the data.
Zpaq, as mentioned, does NOT work this way. It does NOT perform multiple SEEKs INSIDE the .zpaq file to read the fragments in sequence and then write them one by one into the files. It works the other way around. It reads the fragments in sequence: 1, 2, 3, 4... When it reads fragment 4, it "understands" that it needs to write it inside va.cpp in a specific place (it's the second fragment). Initially, I remind you, va.cpp is entirely made of zeros. Then it continues reading 5... 6... 7. After reading fragment 7, it "understands" that this one also goes inside va.cpp, then 8 up to 11. Then it reads fragment 12, but does nothing with it. Then 13, and again it "understands" that it needs to write it inside va.cpp, up to 16. Then 17, 18 (does nothing) and arrives at 19, which needs to be written. It repeats this way.
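To make the direction of the work concrete, a toy C++ sketch of that loop, with invented names and in-memory fragments instead of the real archive reader:

// Toy illustration: the archive's fragments are visited in storage order (1, 2, 3, ...);
// a fragment is written only if the file being restored needs it, at the offset where it belongs.
// The archive side is read once, sequentially; the output side is written with SEEKs.
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

void restore_file(std::FILE* out,
                  const std::vector<std::vector<char>>& fragments_in_archive_order,
                  const std::map<std::size_t, long>& needed)   // fragment id -> offset in the output file
{
    for (std::size_t id = 0; id < fragments_in_archive_order.size(); ++id) {
        auto it = needed.find(id);
        if (it == needed.end()) continue;                      // belongs to other files: skip it
        const std::vector<char>& frag = fragments_in_archive_order[id];
        std::fseek(out, it->second, SEEK_SET);                 // jump to where this piece goes
        std::fwrite(frag.data(), 1, frag.size(), out);
    }
}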
I hope you're starting to understand why it’s not so simple to describe in detail what happens, even using ChatGPT, and how much time and attention it requires. And this is the SIMPLIFIED version of what "actually" happens 😄
Additional bonus: zpaqfranz has a specific mode to avoid all of this, activated with the -stdout switch on the a (add) command (it doesn't exist in zpaq).
C:\zpaqfranz>zpaqfranz a z:\ordered *.txt -stdout -nodedup
C:\zpaqfranz>zpaqfranz a z:\unordered *.txt
Then look at ugo.txt:
C:\zpaqfranz>zpaqfranz dump z:\unordered.zpaq -stdout -all |grep ugo.txt
2024-04-19 13:27:48 ugo.txt |XXHASH64B| FRAGS (#54): 63-88 34 89 36-38 90-96 46 47 97 49-51 98-107
C:\zpaqfranz>zpaqfranz dump z:\ordered.zpaq -stdout -all |grep ugo.txt
2024-04-19 13:27:48 ugo.txt |XXHASH64B| FRAGS (#54): 63-116
This should speed up the extraction (a lot) if you extract very often, but with a larger output archive; it makes sense only if a single version is used.
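A tiny illustration of why the ordered case extracts faster: count the jumps in a fragment list, since each jump roughly means one more SEEK on the output side (ordered runs can be merged into a single write, as explained above). The numbers come from the dump output, with the unordered list shortened:

// Count how many non-contiguous jumps a fragment list contains: "63-116" is a single run
// (zero jumps), while an unordered list jumps back and forth many times.
#include <cstddef>
#include <cstdio>
#include <vector>

int count_jumps(const std::vector<int>& frags) {
    int jumps = 0;
    for (std::size_t i = 1; i < frags.size(); ++i)
        if (frags[i] != frags[i - 1] + 1) ++jumps;             // not the next fragment: one more jump
    return jumps;
}

int main() {
    std::vector<int> ordered;
    for (int f = 63; f <= 116; ++f) ordered.push_back(f);      // the "ordered" case: FRAGS 63-116
    std::vector<int> unordered = {63, 64, 65, 34, 89, 36, 37, 38, 90, 46};   // shortened unordered example
    std::printf("ordered jumps:   %d\n", count_jumps(ordered));    // prints 0
    std::printf("unordered jumps: %d\n", count_jumps(unordered));  // prints 5
    return 0;
}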
Franco, when extracting with ZPAQFRANZ, it reads the .ZPAQ archive sequentially, but writes and updates the extracted files non-sequentially. This results in a lot of read and write operations, making the disk the bottleneck.
Is it possible to make it write the extracted files sequentially while reading the .ZPAQ archive non-sequentially? It's clear that in this case, the same data might be read and extracted multiple times. This is how VEEAM extracts files.
This problem concerns the extraction of large files, where it is particularly pronounced.