fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Memory issue while adding files #130

Closed Namke closed 1 month ago

Namke commented 2 months ago

Hello there

I ran into an issue while running a command: the program prints "skipping block XXXX" during the block-loading phase, then crashes (probably terminated by Windows) once it has consumed all the memory (96 GB).

However, the same block-loading phase for the t command takes only 39 GB of memory, and the t command ran fine.

Shouldn't reading blocks for add and test behave the same and take a similar amount of RAM? Or do the two behave differently, so that I simply need more RAM for adding files?

fcorbelli commented 2 months ago

It is difficult to give a precise answer without knowing the exact circumstances
You would also need the e.what() information (i.e., the exception message)

Anyway, no: the memory used during the compression phase is not necessarily the same as the memory used during decompression (if I'm not mistaken, I mentioned this on a wiki page): https://github.com/fcorbelli/zpaqfranz/wiki/Voodoo-zpaq-level-1-threads-and-RAM

You can try, if you really want, to REDUCE the number of threads with the -t switch, because each compression thread uses (approximately; the issue is complex) the same amount of RAM
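
For example (hypothetical archive and source paths, just to show the switch; fewer threads means fewer parallel compression buffers and therefore less peak RAM):

zpaqfranz a z:\backup.zpaq c:\data -t2 -m3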

Namke commented 2 months ago

I always add files with -t4 and -m3; even with -t2 the same thing happens.

While adding files, zpaq reaches full memory, then prints a series of "skipping block XXX bad allocation" messages, and then the OS naturally terminates it. Screenshot 2024-09-30 235035

While testing, zpaq used at most about 30-32 GB, then dropped to 15 GB, and the test completed fine.

I'm not sure the number of files is the problem: I have another pack that is smaller in size but far larger in file count, and it ran fine.

And all the crashes happen while reading blocks, before the program even gets to scanning files.

fcorbelli commented 2 months ago

How many files?

fcorbelli commented 2 months ago

You can try the attached pre-release, with and without -frugal (a rough estimation of heap RAM is shown): 60_7m.zip

fcorbelli commented 1 month ago

You can try (via zpaqfranz upgrade -force) the newer pre-release 60_7o

mirogeorg commented 1 month ago

I'm getting the same error. The machine is a 5950X with 128 GB of RAM, about 80 GB of which is free.

commandline: \\bkpsrv\bkp\@AutoUpdate\ZPAQ\zpaqfranzhw a \\bkpsrv\bkp\srvInstall????.zpaq \\VMHOST3\d$\@GMT-2024.10.04-19.39.51\vmware\srvInstall\ -to "" -t8 -fragment 1 -not *swap*.* -not *.mct -not *.rct

Output:

zpaqfranz v60.7o-JIT-GUI-L,HW BLAKE3,SHA1,SFX64 v55.1,(2024-10-01)
franz:-to                   <<>>
franz:-threads                                  8
franz:-fragment                                 1
franz:-not                               *swap*.*
--------------------------------------------------------------------------------------------------------------------
franz:-not                               *swap*.*
franz:-not                                  *.mct
--------------------------------------------------------------------------------------------------------------------
franz:-not                               *swap*.*
franz:-not                                  *.mct
franz:-not                                  *.rct
--------------------------------------------------------------------------------------------------------------------

<<//bkpsrv/bkp/srvInstall????.zpaq>>:
Skipping block at 637663533142: index block requires too much memory
Skipping block at 639168790950: index block requires too much memory
Skipping block at 640641937744: index block requires too much memory
3 versions, 27 files, 641.822.763.773 bytes (597.74 GB)
00079! 36230: ERROR_NO_MORE_FILES       //bkpsrv/bkp/srvInstall0004.zpaq
Creating //bkpsrv/bkp/srvInstall0004.zpaq at offset 0 + 641.822.763.773
Add 2024-10-04 22:42:16         7  1.132.14.446.208 (   1.03 TB) 8T (3 dirs)
(061%)  98.77% 00:00:55  (   1.02 TB)->( 130.13 MB)=>( 131.75 MB)  238.19 MB/s
fcorbelli commented 1 month ago

Do not change fragment size

mirogeorg commented 1 month ago

Franco, do you want me to test with the standard fragment, or do you already know what the issue is and are certain it's because of that?

fcorbelli commented 1 month ago

Test with the standard fragment size (or even bigger). -fragment 1 makes A LOT of fragments

mirogeorg commented 1 month ago

image

It seems like the memory usage peaks somewhere around the values shown. This is during testing...

mirogeorg commented 1 month ago

First, I'd like to test again with -fragment 1 to see what status zpaqfranz reports at the end, and whether it indicates that everything is OK.

It doesn't seem to crash, it creates an archive, and it looks like it's valid since it then creates parts 0002, 0003, and so on.

fcorbelli commented 1 month ago

Here’s a compelling example, so you can get an idea of the documentation problem. The short answer is, "don't change the default fragment size." That's it.
If you want to know what a fragment is, how it gets packaged into a block, and why you should or shouldn't choose certain sizes, you'd need a "spiegone mortale" (a deadly wall of explanation) spanning dozens of pages.

Would you read it, or would you stop at "don't change it"?

Would you really want to dive into all the nuances, or would you rather spend a couple of hours on more productive things?

Should I spend hours and hours not only explaining, but going into every tiny detail (the issue is quite complex)? Or would it be enough for me to just write "don't change it"?

fcorbelli commented 1 month ago

If you are brave, you can start here, reading the parts about jidac: https://encode.su/threads/456-zpaq-updates?p=29846&viewfull=1#post29846

As you know, or maybe not, current zpaq is the "merge" of two very different "beasts", the proto-zpaq and jidac. You will find, in the source, that the "blob-object" is called... Jidac

😄

mirogeorg commented 1 month ago

Franco, I really appreciate details, but I don’t think the documentation mentions anywhere that this option shouldn’t be used. Maybe it would be good to include that.

I’m using -fragment 1 because it gives the smallest size for what I’m archiving. I also have plenty of memory. But it seems it’s not going to work the way I thought. ;)

I'll read through what you suggest.

Here’s the result from the test. It doesn’t return an error, not even during testing... which doesn’t seem good at all.

image

I’m not sure if this is the right place for it, or if I should open a new issue, but what is this red message? It worries me because it doesn’t appear with all archives. What could be causing it?

image

fcorbelli commented 1 month ago

The issue with 'fragments' is quite simple: the smaller they are, the more numerous they become (see the quick calculation below). No one is stopping you from changing that parameter, but it's up to you to choose the right one. If you're not 100% sure of what you're doing, don't change it.

There's a pretty clear error message, "Index block requires too much memory." Maybe I should make it even clearer, like "your archive is screwed." In the second case, you have a transaction left halfway through, and again, your archive is screwed.
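
To put numbers on "A LOT", here is a back-of-the-envelope sketch. It assumes the rule from the zpaq documentation that -fragment N gives an average dedupe fragment of 2^N KiB (default N=6, i.e. 64 KiB on average), and uses the roughly 600 GB archive from the log above; the program and the data size are illustrative only.

#include <cstdio>

int main() {
    const double dataBytes = 600e9;   // ~600 GB, as in the log above
    const int levels[] = {6, 1};      // default vs. -fragment 1
    for (int n : levels) {
        const double avgFragmentBytes = 1024.0 * (1 << n);  // 2^N KiB
        const double fragments = dataBytes / avgFragmentBytes;
        std::printf("-fragment %d: ~%.0f KiB average -> ~%.1f million fragments\n",
                    n, avgFragmentBytes / 1024.0, fragments / 1e6);
    }
    return 0;
}

That is roughly 9 million fragments with the default versus almost 300 million with -fragment 1; every one of those fragments needs an index entry, which is what makes the index blocks explode.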

fcorbelli commented 1 month ago

If you want to see exactly where the "index block" is greater than 1.5GB, use -debug.

fcorbelli commented 1 month ago

Or use this pre-release: 60_7q.zip

Another example of how it's not so straightforward to describe in detail what happens.

During the first extraction phase (or list, or test, basically anything), let's say the archive metadata is read. This is done through a function (read_archive) that starts from the beginning and moves toward the end. It loads the various blocks, which contain different fragments. The blocks are compressed with zpaq, and here's where the potential problem arises: if the function estimates that decompressing a block will require more than 1.5 GB, it shows an error message (now more specific, with additional debug information). Why does it do that? I don't know; Mahoney made it that way, maybe ask him :) I'd say it's essentially a limitation for the 32-bit version rather than the 64-bit one.

Either way, this shouldn't happen: a single block should NOT require too much RAM to be decompressed. Usually this indicates a "screwed" block, meaning it contains corrupted data. Is it still possible to extract an archive EVEN IF a single block requires more than 1.5 GB? Who knows? You'll have to try. If the block isn't damaged (corrupted, as in it hasn't accumulated garbage that "drives the decompression process crazy"), generally it will still extract (on 64-bit systems). Always? Who can say.

The issue of incomplete transactions, however, requires its own explanation. Or did you think it was simple?

fcorbelli commented 1 month ago

Some notes on the zpaq format and transactions, as well as INCOMPLETE transactions.

The format originated as a mix of two completely different formats: streaming and journaled. This implies a whole series of issues, but I won’t go into detail. The journaled one essentially starts with a sort of placeholder (obviously it’s much more complex if the archive is encrypted, but again, I won’t elaborate, otherwise you’d tell me the documentation is incomprehensible). That is, at the END of the archive, let's say a series of zeros is written (actually no, but let’s pretend they are zeros). This is because zpaq DOESN'T KNOW how much data it will need to add. 1 KB? 1 MB? 1 GB? It doesn’t know. So it initially writes zero.

Then the whole process starts, which gradually writes the various blocks of different types with their various fragments, and so on. In the end it determines that it has added, let's suppose, exactly 1,234,567 bytes. So it performs a SEEK backwards to the initial placeholder, where the zeros were (again, they aren’t really zeros... but that’s fine), and writes 1,234,567. At this point the archive update (or creation) process is complete.
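
A minimal, self-contained sketch of that pattern (NOT the real zpaq code; the file name and the placeholder value are invented for illustration): reserve space for the length, append the payload, then seek back and patch in the real length.

#include <cstdint>
#include <fstream>

int main() {
    // Toy file standing in for a .zpaq archive.
    std::fstream f("archive.demo",
                   std::ios::out | std::ios::in | std::ios::binary | std::ios::trunc);

    // 1) Reserve space for the transaction length: the writer cannot know it yet.
    //    (The real format does not literally store zeros, as noted above;
    //    here a negative value plays the placeholder.)
    const std::streampos lenPos = f.tellp();
    int64_t placeholder = -1;
    f.write(reinterpret_cast<const char*>(&placeholder), sizeof placeholder);

    // 2) Append the transaction payload; only now is its size known.
    const char payload[] = "blocks, fragments, and so on";
    const int64_t written = sizeof payload - 1;
    f.write(payload, written);

    // 3) SEEK backwards and patch the placeholder with the real length.
    //    Kill the process before this step and a reader will find the
    //    placeholder: an incomplete transaction.
    f.seekp(lenPos);
    f.write(reinterpret_cast<const char*>(&written), sizeof written);
    return 0;
}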

fcorbelli commented 1 month ago

Alright, during the archive reading phase there can be various scenarios. One is when you want to add data, and another is when you simply want to (for example) list the files in the archive. To avoid having to read the entire archive (which could be terabytes) just to list the files, zpaq reads the type C blocks (the ones mentioned earlier), from which it reads (in our example) 1,234,567, allowing it to SEEK FORWARD by 1,234,567 bytes and "skip" all the data it’s not interested in (again, it doesn’t exactly work like this, but that’s the general idea; if you want, I can give you an even more detailed explanation).

So, in broad terms, you’ll have a series of "blocks" that tell you "the first transaction is 1,234,567 bytes long; skip all that and position yourself where you expect the SECOND transaction to be." If the corresponding block contains all zeros, it means that the SECOND transaction is incomplete. For example, you pressed Control-C while adding data, BEFORE the backwards SEEK was done to replace the zeros with the length of the SECOND transaction (let’s say 50,000).

So zpaq makes the first jump, reaches the second transaction, and expects to find a length (in our case, 50,000). If the length is there, it means the file is correct; it skips forward by 50,000. If the file ends there, that was the last transaction. If it finds something, it checks again to see if there is a THIRD transaction, and so on. In short, it’s chained. If, for some reason, there is data (meaning the file hasn’t ended), BUT inside the transaction block there are zeros (which aren’t actually zeros, but let’s keep it simple), THEN zpaq gives an incomplete-transaction message and simply ignores it. But this generally shouldn’t happen.

Everything clear so far?
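
The reading side, in the same toy format (again, NOT the real code, which decodes the jump out of a compressed C block, as in the snippet further down): walk the chain by skipping forward by each stored length, and stop when you hit a placeholder that was never patched.

#include <cstdint>
#include <cstdio>
#include <fstream>

int main() {
    // Reads the toy file produced by the writer sketch above.
    std::ifstream f("archive.demo", std::ios::binary);
    int64_t len = 0;
    int transaction = 0;
    while (f.read(reinterpret_cast<char*>(&len), sizeof len)) {
        if (len < 0) {
            // Placeholder never replaced: the run was interrupted before
            // the backwards SEEK, so the transaction is incomplete.
            std::printf("Incomplete transaction ignored\n");
            break;
        }
        std::printf("Transaction %d: %lld bytes, skipping forward\n",
                    ++transaction, (long long)len);
        f.seekg(len, std::ios::cur);  // jump over this transaction's data
    }
    return 0;
}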

fcorbelli commented 1 month ago

Because it’s MUCH more complicated than that. Let’s pretend we’re not considering encrypted files; there are still TWO other features that make everything more complex, one in zpaq and one in zpaqfranz. The zpaq one is multipart files (which was actually MY very first addition to zpaq, over 10 years ago, let’s call it the proto-zpaqfranz, which Mahoney later integrated as a standard function in zpaq). In the case of multipart files, transactions are written to DIFFERENT files but have the EXACT same format (there’s also the index issue, but again, nothing is simple, and we’d need an even deeper explanation here).

Anyway, it can happen that a certain file exists (for example, backup0012.zpaq), but INSIDE that file the C block contains the famous zeros. zpaq, upon reading it, says, “great, we have a file that exists, but it’s actually an incomplete transaction.” This cut can "mess up" the procedure for determining the subsequent files (backup0013.zpaq, backup0014.zpaq), and so on.

On top of this, there’s an additional difficulty in zpaqfranz: fixed-length chunks, a user-requested feature that was never implemented by Mahoney and is objectively challenging. It means that when using -chunk X, the transactions DO NOT start at the beginning of the individual .zpaq files. If we also introduce encryption, everything becomes even more complicated because of the 32-byte salt at the start, which requires extreme care to encrypt correctly the data written by the backwards SEEK within the transaction (i.e., in the C block, which initially contains the zeros): the cipher keystream must be positioned at exactly that offset in the file.

While it’s “easy” to SEEK and write data, it’s not easy at all if you need to encrypt the data with a salted key. Is this explanation enough? Because it’s about 15% of what’s really happening.

fcorbelli commented 1 month ago

I hope this "small" introduction (which completely leaves out fragmentation and blocks) gives you a "vague" idea of how writing the documentation is not at all simple, even when using ChatGPT for the translation. 😄 😄 😄 😄 😄 😄 😄 😄 😄

fcorbelli commented 1 month ago

In this "piece" you can see the C block and the jump decoding (and no, there aren't ZEROs, because a transaction can be 0 bytes long. But this will ignite another "spiegone")

                        if (filename.s[17]=='c')  // 'c' marks a transaction (C) block
                        {
                            if (os.size()<8)
                                error("c block too small");
                            data_offset=in.tell()+1-d.buffered();
                            const char* s=os.c_str();
                            int64_t jmp=btol(s);  // decode the 8-byte jump: how far to skip to the next transaction
                            if (flagdebug3)
                                myprintf("02973: jump %s\n",migliaia(jmp));
                            if (jmp<0)  // a negative jump is the never-patched placeholder
                                myprintf("02974: Incomplete transaction ignored\n");

fcorbelli commented 1 month ago

Here you can find the estimation of the RAM needed, computed by findBlock:

bool Decompresser::findBlock(double* memptr) {
  assert(state==BLOCK);
  // Find start of block
  U32 h1=0x3D49B113, h2=0x29EB7F93, h3=0x2614BE13, h4=0x3828EB13;
  // Rolling hashes initialized to hash of first 13 bytes
  int c;
  while ((c=dec.get())!=-1) {
    h1=h1*12+c;
    h2=h2*20+c;
    h3=h3*28+c;
    h4=h4*44+c;
    if (h1==0xB16B88F1 && h2==0xFF5376F1 && h3==0x72AC5BF1 && h4==0x2F909AF1)
      break;  // hash of 16 byte string
  }
  if (c==-1) return false;
  // Read header
  if ((c=dec.get())!=1 && c!=2) error("unsupported ZPAQ level");
  if (dec.get()!=1) error("unsupported ZPAQL type");
  z.read(&dec);
  if (c==1 && z.header.isize()>6 && z.header[6]==0)
    error("ZPAQ level 1 requires at least 1 component");
  if (memptr) *memptr=z.memory();  // report the estimated RAM needed to decompress this block
  state=FILENAME;
  decode_state=FIRSTSEG;
  return true;
}

z, BTW, is a ZPAQL that holds PCOMP
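
For context, here is a hedged sketch of how a caller drives findBlock through libzpaq's public interface (the archive name is hypothetical, error handling is minimal, and only the first block is probed): memptr is how the caller learns the estimated RAM before committing to decompression, which is exactly where the "requires too much memory" decision can be made.

#include <cstdio>
#include <stdexcept>
#include "libzpaq.h"  // assumes zpaq's libzpaq.h / libzpaq.cpp are available

// libzpaq requires the host application to define its error handler.
void libzpaq::error(const char* msg) { throw std::runtime_error(msg); }

// Minimal file-backed implementation of the libzpaq::Reader interface.
struct FileReader : public libzpaq::Reader {
    FILE* f;
    explicit FileReader(FILE* file) : f(file) {}
    int get() { return getc(f); }  // returns -1 at EOF, as libzpaq expects
};

int main() {
    FILE* f = std::fopen("archive.zpaq", "rb");
    if (!f) return 1;
    FileReader in(f);
    libzpaq::Decompresser d;
    d.setInput(&in);
    double mem = 0;
    while (d.findBlock(&mem)) {  // mem = estimated RAM for this block
        if (mem > 1.5e9)
            std::printf("block needs ~%.2f GB: a scanner would skip it\n", mem / 1e9);
        break;  // a real reader would continue with findFilename()/decompress()
    }
    std::fclose(f);
    return 0;
}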

OK, it's dinner time 🍰

mirogeorg commented 1 month ago

Alright, what should I test on version 60.7q?

mirogeorg commented 1 month ago

What does the red ERROR_NO_MORE_FILES error in the screenshot above mean?

fcorbelli commented 1 month ago

You should use -debug, -debug2 or -debug3

This seems to be a multipart archive with an incomplete transaction (i.e., a run that did not end the right way)

mirogeorg commented 1 month ago

Franco, can you make the test return an error if there is a problem with the blocks? It's clear the archive will be corrupted. The exit code should be 'error', so that the error-handling script gets executed.

Yes, the archive is multipart, but actually, it consists only of the first part, which, as you can see from the screenshots, has been created correctly, and the test passes as OK. But it's clear that if this message appears, nothing is really OK.

Because with larger volumes, even with -fragment 6, the same problem would still occur, right?

I will test with -debug, -debug2, -debug3 and will provide feedback.

fcorbelli commented 1 month ago

> Franco, can you make the test return an error if there is a problem with the blocks? It's clear the archive will be corrupted. The exit code should be 'error', so that the error-handling script gets executed.

60.7q already returns an error, even if it is not certain that this really is an error

mirogeorg commented 1 month ago

image

....

image

Looks good now (with or without -debug). Thank you!

fcorbelli commented 1 month ago

Please update to 60.7t. This is about 99.9% of the next release, and you can check the auto-upgrade too 😄