fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Deleting old versions, adding new files to an existing archive #92

Closed: adamantida closed this issue 7 months ago

adamantida commented 7 months ago
  1. Is it possible to delete old versions inside the archive? If not, will this feature be added in a future update?
  2. When adding files to an existing archive, is it necessary to specify the compression level? Or is the level already used in the archive applied to the new files?
fcorbelli commented 7 months ago
  1. Is it possible to delete old versions inside the archive? If not, will this feature be added in a future update?

It is not (practically) possible to delete old versions from an archive. This is because there are no "versions", but rather files divided into fragments and stored inside blocks. So the very same single fragment T could be used by file X in version 3, and by file Y in version 2027 (short explanation). Deduplication does not occur at the file level, but at the "file portion" level.
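
For intuition, here is a minimal, hypothetical Python sketch of fragment-level deduplication. It is not zpaqfranz's actual code: fixed-size fragments are used for simplicity, whereas zpaq's fragment boundaries are content-defined.

    import hashlib

    store = {}      # fragment hash -> fragment bytes, stored once (inside blocks)
    versions = {}   # (version, filename) -> ordered list of fragment hashes

    def add_file(version, name, data, frag_size=4096):
        # Split the file into fragments; store only fragments not already present
        hashes = []
        for i in range(0, len(data), frag_size):
            frag = data[i:i + frag_size]
            h = hashlib.sha1(frag).hexdigest()
            if h not in store:
                store[h] = frag      # new fragment: store it
            hashes.append(h)         # either way, the file just references it
        versions[(version, name)] = hashes

    # The same fragment can be referenced by file X in version 3 and by
    # file Y in version 2027, so "deleting version 3" cannot simply drop
    # its fragments: later versions may still need them.
    add_file(3, "X", b"A" * 4096 + b"B" * 4096)
    add_file(2027, "Y", b"A" * 4096 + b"C" * 4096)
    print(len(store))   # 3 unique fragments stored, not 4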

It is possible to "repackage" all fragments referenced by a certain version into a new archive:

zpaqfranz x z:\1.zpaq *.xls -repack onlyxls.zpaq

But I don't think that's your question. If the problem involves creating archive files that are too large to be handled safely, you can "freeze" them like this:

zpaqfranz a z:\h.zpaq c:\nz\ -freeze y:\archived-zpaq-folder -maxsize 10000000000

Short version: no

  2. When adding files to an existing archive, is it necessary to specify the compression level? Or is the level already used in the archive applied to the new files?

No, because there is no compression setting at the file level, but rather at the individual fragment level (block, actually). So whatever compression level you set for a file (suppose Y), IF a fragment is already in the archive, compressed with level X, that fragment will remain compressed with level X. Only the portions of the file that are new (if any) will be compressed with level Y. There is an attempt in zpaqfranz to adopt a different, file-level method, but I am not pursuing it for now (very complex, little gain).
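
To make the consequence concrete, a hypothetical sketch (zlib stands in for zpaq's codecs; this is not the real data structure):

    import zlib

    archive = {}   # fragment hash -> (compressed bytes, level it was stored with)

    def add_fragment(h, frag, current_level):
        # A fragment already in the archive keeps the level it was first
        # stored with; only genuinely new fragments use the current level
        if h not in archive:
            archive[h] = (zlib.compress(frag, current_level), current_level)
        return archive[h][1]   # the level this fragment actually has

So after a first run at level 4 and a later run at level 1, an updated file ends up referencing a mix: unchanged fragments still at level 4, new ones at level 1.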

It should also be kept in mind that, for all levels except 5, zpaq does NOT compress files that it believes are already compressed, such as .7z, .zip, .mp4, and so on. It does so based on an estimate of the "compressibility" of the files. So even a high level of compression has a modest impact if performed on a set of files that are already mostly compressed. This is not the case for level 5 ("placebo"), which attempts to compress anyway.
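
One simple way to approximate such an estimate (hypothetical; not zpaqfranz's actual heuristic) is to test-compress a sample and see whether it shrinks:

    import zlib

    def looks_already_compressed(sample: bytes, threshold=0.95) -> bool:
        # Test-compress a sample; if it barely shrinks, treat the data as
        # already compressed (.7z, .zip, .mp4, ...) and store it as-is.
        # Levels 1-4 would skip recompression; level 5 compresses anyway.
        if not sample:
            return False
        return len(zlib.compress(sample, 1)) / len(sample) > threshold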

If you want some "BTS"...

https://encode.su/threads/456-zpaq-updates?p=81360&viewfull=1#post81360
https://encode.su/threads/456-zpaq-updates?p=81361&viewfull=1#post81361
https://encode.su/threads/456-zpaq-updates?p=81364&viewfull=1#post81364
https://encode.su/threads/456-zpaq-updates?p=81365&viewfull=1#post81365
https://encode.su/threads/456-zpaq-updates?p=81366&viewfull=1#post81366

You can use the "dump" command

zpaqfranz dump c:\due.zpaq -to z:\1.txt -verbose

to "look inside" an archive You'll get 4 block type

88226: c block (jump)          1
88227: d block (data)         10
88228: h block (hash)         10
88229: i block (index)         8

c blocks are version jumps; in this case there is just a single version. d blocks store the "real" data, while h blocks store the hashes of the "pieces". You have a bijective correspondence between d and h blocks: as many data blocks, as many hash blocks. i blocks are "indexes", aka file storage blocks (where filenames and fragment lists are kept).
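
In summary (informal, same information as above):

    # The four jDC block types shown by dump
    BLOCK_TYPES = {
        "c": "version jump (one per version/transaction)",
        "d": "data: the packed fragments themselves",
        "h": "hashes of the fragments in the matching d block",
        "i": "index: filenames and each file's fragment list",
    }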

In this example

Block 00000027 at           123.858.728: mem                1.116.478 123.863.040
 (same model as block 22)
  60869ac6 jDC20240202184947i0000000006 |     16.057| ->        6.328 I (index)
2020-02-16 01:47:14 C:/7/paige/paige502.jpg  FRAGS (#4):  336-339

the file paige502.jpg is made of 4 fragments, 336 to 339. If these fragments are part of a single block, they will have a single compression level (whatever that is). But a file can "scatter" everywhere, just like this:

2024-01-17 17:04:36 c:/zpaqfranz/1.cpp  FRAGS (#46):  61-81 27 82 29-31 83-86 39 40 87 42-44 88-97

There are 46 fragments: 61 to 81, then 27, then 82... up to 88-97. These fragments can be parts of different blocks, each (block) with its own level of compression. So, to recap: a single file can, in the general case, be compressed by several different methods, not just one.
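
If you want to post-process dump output, a FRAGS list like the one above can be expanded with a few (hypothetical) lines of Python:

    def expand_frags(s: str) -> list[int]:
        # Expand a dump FRAGS list such as "61-81 27 82" into
        # the individual fragment numbers it refers to
        out = []
        for tok in s.split():
            if "-" in tok:
                lo, hi = map(int, tok.split("-"))
                out.extend(range(lo, hi + 1))
            else:
                out.append(int(tok))
        return out

    frags = expand_frags("61-81 27 82 29-31 83-86 39 40 87 42-44 88-97")
    print(len(frags))   # 46, matching FRAGS (#46)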

So it may make sense to run the initial compression (first version) of an entire archive with a high level, for example 4. This will take a very long time, but will produce a small file (e.g. run it at night). Then, in subsequent updates, use the default level (1), which is very fast, to update quickly (e.g. during the day). Each subsequent backup will be fast, but the overall space will be less than it would have been using level 1 all the time.
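
For example (paths are placeholders, modeled on the earlier examples; I am assuming the -m switch selects the compression level, as in zpaq's -method). First run, slow, at night:

zpaqfranz a z:\backup.zpaq c:\data\ -m4

Subsequent runs, fast, during the day (level 1 is the default, so -m1 can be omitted):

zpaqfranz a z:\backup.zpaq c:\data\ -m1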

Briefly: flexibility.

A typical example is mboxes, normally from Thunderbird, where there is a very large stock of e-mails received over the years, but virtually never opened again, plus a new portion of newly arrived messages. Something like this:

|large portion of messages up to 2023 (100GB)||messages from 2024 (100MB)|

There are other possible adjustments, such as fragment and block size, to "squeeze out" the maximum performance. However, I do NOT suggest them: the default settings are (in my opinion) a very good compromise in the average case; the user does not have to worry about these details.

Short version: no, zpaq does NOT store the compression level of a file

adamantida commented 7 months ago

Thank you for answering my questions. Hopefully, if others are wondering about these questions, they will see this issue.