fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License
259 stars 22 forks source link

A test with the files which actually collide #82

Open ghost opened 11 months ago

ghost commented 11 months ago

Attached is a zip file with two files which have the same sha1 but different sha256.

I'm trying to add the folder with them to the zpaqfranz archive, and I always see only one of them stored under the two names, not two of them:

C:\tmp\collision>unzip collision-example.zip
Archive:  collision-example.zip
   creating: baad/
 extracting: baad/messageA
 extracting: baad/messageB

C:\tmp\collision>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 208feafe1c6a95c73f662514ac48761f25e1f3b74922521a98d9ce287f4a2197

C:\tmp\collision>openssl dgst -sha1 baad\*
SHA1(baad\messageA)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0
SHA1(baad\messageB)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0

C:\tmp\collision>zpaqfranz.exe a baad.zpaqf baad
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
Creating baad.zpaqf at offset 0 + 0
Add 2023-10-02 03:39:47         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.842) = 1.842 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result

C:\tmp\collision>cd result

C:\tmp\collision\result>..\zpaqfranz.exe x ..\baad.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad.zpaqf:
1 versions, 3 files, 1.842 bytes (1.80 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T

0.031 seconds (000:00:00) (all OK)

C:\tmp\collision\result>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

Obviously one content under two names is there.

I've read "additional checks" are default, it's unexpected. Maybe the switch is needed:

C:\tmp\collision\result>cd ..

C:\tmp\collision>zpaqfranz.exe a baad-2.zpaqf baad -sha256
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-sha256 -noconsole
Creating baad-2.zpaqf at offset 0 + 0
Add 2023-10-02 03:41:54         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.982) = 1.982 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result-256

C:\tmp\collision>cd result-256

C:\tmp\collision\result-256>..\zpaqfranz.exe x ..\baad-2.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad-2.zpaqf:
1 versions, 3 files, 1.982 bytes (1.94 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T

0.032 seconds (000:00:00) (all OK)

C:\tmp\collision\result-256>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

collision-example.zip

fcorbelli commented 11 months ago

This is expected behavior
There is no way to avoid SHA-1 collisions (other than breaking backwards compatibility of .zpaq archives, which I don't want to do)

zpaq simply doesn't detect anything, it doesn't provide any warnings. Look at this example

Z:\pippo>zpaqfranz sum message* -sha256
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha256 -hw
Getting SHA-256 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.000000
|SHA-256: 208FEAFE1C6A95C73F662514AC48761F25E1F3B74922521A98D9CE287F4A2197 [                640]     |messageB
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640]     |messageA

SHA-1 collision

Z:\pippo>zpaqfranz sum message* -sha1
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha1 -hw
Getting SHA-1 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.000000
|SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 [                640]     |messageA
|SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 [                640] === |messageB
Z:\pippo>zpaq64 a 715.zpaq *
zpaq v7.15 journaling archiver, compiled Aug 17 2016
Creating 715.zpaq at offset 0 + 0
Adding 0.001280 MB in 2 files -method 14 -threads 32 at 2023-10-02 08:28:34.
+ messageA 640
+ messageB 640 -> 0
[1..1] 652 -method 14,20,0
2 +added, 0 -removed.

0.000000 + (0.001280 -> 0.000640 -> 0.001739) = 0.001739 MB
0.016 seconds (all OK)

Z:\pippo>zpaq64 x 715.zpaq -test
zpaq v7.15 journaling archiver, compiled Aug 17 2016
715.zpaq: 1 versions, 2 files, 1 fragments, 0.001739 MB
Extracting 0.001280 MB in 2 files -threads 32
[1..1] -> 652
> messageA
> messageB
0.000 seconds (all OK)

Then

Z:\pippo>zpaq64 x 715.zpaq -to extracted
zpaq v7.15 journaling archiver, compiled Aug 17 2016
715.zpaq: 1 versions, 2 files, 1 fragments, 0.001739 MB
Extracting 0.001280 MB in 2 files -threads 32
[1..1] -> 652
> extracted/messageA
> extracted/messageB
0.016 seconds (all OK)

Finally

Z:\pippo>zpaqfranz sum extracted -sha256
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:sum                                       1 - command
franz:-sha256 -hw
Getting SHA-256 ignoring .zfs and :$DATA

No multithread: Found (1.25 KB) => 1.280 bytes (1.25 KB) / 2 files in 0.016000
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640]     |extracted/messageA
|SHA-256: 3EAD211681CEC93D265C8AC123DD062E105408CEBF82FA6E2B126F4F40BCB88C [                640] === |extracted/messageB

0.079 seconds (000:00:00) (all OK)

As you can see, zpaq

zpaq happily extracts the data, without any warning that a collision has occurred. Attention, we are talking about DETECTION, not AVOIDANCE (impossible to avoid with this archive file format)

fcorbelli commented 11 months ago

Now let's see zpaqfranz default

Z:\pippo>zpaqfranz a zpaqfranz.zpaq messag*
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw
Creating zpaqfranz.zpaq at offset 0 + 0
Add 2023-10-02 10:33:04         2              1.280 (   1.25 KB) 32T (0 dirs)
2 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.803) = 1.803 @ 83.33 KB/s

0.031 seconds (000:00:00) (all OK)

This time, if you test the file, you'll get

Z:\pippo>zpaqfranz t zpaqfranz.zpaq
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw
zpaqfranz.zpaq:
1 versions, 2 files, 1.803 bytes (1.76 KB)
To be checked 1.280 in 2 files (32 threads)
7.15 stage time       0.02 no error detected (RAM ~34.07 MB), try CRC-32 (if any)
Checking                 2 blocks with CRC-32 (1.280 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.00s
Blocks               1.280 (           2)
Zeros                    0 (           0) 0.000000 s
Total                1.280 speed 1.280.000/sec (1.22 MB/s)
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------
GOOD            : 00000001 of 00000002 (stored=decompressed)
WITH ERRORS

0.032 seconds (000:00:00) (with warnings)

If you put a "-verify" in add

Z:\pippo>zpaqfranz a test2.zpaq mess* -verify
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-hw -verify
Creating test2.zpaq at offset 0 + 0
Add 2023-10-02 10:35:52         2              1.280 (   1.25 KB) 32T (0 dirs)
29604 SOMETHING WRONG ON messageA

GURU-C: on file messageA
GURU: CRC-32 from fragments 92433266
GURU: CRC-32 from file      072E2B0E
2 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.794) = 1.794 @ 40.32 KB/s

Short version: no way to DETECT a SHA-1 collision in zpaq yes, you can DETECT SHA-1 collision in zpaqfranz

EDIT: for performance reasons, the double-check is NOT active during add() but in test(). Older zpaqfranz builds do this way, but it was a modified behavior so as not to slow down in the normal case

ghost commented 11 months ago

EDIT: for performance reasons, the double-check is NOT active during add() but in test(). Older zpaqfranz builds do this way, but it was a modified behavior so as not to slow down in the normal case

When I pack 3 files, messageA, messageB and some-big.mp4 then, to find the collision, if I invoke "t" the time needed will be proportional to the number of all the bytes in the archive:

Checking                68 blocks with CRC-32 (1.089.413.820 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.01s
Blocks       1.089.413.820 (          68)
Total        1.089.413.820 speed 34.044.181.875/sec (31.71 GB/s)
  ...
ERRORS        : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
   ...
6.375 seconds (000:00:06) (with warnings)

But if I could just get a list of the filenames, the values of the stored crc32 and the values of the sha1 of each file, I could detect the collisions in the above example in a few milliseconds, i.e. not bound by the number of bytes stored. It's just a comparison of crc32 values for the same sha1 values.

1) How can I get such a list from the archive using your program? If it's not possible, what do you think about implementing in your program such kind of listing?

2) What do you think about implementing a possibility to perform such kind of a collision detection in your program (not bound by the number of bytes stored)?

fcorbelli commented 11 months ago

When I pack 3 files, messageA, messageB and some-big.mp4 then, to find the collision, if I invoke "t" the time needed will be proportional to the number of all the bytes in the archive: (...) But if I could just get a list of the filenames, the values of the stored crc32 and the values of the sha1 of each file, I could detect the collisions in the above example in a few milliseconds, i.e. not bound by the number of bytes stored. It's just a comparison of crc32 values for the same sha1 values.

1. How can I get such a list from the archive using your program? If it's not possible, what do you think about implementing in your program such kind of listing?

with the -checksum switch

zpaqfranz l test.zpaq -checksum
2. What do you think about implementing a possibility to perform such kind of a collision detection in your program (not bound by the number of bytes stored)?

This require to compute and store the SHA-1 of the entire file (aka: overhead) But yes, you can, if you really want

zpaqfranz a test5 message* -sha1
Z:\pippo>zpaqfranz l test5.zpaq -checksum
zpaqfranz v58.10n-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-09-30)
franz:-checksum -hw
test5.zpaq:
1 versions, 2 files, 1.810 bytes (1.77 KB)

- 2023-10-02 00:53:08                 640 A     SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 CRC32: 072E2B0E messageA
- 2023-10-02 00:53:11                 640 A     SHA-1: 8AC60BA76F1999A1AB70223F225AEFDC78D4DDC0 CRC32: 92433266 messageB

                1.280 (1.25 KB) of 1.280 (1.25 KB) in 2 files shown
                1.810 compressed

0.000 seconds (000:00:00) (all OK)
ghost commented 11 months ago

For my example I see:

 ../zpaqfranz.exe l ../a1.zpaq -checksum
   ...
- 2023-10-02 00:53:08                 640 A     XXHASH64: 5ED04C85E7C36E96 CRC32: 072E2B0E messageA
- 2023-10-02 00:53:11                 640 A     XXHASH64: E9127211F02DB4EE CRC32: 92433266 messageB
- 2022-05-19 15:12:12       1.089.412.540 A     XXHASH64: 27841A60E9F3A8CF CRC32: A55A20FE some-big.mp4

That is, I see different XXHASH64 anyway, so crc32 appears completely unnecessary?

Then, if I'd use the original zpaq (I can't get the same result with your) I can get the list of the files which are "same" even if now I know they have different XXHASH64:

zpaq l ../a1.zpaq -s-1
zpaq v7.15 journaling archiver, compiled Aug 17 2016
../a1.zpaq: 1 versions, 3 files, 15292 fragments, 952.652696 MB

- 2023-10-01 22:53:08          640 A     messageA 1
- 2023-10-01 22:53:11          640 A     messageB 1
- 2022-05-19 13:12:12   1089412540 A     some-big.mp4 2-15292    

Which means that I can recognize the collisions without decompressing all the files, without even knowing crc32, and all that based on the information already stored in the existing archive packed by zpaqfranz

What do you think about all that?

fcorbelli commented 11 months ago

For my example I see:

 ../zpaqfranz.exe l ../a1.zpaq -checksum
You missed the -sha1 switch (creating the archive)

That is, I see different XXHASH64 anyway, so crc32 appears completely unnecessary?
No, because CRC-32 is used for SHA-1-collision-detection

Then, if I'd use the original zpag (I can't get the same result with your) I can get the list of the files which are "same" even if now I know they have different XXHASH64:

zpaq l ../a1.zpaq -s-1 zpaq v7.15 journaling archiver, compiled Aug 17 2016 ../a1.zpaq: 1 versions, 3 files, 15292 fragments, 952.652696 MB

  • 2023-10-01 22:53:08 640 A messageA 1
  • 2023-10-01 22:53:11 640 A messageB 1
  • 2022-05-19 13:12:12 1089412540 A random.bin 2-15292

Which means that I can recognize the collisions without decompressing all the files, without even knowing crc32, and all that based on the information already stored in the existing archive packed by zpaqfranz

What do you think about all that?

You can't at all, with zpaq

With zpaqfranz you can (manually) if you use the -sha1 switch (when adding files)

I can make a new (simply) switch that

Or a new (more work) switch that

ghost commented 11 months ago

You missed the -sha1 switch (creating the archive)

I have just given you an example where without that -sha1 all necessary info to report the collision is already stored in the file!

Specifically, it is known that both files share the same fragment (fragment 1):

 - 2023-10-01 22:53:08          640 A     messageA 1
 - 2023-10-01 22:53:11          640 A     messageB 1

which is what zpaq is able to list with

 zpaq l ../a1.zpaq -s-1

as I've already shown. Please try yourself.

fcorbelli commented 11 months ago

No, because different files can share the same fragment That's exactly the "deduplicator"

fcorbelli commented 11 months ago
  jDC20231002114900c0000000001 8 jDC ->  8 OK
    csize = 1061 at 104
  jDC20231002114900d0000000001 652 jDC ->  652 OK
    1061 -> 652 [1..2) = 640 OK
  jDC20231002114900h0000000001 28 jDC ->  28 OK
    [1..2) 1061 -> 640 OK
  jDC20231002114900i0000000001 68 jDC ->  68 OK
    20231001225308 messageA 7720000000 1
    20231001225311 messageB 7720000000 1

Nothing "strange" here

ghost commented 11 months ago

No, because different files can share the same fragment That's exactly the "deduplicator"

If two files share all the fragments, but they have different XXHASH64 they were deduplicated due to the same sha1 even if they shouldn't have been (the different fragments should have been stored). I assumed that is exactly what one would like to know?

fcorbelli commented 11 months ago

I could extract the list of fragments of each file, and verify that those with the same list have different CRC-32. Yes, I can, but does it make sense? This is a function for command t (test) not of a (add) A corruption of a block, which you cannot check from the fragment list, is much more frequent (and dangerous) than a SHA-1 collision

ghost commented 11 months ago

does it make sense?

Why would anybody want to not be notified of data loss produced by the execution of "a" ?

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time. I am not suggesting stopping the process, but precisely reporting the problem of "a" which could be known at the very time "a" is perfomed.

fcorbelli commented 11 months ago

does it make sense?

Why would anybody want to not be notified of data loss produced by the execution of "a" ?

Because it's veeeery uncommon

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time. I don't suggest stopping the process, but precisely reporting the problem of "a" which could be known at the very time "a" is perfomed.

zpaqfranz can already do that, but with more overhead Some users want faster, not safer.

I need to do some tests on fragment packaging times, for a quantitative evaluation

ghost commented 11 months ago

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time.

zpaqfranz can already do that, but with more overhead

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

fcorbelli commented 11 months ago

I'll try to explain better During the add phase, for each file, I should keep the list of fragments that compose it (somehow compact) and its CRC-32, and then search for each file already in the archive (and there can be millions) , if the list of fragments (which can be hundreds of thousands per file) matches. It's doable, with a little effort, but the overhead could be significant, i.e. slowing down every single add, whether the collision occurs or not

fcorbelli commented 11 months ago

If that would be immediately reported, one would be able to immediately take actions to still preserve the data which could otherwise be lost at the moment one does "t" which can take time.

zpaqfranz can already do that, but with more overhead

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

With the -verify switch

ghost commented 11 months ago

Again, I still miss this: how can one recognize the error of "a" without running "t" in zpaqfranz ?

With the -verify switch

Ah, thanks, I still don't understand what is the cause of performance hit you claim in:

[EDIT: for performance reasons, the double-check is NOT active during add() but in test().

I still believe the overhead of the verification if there was data loss due to the run of "a" could be practically unnoticeable, compared to the current cost of "t"

Without the existence of the list of fragments the decompression is impossible, and that list could be generated, if it already doesn't exist, much faster than decompressing all the files.

And if the "list" of fragments in the archive is already a stored as a tree then comparing if two files are stored as the same "list" of fragments is equivalent to comparison if two files have the identical root node?

ghost commented 11 months ago

I fully agree with you that "a corruption of a block, which you cannot check from the fragment list, is much more frequent (and dangerous) than a SHA-1 collision".

fcorbelli commented 11 months ago

Right now, just for fun, I'm writing a hash "packager" of the fragment verctos. The real question to ask is whether I also want to keep the file name or not That is, whether to have a "SHA-1 collision found" warning, period, or "collision with file X". In the second case more memory and more time, but perhaps it is better

...work in progress...

ghost commented 11 months ago

IMO it should be not about "sha1 collision" as such but about the "data loss" due to the deduplication algorithm used in "a", even if it were something else and I think such "data loss" can be detected as soon as the algorithm used for deduplication is not the same as some checksum which is also available in the process.

Regarding the file names, I would expect they are also in some way deduplicated in RAM and only references to the "roots" of them used in the code?

ghost commented 11 months ago

IMO it should be not about "sha1 collision" as such but about the "data loss" due to the deduplication algorithm used in "a", even if it were something else and I think such "data loss" can be detected as soon as the algorithm used for deduplication is not the same as some checksum which is also available in the process.

And I still don't know if some additional checksum is anyway computed, could it be used to even automatically avoid data loss during "a" due to otherwise "too simple" deduplication algorithm -- to automatically store what would be otherwise "lost".

fcorbelli commented 11 months ago

And I still don't know if some additional checksum is anyway computed, could it be used to even automatically avoid data loss during "a" due to otherwise "too simple" deduplication algorithm -- to automatically store what would be otherwise "lost".

The archive format does not support anything different.
Standard zpaq does not store any kind of checksum (no CRC-32) at file level.
It is not possible to store more data for single fragments

ghost commented 11 months ago

Standard zpaq does not store any kind of checksum (no CRC-32) at file level. It is not possible to store more data for single fragments

That is about what is stored in the archive. But your program calculates more by default, and maybe that information could be used to force storing "colliding" fragments in the archive, even if the original zpaq would not store them, and still keep the archive readable by zpaq. E.g. if zpaq identifies the stored as the numbers (that's how zpaq reports it) then why not storing messageA as fragment 1, messageB as fragment 2 in that example, if it could be known that messageA and messageB are not the same?

fcorbelli commented 11 months ago

Standard zpaq does not store any kind of checksum (no CRC-32) at file level. It is not possible to store more data for single fragments

That is about what is stored in the archive. But your program calculates more by default, and maybe that information could be used to force storing "colliding" fragments in the archive, even if the original zpaq would not store them, and still keep the archive readable by zpaq. E.g. if zpaq identifies the stored as the numbers (that's how zpaq reports it) then why not storing messageA as fragment 1, messageB as fragment 2 in that example, if it could be known that messageA and messageB are not the same?

zpaq does not identifies files by numbers, but by full filename
<"messageA",[list of fragments]> <"messageB",[list of fragments]>

fcorbelli commented 11 months ago

Please check the attached pre-release 58_11a.zip

ghost commented 11 months ago

Can you please explain what can be seen in that 58_11a.zip? I some fail to see anything on my example of these 3 files.

fcorbelli commented 11 months ago

If you do anything, for example a l (list), you should see a collision detected

fcorbelli commented 11 months ago
Z:\pippo>zpaqfranz a test1.zpaq message*
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
Creating test1.zpaq at offset 0 + 0
Add 2023-10-02 15:43:25         3              1.920 (   1.88 KB) 32T (0 dirs)
3 +added, 0 -removed.

0 + (1.920 -> 640 -> 1.811) = 1.811 @ 117.19 KB/s

0.016 seconds (000:00:00) (all OK)

Z:\pippo>zpaqfranz l test1.zpaq
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
test1.zpaq:
1 versions, 3 files, 1.811 bytes (1.77 KB)

#####################################################################################################
87487: **** SUSPECTED SHA-1 COLLISION ****
File   <<messageC>>
map to <<messageA>>
#####################################################################################################

- 2023-10-02 00:53:08                 640 A     messageA
- 2023-10-02 00:53:11                 640 A     messageB
- 2023-10-02 00:53:08                 640 A     messageC

                1.920 (1.88 KB) of 1.920 (1.88 KB) in 3 files shown
                1.811 compressed

0.016 seconds (000:00:00) (all OK)

Z:\pippo>
fcorbelli commented 11 months ago
Z:\pippo>zpaqfranz t test1.zpaq
zpaqfranz v58.11a-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-hw
test1.zpaq:
1 versions, 3 files, 1.811 bytes (1.77 KB)

#####################################################################################################
87487: **** SUSPECTED SHA-1 COLLISION ****
File   <<messageC>>
map to <<messageA>>
#####################################################################################################

To be checked 1.920 in 3 files (32 threads)
7.15 stage time       0.01 no error detected (RAM ~34.07 MB), try CRC-32 (if any)
Checking                 3 blocks with CRC-32 (1.920 not-0 bytes)
ERROR:  STORED CRC-32 92433266 != DECOMPRESSED 072E2B0E (ck 00000001) messageB

CRC-32 time           0.00s
Blocks               1.920 (           3)
Zeros                    0 (           0) 0.000000 s
Total                1.920 speed 1.920.000/sec (1.83 MB/s)
ERRORS          : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------
GOOD            : 00000002 of 00000003 (stored=decompressed)
WITH ERRORS

0.063 seconds (000:00:00) (with warnings)
fcorbelli commented 11 months ago

OK, my fault, I am running with 3 different files 😄

fcorbelli commented 11 months ago

Just a "bit" of overkill: fragment collision checked with TWO hashes: SHA-256 AND SHA-3 58_11c.zip

Please try

ghost commented 11 months ago

It's fast clearly fast:

C:\tmp\collision\baad>..\zpaqfranz a ..\a.zpqf *
zpaqfranz v58.11c-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-noconsole
Creating ../a.zpqf at offset 0 + 0
Add 2023-10-02 18:37:32         3      1.089.413.820 (   1.01 GB) 8T (0 dirs)
3 +added, 0 -removed.

0 + (1.089.413.820 -> 1.089.413.180 -> 952.652.699) = 952.652.699 @ 42.81 MB/s

24.328 seconds (000:00:24) (all OK)

C:\tmp\collision\baad>..\zpaqfranz l ..\a.zpqf
zpaqfranz v58.11c-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-02)
franz:-noconsole
../a.zpqf:
1 versions, 3 files, 952.652.699 bytes (908.52 MB)

##########################################################################################
87487: **** SUSPECTED SHA-1 COLLISION ****
Fragment hash SHA-256 67ABDD721024F0FF4E0B3F4C2FC13BC5BAD42D0B7851D456D88D203D15AAA450
Fragment hash SHA-3   295CD1698C6AC5BD804A09E50F19F8549475E52DB1C6EBD441ED0C7B256E1DDF
File   CRC-32 92433266 <<messageB>>
map to CRC-32 072E2B0E <<messageA>>
##########################################################################################

- 2023-10-02 00:53:08                 640 A     messageA
- 2023-10-02 00:53:11                 640 A     messageB
- 2022-05-19 15:12:12       1.089.412.540 A     some-big.mp4

        1.089.413.820 (1.01 GB) of 1.089.413.820 (1.01 GB) in 3 files shown
          952.652.699 compressed

0.062 seconds (000:00:00) (all OK)

Many times faster than anything before, in this test case just 0.062 seconds!

And just to be clear IMO there's nothing that should be called "suspected" here, and from the user's perspective it's not about the sha1 but about the missing data: the "a" action clearly failed to store in the archive data that a user would expect it to do, in spite of them being present at the beginning and the program clearly knowing that they are different from those stored!

fcorbelli commented 11 months ago

And just to be clear IMO there's nothing that should be called "suspected" here, and from the user's perspective it's not about the sha1 but about the missing data: the "a" action clearly failed to store in the archive data that a user would expect it to do, in spite of them being present at the beginning and the program clearly knowing that they are different from those stored!

No, the program (zpaq) does not "know" that the files are different, at all There is not the "concept" of file, inside zpaq There are fragments: a file is a vector of fragments And the same fragment (aka: with the same SHA-1) is considerated equal That's a SHA-1 collision on a fragment

ghost commented 11 months ago

I don't think it's true. The program decided to not store both fragments only because at that moment it depended on the logic of "they have same checksum X" (for the user, it doesn't matter which exact algorithm X is) while at the same time it can verify for sure that two fragments have provably different checksums Y (and even Z) and even compare the content byte-by-byte etc. Therefore the program can be at that moment practically 100% sure that the fragments do have different content and that one of the fragments just wasn't stored in the archive only due to the bad logic already described. AFAIK at the moment of deduplication it's known exactly what content is rejected due to bad deduplication conclusion.

fcorbelli commented 11 months ago

Well, I think to know how zpaq works 😄 and no, byte-by-byte comparing is infeasible, this require a full extraction of everything during the add phase Aka: 100x slower, maybe more You cannot compare an uncompressed "chunk" (aka: a piece of a file readed from filesystem) to a compressed one (aka: a compressed fragment inside the archive). You have to decompress the compressed "chunk", THEN compare byte by byte Moreover the compressed "chunk" can be everywhere (aka: rotational seek latency), with more that severe trashing on spinning drives

Short version: the "real" problem is the file archive format, not the ideas behind

ghost commented 11 months ago

no, byte-by-byte comparing is infeasible, this require a full extraction of everything during the add phase

I never suggested that. I'm claiming that

then the program could be already sure that the dedup logic DL1 is false and that it wrongly dedups what it shouldn't. As you before claim that it's "not sure" THEN, as the counterargument, only these bytes could be compared, which I don't think is ever needed. A checksum mismatch is a certainty: DL1 made a wrong decision to consider a lot of different bytes as the same and if you create such an archive, we are sure it contains wrong data for one of the files.

fcorbelli commented 11 months ago

even compare the content byte-by-byte etc

"even compare the content byte-by-byte etc"

if the dedup logic DL1 decides not to store tje fragment-candidate 2 looking at the checksum of X1 of the fragment 1 and checksum Xc2 of the fragment-candidate 2 and

you describe a "trivial" single version deduplicator, but zpaq does not work this way

you have a vector of fragments, each compressed, and with SHA-1 of uncompressed
the deduplicator "split" via a rolling hash the input file, computing SHA-1 of the uncompressed data. then search, in all previous fragment, if the same SHA-1 does exists. In this case, instead of storing the "unique" fragment, put a pointer to the founded one. If the SHA-1 is unique (in current SHA-1 list of all fragments), then the fragment is compressed, stored, and added to the pool of fragments.

On the next round, the new version or snapshot, everything start again. zpaq does NOT deduplicate only a single run, but ALL of them
This is why so much powerful

fcorbelli commented 11 months ago

58_11e.zip The attached pre-release (detection works only on command l) get the current close-to-release algorithm It slow down of about 10% the read_archive function
It is possible to check the speed with something like

zpaqfranz l the.zpaq -out z:\1.txt -silent
zpaqfranz l the.zpaq -out z:\1.txt -silent -715

-715 will disable collision check
I do not think to apply as default: a general slowdown is not good
I'll think about it

ghost commented 11 months ago

Let me restate my point again: it's nothing "suspected" there, it's a clear data loss:

1) you start with two different files (different content inside) on your disk 2) you create an archive 3) you lose the original disk, but "no problem you have an archive" 4) so you unpack the archive 5) now you have on the disk two identical files containing the content of one of the two initial files. The archive never contained what you wanted to preserve (the content of two files).

What is to be called "suspected" there? The case is clear. Data missing in the archive, and the report should say for which file exactly. Now I get:

87487: **** SUSPECTED SHA-1 COLLISION F42F94001FCB5351 ****
File   CRC-32 92433266 <<messageB>>
map to CRC-32 072E2B0E <<messageA>>

And it's unclear. Is the content of messageB or messageA actually missing in archive? You show two original crc32 values which would exist IF both were in the archive, but we're sure that the content of one of the two original files is actually NOT present in the archive. So, which one is missing in the archive? Can that be simply clearly written?

BTW what is 87487, can that mean anything to me? What is F42F94001FCB5351, can I use that? Once again: "collision" is not the word I as the user want to see (even if we discussed the internal logic so I'm aware what it is related to) I just want to know what is actually NOT in the archive, against the expectations.

And yes I know it's totally improbable for anybody of us to ever see that on files not intentionally created for the effect. But we do know more such files definitely could be intentionally made. That's why we discuss handling of such files on the example of these two.

fcorbelli commented 11 months ago

-1 you start with a file, you add to archive 0 you add another file to the archive. The first file do not exists anymore

What is 87487? The bookmark in the source code What is F42F94001FCB5351? The XXHASH of the fragment list of unsigned int Why suspected collision? Because it is, a suspected collision. Impossible to say for sure (see -1 and 0)

As a user you shoud know aka DETECT something is wrong
Btw I am thinking on the dirtiest workaround ever, aka encrypt the file, then add as a virtual one

But this become a lot of work, and very fragile too

ghost commented 11 months ago

-1 you start with a file, you add to archive 0 you add another file to the archive. The first file do not exists anymore

I think it is more complicated than that: zpaq (original) is " incremental, journaling archiver": I can still see the name of the file added in the step you called -1 and extract it, even if it is not present anywhere except in the archive in the step you called 0.

An example: unpacking the original zpaq, then running the following script in that directory:

mkdir ex2
cd ex2
copy ..\readme.txt .
zpaq a ..\e.zpaq *
zpaq l ..\e.zpaq -all -s-1
del readme.txt
copy ..\Makefile .
zpaq a ..\e.zpaq *
zpaq l ..\e.zpaq -all -s-1
del Makefile
zpaq x ..\e.zpaq -until 1

The first list command gives:

zpaq v7.15 journaling archiver, compiled Aug 17 2016
../e.zpaq: 1 versions, 1 files, 1 fragments, 0.003583 MB

- 2023-10-03 08:20:02         3791       0001/ +1 -0 -> 3583 1-1
- 2016-08-17 15:43:14         3791 A     0001/readme.txt 1

0.003791 MB of 0.003791 MB (2 files) shown
  -> 0.003791 MB (1 refs to 1 of 1 frags) after dedupe
  -> 0.003583 MB compressed.
0.000 seconds (all OK)

The second list command gives:

zpaq v7.15 journaling archiver, compiled Aug 17 2016
../e.zpaq: 2 versions, 2 files, 2 fragments, 0.005109 MB

- 2023-10-03 08:20:02         3791       0001/ +1 -0 -> 3583 1-1
- 2016-08-17 15:43:14         3791 A     0001/readme.txt 1
- 2023-10-03 08:20:03          872       0002/ +1 -1 -> 1526 2-2
- 2016-08-17 15:42:24          872 A     0002/Makefile 2
-                                0       0002/readme.txt

0.004663 MB of 0.004663 MB (4 files) shown
  -> 0.004663 MB (2 refs to 2 of 2 frags) after dedupe
  -> 0.005109 MB compressed.
0.000 seconds (all OK)

readme.txt is already deleted outside of the archive but it is still in the archive, and I can still unpack it with "until" so the last step gives:

zpaq v7.15 journaling archiver, compiled Aug 17 2016
../e.zpaq -until 1: 1 versions, 1 files, 1 fragments, 0.003583 MB
Extracting 0.003791 MB in 1 files -threads 8
[1..1] -> 3803
> readme.txt
0.000 seconds (all OK)
fcorbelli commented 11 months ago

As I said, zpaq does not detect or handle somehow SHA-1 collisions. Fullstop

This attached pre-release 58_11g.zip does

The -verbose switch shows more technical infos

ghost commented 11 months ago

Looks much better.

Still, a variation of my previous script shows that the data loss is undetected at the moment it happens, and also during the default list:

I have in a directory messageA, messageB and zpaqfranz, then if I run:

mkdir ex2
cd ex2
copy ..\messageA .
..\zpaqfranz a ..\e.zpaq *
..\zpaqfranz l ..\e.zpaq -all -s-1
del messageA
copy ..\messageB .
rem ---- 2. adding directory with only messageB ---
rem ---- 2.1 ADD ---
..\zpaqfranz a ..\e.zpaq *
rem ---- 2.2 LIST ---
..\zpaqfranz l ..\e.zpaq
rem ---- 2.2 LIST checksum ---
..\zpaqfranz l ..\e.zpaq -checksum
rem ---- 2.4 LIST ALL ---
..\zpaqfranz l ..\e.zpaq -all -s-1

No data loss will be recognized, even if the archive is known to be incorrect, until the last command. First, add a directory where messageB is, but messageA doesn't exist anymore:

>rem ---- 2. adding directory with only messageB --- 

>rem ---- 2.1 ADD --- 

>..\zpaqfranz a ..\e.zpaq * 
zpaqfranz v58.11f-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-03)
franz:-noconsole 
../e.zpaq: 
1 versions, 1 files, 1.764 bytes (1.72 KB)
Scanned          1 00:00:00      1.000 file/s (                  640)

Updating ../e.zpaq at offset 1.764 + 0
Add 2023-10-03 11:31:19         1                640 (  640.00 B) 8T (0 dirs)
Warning: adjusting date from 2023-10-03 11:31:19 to 2023-10-03 11:31:20

1 +added, 1 -removed.

1.764 + (640 -> 0 -> 592) = 2.356 @ 39.06 KB/s

0.016 seconds (000:00:00) (all OK)

So data loss already happened here, (what's stored in the archive is not what's in the file in the directory) but nothing was reported. Then...

>rem ---- 2.2 LIST --- 

>..\zpaqfranz l ..\e.zpaq 
zpaqfranz v58.11f-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-03)
franz:-noconsole 
../e.zpaq: 
2 versions, 2 files, 2.356 bytes (2.30 KB)

- 2023-10-02 00:53:11                 640 A     messageB

                  640 (640.00 B) of 640 (640.00 B) in 1 files shown
                2.356 compressed 

0.000 seconds (000:00:00) (all OK)

Data loss is present, still nothing was reported.

 >rem ---- 2.3 LIST checksum --- 

>..\zpaqfranz l ..\e.zpaq -checksum 
zpaqfranz v58.11f-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-03)
franz:-checksum -noconsole 
../e.zpaq: 
2 versions, 2 files, 2.356 bytes (2.30 KB)

- 2023-10-02 00:53:11                 640 A     XXHASH64: E9127211F02DB4EE CRC32: 92433266 messageB

                  640 (640.00 B) of 640 (640.00 B) in 1 files shown
                2.356 compressed 

0.000 seconds (000:00:00) (all OK)

Still nothing was reported. Moreover, the program lied in this output: the checksum CRC32: 92433266 is NOT the checksum of the messageB but of the messageA. We know messageB was not stored at all, and fragment from which that crc32 was made doesn't exist in the archive. And we still don't know that something is wrong.

Finally:

>rem ---- 2.4 LIST ALL --- 

>..\zpaqfranz l ..\e.zpaq -all -s-1 
zpaqfranz v58.11f-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-03)
franz:-all                                      4
Unknown option ignored: -s-1
franz:-noconsole 
../e.zpaq: 
2 versions, 2 files, 2.356 bytes (2.30 KB)

- 2023-10-03 11:45:29                   0       0001| +1 -0 -> 1.764
- 2023-10-02 00:53:08                 640 A     0001|messageA
- 2023-10-03 11:45:30                   0       0002| +1 -1 -> 592
-                                       0       0002|messageA
- 2023-10-02 00:53:11                 640 A     0002|messageB

                1.280 (1.25 KB) of 1.280 (1.25 KB) in 5 files shown
                2.356 compressed 

################################################################################################
87571: Restoring this file will have incorrect data due to suspected SHA-1 collision(s)
<<0002|$1messageB>>
################################################################################################

0.000 seconds (000:00:00) (all OK)

The data loss made in 2.1 is recognized! :+1: Tangentially, I don't know what the meaning of $1 in the <<0002|$1messageB>> is, and also why << >> are there.

Still, good progress! :+1:

ghost commented 11 months ago

(Please use the last edit in the previous, or this script:

mkdir ex2
cd ex2
copy ..\messageA .
..\zpaqfranz a ..\e.zpaq *
..\zpaqfranz l ..\e.zpaq -all -s-1
del messageA
copy ..\messageB .
rem ---- 2. adding directory with only messageB ---
rem ---- 2.1 ADD ---
..\zpaqfranz a ..\e.zpaq *
rem ---- 2.2 LIST ---
..\zpaqfranz l ..\e.zpaq
rem ---- 2.2 LIST checksum ---
..\zpaqfranz l ..\e.zpaq -checksum
rem ---- 2.4 LIST ALL ---
..\zpaqfranz l ..\e.zpaq -all -s-1

)

fcorbelli commented 11 months ago

(1) cannot detect, during a (add), a collision on different versions (2) cannot detect, during l (list), a collision on different versions (3) CAN, during list -all, detect a collision on different versions (4) << and >> are here because I want to (cannot be part of a filename) (5) the $1 is a placeholder internally used "can't" means "without a lot of slowdown" => I will leave in the a (add) small slowdown I will strip from l I will make another command collision to quickly check against collisions

ghost commented 11 months ago

That's a good argument for the "check data loss" command. Intuitively, if there's more than "just plain l" (any verification that anyway takes more time, I think it should also be there). I also like the detection in "a". :+1:

fcorbelli commented 11 months ago

58_11h.zip The attached pre-release

ghost commented 11 months ago

Looks good.

ERROR in rebuilded

Should be "rebuilt", but even clearer would be: "ERROR: Recalculated ... doesn't match stored ..." then give the matching filename, without << >> These characters are just making it harder to process the output programmatically. If the filename is the only thing in the line, the prefix for the version is anyway already same for list -all so it's clear.

But overall, I think it's a good feature, now that the examples like messageA and messageB exist.

fcorbelli commented 11 months ago

then give the matching filename, without << >> These characters are just making it harder to process the output programmatically

This makes searching for filenames, for debug reasons, much easier. > and | cannot be part of filenames, so they will stay

ghost commented 11 months ago

I'm really glad that you are going to add that detection in the next release. :+1: Then the description of zpaqfraz will fit better: "Deduplicating archiver with encryption and paranoid-level tests." :+1: