fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

A test with the files which actually collide #82

Open ghost opened 11 months ago

ghost commented 11 months ago

Attached is a zip file with two files which have the same sha1 but different sha256.

I'm trying to add the folder with them to the zpaqfranz archive, and I always see only one of them stored under the two names, not two of them:

C:\tmp\collision>unzip collision-example.zip
Archive:  collision-example.zip
   creating: baad/
 extracting: baad/messageA
 extracting: baad/messageB

C:\tmp\collision>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 208feafe1c6a95c73f662514ac48761f25e1f3b74922521a98d9ce287f4a2197

C:\tmp\collision>openssl dgst -sha1 baad\*
SHA1(baad\messageA)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0
SHA1(baad\messageB)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0

C:\tmp\collision>zpaqfranz.exe a baad.zpaqf baad
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
Creating baad.zpaqf at offset 0 + 0
Add 2023-10-02 03:39:47         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.842) = 1.842 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result

C:\tmp\collision>cd result

C:\tmp\collision\result>..\zpaqfranz.exe x ..\baad.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad.zpaqf:
1 versions, 3 files, 1.842 bytes (1.80 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T

0.031 seconds (000:00:00) (all OK)

C:\tmp\collision\result>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

Obviously the same content is stored under both names.

I've read that "additional checks" are on by default, so this is unexpected. Maybe an explicit switch is needed:

C:\tmp\collision\result>cd ..

C:\tmp\collision>zpaqfranz.exe a baad-2.zpaqf baad -sha256
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-sha256 -noconsole
Creating baad-2.zpaqf at offset 0 + 0
Add 2023-10-02 03:41:54         2              1.280 (   1.25 KB) 8T (1 dirs)
3 +added, 0 -removed.

0 + (1.280 -> 640 -> 1.982) = 1.982 @ 26.60 KB/s

0.047 seconds (000:00:00) (all OK)

C:\tmp\collision>mkdir result-256

C:\tmp\collision>cd result-256

C:\tmp\collision\result-256>..\zpaqfranz.exe x ..\baad-2.zpaqf
zpaqfranz v58.10o-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-10-01)
franz:-noconsole
../baad-2.zpaqf:
1 versions, 3 files, 1.982 bytes (1.94 KB)
Extract 1.280 bytes (1.25 KB) in 2 files (1 folders) / 8 T

0.032 seconds (000:00:00) (all OK)

C:\tmp\collision\result-256>openssl dgst -sha256 baad\*
SHA256(baad\messageA)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c
SHA256(baad\messageB)= 3ead211681cec93d265c8ac123dd062e105408cebf82fa6e2b126f4f40bcb88c

collision-example.zip

fcorbelli commented 11 months ago

In the attached experimental file you will see a zpaqfranz immune to SHA-1 collisions

BUT

no longer backward compatible: exp01.zip

ghost commented 11 months ago

Personally, I would not want to sacrifice backward compatibility, and it's hard to believe that it would be impossible to keep it:

Intuitively, to have "immunity" you just have to store the file or fragment for which you know that the two different checksums don't match -- which just means that you have to avoid "deduplication", nothing more. I can't imagine that the original format would not allow that.

fcorbelli commented 11 months ago

Personally, I would not want to sacrifice backward compatibility,

Neither do I šŸ˜„

and it's hard to believe that it would be impossible to keep it:

It is

Intuitively, to have "immunity" you just have to store the file or fragment for which you know that the two different checksums don't match -- which just means that you have to avoid "deduplication", nothing more. I can't imagine that the original format would not allow that.

You do NOT know that they don't match UNTIL the colliding files are already stored in the archive, and their fragments may already be referenced by another "legit" file

Therefore...

0) you cannot "grow" the space for the hash (20 bytes, that's all), nor use another one (just like EXP01)
1) you have to create another fake transaction logically deleting the "collided" files (wasting space)
2) you cannot (easily) turn off the deduplicator, unless you want to make a new hacked entry file (a lot of work). If the file is big, you "trash" a lot of space
3) you must READ AGAIN all colliding files, XORing (!!!) them to defeat the SHA-1 collision, store them again, and keep a new hacked entry file (!!!) to be XORed back when restored. The result is just the same (storing every byte, no deduplication at all)

A real hell of a job, and reaaaalllyyyy slow

EXP01 does not use SHA-1, but XXH3 (+ optional CRC-32) as its hasher. Everything else is the same

fcorbelli commented 11 months ago

The compatibility problem concerns the inability to store a different code for each fragment. For example, a CRC-32. In this case you would have a different equality condition than now

Today

if

SHA-1(fragment1)==SHA-1(fragment2) => fragment1==fragment2


There is no space, in the archive format, to store (for example) 4 more bytes for each fragment, to do

if

(SHA-1(fragment1)==SHA-1(fragment2))

AND

(CRC-32(fragment1)==CRC-32(fragment2))

=> fragment1==fragment2


Increasing the size from today's 20 bytes would make zpaq totally incompatible. At that point we might as well increase it to 32, or better yet to a variable (but fixed per fragment) length (a couple of bits to specify its length in bytes, e.g. 20, 32, 36, 64), and change the hash algorithm altogether (e.g. SHA-256, SHA-256 with CRC-32, SHA-256+SHA-3 (!))
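The stronger equality condition above can be sketched as a change of dedup key. This is a hypothetical illustration in Python, not the zpaq format itself: keying the fragment store on the (SHA-1, CRC-32) pair instead of SHA-1 alone means a SHA-1-only collision no longer merges two distinct fragments.

```python
import hashlib
import zlib

def dedup_key_sha1(fragment: bytes) -> bytes:
    # Today's zpaq rule: equality of the 20-byte SHA-1 implies equality.
    return hashlib.sha1(fragment).digest()

def dedup_key_sha1_crc32(fragment: bytes):
    # Hypothetical stronger rule: SHA-1 AND CRC-32 must both match.
    return (hashlib.sha1(fragment).digest(), zlib.crc32(fragment))

store = {}

def add_fragment(fragment: bytes, key_fn):
    key = key_fn(fragment)
    if key not in store:      # only fragments with a new key are stored
        store[key] = fragment
    return key
```

Under the combined key, the messageA/messageB fragments from this issue would get different keys (their CRC-32s differ), so both would be stored; under SHA-1 alone they share one key and one of them is lost.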

ghost commented 11 months ago

There is no space, in the archive format, to store (for example) 4 more bytes for each fragment, to do

You don't need to store more than 4 bytes for each fragment: you can store the "real" sha1 for all normal fragments and an adjusted sha1 value for the conflicting fragment. The conflict can be known during archive creation by maintaining an additional checksum, and that is what your program already does?

Edit: wait, not 4 -- sha1 is 20 bytes. So is the format already using 20 bytes for each fragment's sha1 or not? If it is, like I've said, then the "adjusted" value can be stored for the conflicting fragment.

ghost commented 11 months ago

is the format already using 20 bytes for each fragment's sha1 or not? If it does, like I've said then the "adjusted" value can be stored for the conflicting fragment. If the list of fragments that make the file already uses some other way to point to each fragment, then one doesn't have to "adjust" sha1 of that fragment anyway

fcorbelli commented 11 months ago

There is no space, in the archive format, to store (for example) 4 more bytes for each fragment, to do

You don't need to store more than 4 bytes for each fragment, you can store "real" sha1 for all normal fragments and you can store an adjusted sha1 value for the conflicting fragment. The conflict can be known during the process of the archive creation by maintaining additional checksum, and that is what your program already does?

No. zpaqfranz keeps a CRC-32 for every file, not for every fragment. There is no space to store anything in the fragments, because they are then clustered into blocks. I could maybe "steal" one byte from a block, but a block can be made of different fragments

fcorbelli commented 11 months ago

is the format already using 20 bytes for each fragment's sha1 or not? If it does, like I've said then the "adjusted" value can be stored for the conflicting fragment. If the list of fragments that make the file already uses some other way to point to each fragment, then one doesn't have to "adjust" sha1 of that fragment anyway

??? Yes, there is SHA-1 for each fragment

Cannot understand your suggestion

ghost commented 11 months ago

Let's say I want to extract one file from the stored archive. The file is e.g. 200 kb, and the archive 2000 MB. I can extract just that one file without reading all 2000 MB, correct? How do I do this? I get the file name, I find the stored list of fragments that corresponds to the stored file, and I read them in order, extract them and there is the extracted file. My question was how is the list of fragments stored? If the list of fragments of the file is a list of 20-bytes sha1s, then the conflicting fragment has to get a different sha1 value and be stored with that different value, and that different value also stored in the list of fragments that make a file. Then the extraction would still be able to extract the said file, even if its "real" sha1 is the same as some other fragment somewhere else.

ghost commented 11 months ago

The conflict detection is done in memory, but later it only matters how the file is stored. If the list of fragments is not a list of the sha1 values but simply a list of fragement IDs then it's even much simpler and much more compatible with the format. We then don't care what is the sha1 of the fragment, the only important information is the fragment ID, which has to appear in the file list.
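The indirection this comment describes can be sketched as a toy model (hypothetical names, not zpaq code): each fragment is stored once under an integer ID, and a file is just a list of IDs, so extraction never consults the fragment hash at all.

```python
# Toy model of fragment-ID indirection: fragments stored once, keyed
# by integer ID; each file is an ordered list of fragment IDs.
fragments = {1: b"longXonly", 2: b"shared", 3: b"tail"}
files = {"fileA": [1, 2], "fileB": [2, 3]}

def extract(name: str) -> bytes:
    # Extraction walks the ID list in order; no hash lookup needed.
    return b"".join(fragments[fid] for fid in files[name])
```

In this model the fragment's hash only matters at add time (for dedup), which is the point the commenter is making.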

fcorbelli commented 11 months ago

Wikipedia to the rescue! šŸ˜„ https://en.wikipedia.org/wiki/ZPAQ#Archive_format

Look at journaled format

fcorbelli commented 11 months ago

The conflict detection is done in memory, but later it only matters how the file is stored. If the list of fragments is not a list of the sha1 values but simply a list of fragement IDs then it's even much simpler and much more compatible with the format. We then don't care what is the sha1 of the fragment, the only important information is the fragment ID, which has to appear in the file list.

ahahaah (Homeric laugh) No. Because if the content of the fragment has a different SHA-1 from the one recorded for it, zpaq will notice and throw an error. You cannot store a "fake" fragment with whatever you want inside. The SHA-1 must match

fcorbelli commented 11 months ago

The fragment list is made of INTEGERS, nothing more (4 bytes each), BUT it is packed (compressed and "blocketized") together with the file entries (and other things). Therefore, when the add is concluded, and you can NOW detect a suspected collision, everything is already tightly packed and stored inside the archive. Changing it is not at all easy, not least because there is an entry for each initial block with the length offset (it is used to "jump" over blocks of data, in effect with a seek). A whole new transaction would then have to be added.

fcorbelli commented 11 months ago

A bit more http://mattmahoney.net/dc/zpaq206.pdf

ghost commented 11 months ago

Thanks for the original documents, but where do you store the additional info that is seen when zpaqfranz is used, like the additional checksums you have added? And I believed that you update your additional file checksums at the same time zpaq calculates the sha1 of each fragment? Do you do it differently?

fcorbelli commented 11 months ago

Thanks for the original documents, but where do you store additional info that is seen when zpaqfranz is used, like the additional checksums you have added? And I believed that you update your additional file checksums at the same time zpaq calculates sha1 of each fragment? Do you do it differently?

I hacked the attributes šŸ˜„

The key is this

Type i blocks describe edits to the central filename index as processed in order from the beginning of
the archive(...)
The attr string may be of any length. The meaning is defined if it has one of the following prefixes:
ā€œwā€ windows_attr[4]
ā€œuā€ unix_attr[2]

zpaq stores variable-sized attributes: on Windows 4 bytes, on Unix 2 bytes... in zpaqfranz 50, 76 or 550 bytes (+8). This is the "magic": zpaq, when dealing with a zpaqfranz archive, skips over the "injected" data, ignoring CRC-32s, hashes, extra dates etc

zpaqfranz does NOT compute its checksums from the fragments, BUT directly from the filesystem

zpaq(franz) reads a buffer (say 4 KB, for example) and works on it, deduplicating, aka fragmenting. zpaqfranz then updates a global file-wise hash (XXHASH, SHA-256 or whatever you want) and a global file-wise CRC-32. Finally it stores, as a hacked attribute, the global hash and global CRC-32 for every file
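That buffer-wise flow can be sketched roughly like this. A Python sketch of the idea, not zpaqfranz code; stdlib hashlib/zlib stand in for the XXHASH variants:

```python
import hashlib
import zlib

def file_checksums(path: str, bufsize: int = 4096):
    """Read a file in buffers, updating a whole-file SHA-256 and a
    whole-file CRC-32 incrementally, as described in the comment."""
    h = hashlib.sha256()
    crc = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            h.update(buf)                # global file-wise hash
            crc = zlib.crc32(buf, crc)   # global file-wise CRC-32
    return h.hexdigest(), crc
```

Both checksums come out identical to a one-shot computation over the whole file, which is why they can be maintained alongside the fragmenting pass at no extra read cost.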

During the 1st phase of test, zpaq decompresses every fragment [checking SHA-1] while zpaqfranz computes the CRC-32 of each fragment. In the 2nd phase (of zpaqfranz) the fragments are sorted and the global CRC-32 (for every file) is calculated FROM THE FRAGMENTS. Then, if the CRC-32 from the fragments matches the "global CRC-32", the file is good. Otherwise something is bad

This is possible because CRC-32s are "combinable": you can compute a global CRC-32 from chunks (there is a problem with holes, aka sequences of zeros not stored by zpaq, which rely on fseek to zero-fill the output file). Impossible for crypto-hashes, like MD5, SHA-1 or whatever
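The "combinable" property is easy to demonstrate with Python's zlib: the CRC-32 of a concatenation equals the CRC of the second chunk seeded with the CRC of the first, which is what lets per-fragment CRCs be folded into a whole-file CRC without re-reading the data.

```python
import zlib

a = b"first fragment "
b = b"second fragment"

# CRC-32 of the whole stream equals the CRC of the second chunk
# seeded with the CRC of the first chunk.
whole = zlib.crc32(a + b)
chained = zlib.crc32(b, zlib.crc32(a))
assert whole == chained

# No such public combine operation exists for cryptographic hashes
# like MD5 or SHA-1, which is the point made above.
```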

In short, not exactly super simple šŸ˜„

fcorbelli commented 11 months ago

And now the collision detector: zpaqfranz transforms each fragment list (aka: integer sequence) into an XXHASH value and inserts it into a map. If the hash value already exists, zpaqfranz does a binary compare (of the fragment lists). If they are == then it checks the file-level CRC-32s. If they are the same, good. If different => SUSPECTED collision šŸ˜ƒ

In fact I could also do a HASH check (if same fragments, but different SHA-256...). But that needs more code, because every file can have a different hash algo. Therefore I would need to check whether the algos are the same and, in that case, use the hash; if the algos differ, fall back to CRC-32 etc. But I am too lazy šŸ˜ø

All of these "things" happen AFTER, not DURING, the update.
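The detector described above can be sketched like so. A rough Python illustration, not zpaqfranz code; blake2b stands in for XXHASH, which is not in the standard library:

```python
import hashlib

# digest of fragment list -> (fragment_ids, whole-file CRC-32)
seen = {}

def check_file(fragment_ids, file_crc32):
    """Identical fragment lists with different whole-file CRC-32s
    flag a SUSPECTED SHA-1 collision, as described in the comment."""
    key = hashlib.blake2b(str(fragment_ids).encode()).digest()
    if key in seen:
        prev_ids, prev_crc = seen[key]
        # Binary compare of the lists, then the file-level CRC check.
        if prev_ids == fragment_ids and prev_crc != file_crc32:
            return "SUSPECTED collision"
    else:
        seen[key] = (fragment_ids, file_crc32)
    return "ok"
```

Two files made of the same fragment IDs should reconstruct to identical bytes; if their stored whole-file CRC-32s disagree, something (such as a SHA-1 collision during dedup) went wrong.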

ghost commented 11 months ago

OK. After I've checked the zpaq format description, to avoid saying something inconsistent, I hope I'm summarizing the whole discussion correctly then with:

For the collision detection only the crc32 hashes of the whole files are compared, and it is impossible to do more than only detect the problems on the level of whole files. The problem with the zpaq format is that one would need to store e.g. crc32 hashes for every fragment and not only once per file, to be able to reasonably efficiently detect the problem on the level of every fragment and then force storing the fragment which in the original zpaq would surely not be stored because of the sha1 collision. That additional info has to be stored so that it would not make problems to the original zpaq, and it doesn't seem there's "right" space for that in the format. So what you implemented (the detection on the level of the whole files), unless some solution which we aren't aware of exists, is the "best it can be done" as long as the format should remain compatible, which is anyway the goal of zpaqfranz (otherwise it should not have the name starting with zpaq).

So only detection is realistic, only on the level of the whole files, only if the additional checksums exist (which are, at least, default when zpaqfranz produces the archives).

I hope I haven't missed anything. Thanks.

fcorbelli commented 11 months ago

The reconstruction is quite accurate. I will add a couple of details. It is possible to create dummy data fragments, that is, fragments linked to dummy files. By dummy file I mean a file that is related to a Windows ADS, what I call in zpaqfranz a "virtual file". Translation: zpaq, by default, does not show ADS files. I use this behavior to store, within "dummy" files, additional information which the zpaq format cannot maintain (e.g., a list of files)


So far so good, BUT to maintain backward compatibility it is NOT enough to invent a method for storing data that zpaq does not "see"; rather (in this case) you need one that zpaq DOES see. That is, a file stored according to the classical rules, i.e., those of zpaq, not the virtual files of zpaqfranz, so that it can later be extracted by zpaq. Otherwise, to be concrete, only zpaqfranz would be able to extract a SHA-1 collision file (assuming I develop the whole system to have it stored "covertly", a hard issue indeed)

Does it sound complex? Yes, it is, very much so, otherwise I would have already done it years ago šŸ˜„

ghost commented 11 months ago

to go concrete, only zpaqfranz would be able to extract a SHA-1 collision file

If I wanted to solve the problem manually: if I wanted to store the file in any system which uses sha1, and I knew that the file would collide, I'd just prepend 32 bytes to the file's content. 8 bytes would be the string "sha1coll\0", 16 bytes would be a fixed uuid which I would generate only once for eternity, and the last 8 bytes would be a current timestamp. Then the content of the file would follow. As soon as I stored such a modified file, it would have a different sha1, so it would surely be stored and expanded back. After the extraction, I could easily recognize that it has these 32 extra bytes and remove them. I wouldn't even bother to implement automatic removal on extraction, as the whole scenario would happen extremely rarely anyway, and I would know that the archive still contains all the original bytes (just prefixed with 32 bytes more). In that way, storing the content of the file, to prevent complete loss of the data, is not too big a problem. I would also not care that zpaq would extract 32 bytes in front of the content of the file. I could always strip them if I need the original.
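The manual prefix idea can be sketched as follows. This is a hedged illustration, not an implemented tool: the tag is written as 8 bytes (without a trailing NUL) so the prefix totals the stated 32, and the UUID below is a placeholder for the fixed value the comment says would be generated once.

```python
import struct
import time
import uuid

TAG = b"sha1coll"  # 8-byte marker (hypothetical choice)
# Placeholder for the fixed, generated-once-for-eternity UUID.
COLL_UUID = uuid.UUID("00000000-0000-0000-0000-000000000000").bytes

def wrap(data: bytes) -> bytes:
    """Prepend the 32-byte marker: 8-byte tag + 16-byte UUID + 8-byte
    timestamp, then the original content."""
    ts = struct.pack(">Q", int(time.time()))
    return TAG + COLL_UUID + ts + data

def unwrap(data: bytes) -> bytes:
    """Strip the marker if present; otherwise return data unchanged."""
    if data[:8] == TAG and data[8:24] == COLL_UUID:
        return data[32:]
    return data
```

As the maintainer explains in the reply below this comment, such a whole-file prefix does not defeat zpaq's fragment-level dedup in general; the sketch only shows the mechanics of the proposal.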

And I would definitely not want the storage of such a modified file to become a new "version" of the archive made without my control. Imagine the archive is already big -- one more version could be too much. So, thinking about the subject more, leaving the handling of such a file independent of the archiving program is from my perspective "good enough"; it's only the detection of the data loss that is nice to have. And once detected, a manual intervention on the file, as described, or any other way, should also be good enough, because we don't expect collisions to happen by chance anyway, only as a product of human intervention.

I now think the detection is good enough. Thanks.

fcorbelli commented 11 months ago

If I'd want to solve a problem manually: if I'd want to store the file in any system which uses sha1, and if I knew that that file would collide, I'd just append in front of the file's content 32 bytes. 8 bytes would be a string "sha1coll\0" 16 bytes would be a fixed uuid which I would generate only once for eternity, and the last 8 bytes would be a current timestamp.

Well, no šŸ˜„ It's much more complex than that. The zpaq deduplicator does not operate on files, but rather on parts of files, on fragments, by means of an appropriate function that (I'll spare you the explanation) "breaks" individual files into pieces. So your method would not work. You have to operate with much more "brutal" systems, such as applying an XOR to each byte of the source file [to be sure it no longer collides], then XORing back on extraction (but we cannot: zpaq will not XOR the file back)

Then the content of the file would follow. As soon as I'd store such modified file, it would have a different sha1, it would be surely stored and expanded back. After the extraction, I could easily recognize that it has these 32-bytes extra and remove it.

Well no. No, because standard zpaq doesn't know anything about this new "tricky file". The user would get two different files after extraction, say messageB (wrong) and fixed_messageB

I wouldn't even bother to implement automatic removal on extraction, as the whole scenario would anyway happen extremely rarely, and I would know that the archive still contains all the original bytes (but prefixed with 32 bytes more). In that way, storing the content of the file, to prevent complete loss of the data, is not too big problem. I would also not care that the zpaq would extract 32 bytes in front of the content of the file. I could always strip them if I need the original.

My method, the one I described above, is the "right" version of yours. With additional transactions you can mark the wrong file as deleted (in version n+1), and therefore the "right" file (in version n+2) will be extracted by zpaq without having to change its name. At least, in theory šŸ˜ƒ

And I would definitely not want the action of storage of such modified file to be a new "version" of the archive made without my control. Imagine if the archive is already big -- one more version could be too much.

Not at all. A version is something very common with zpaq. In fact, it is THE reason to use it. The order of magnitude is a few KB more. It's very common to have 1,000 or 5,000 versions inside a zpaq archive

So thinking about the subject more, leaving the handling of such a file independently of the archiving program is from my perspective "good enough", it's only the detection of the data loss that is nice to have. And once detected, manual intervention on the file, as described, or any other way, should also be good enough because we anyway don't expect the collisions to happen by chance, only as a product of human intervention.

I now think the detection is good enough. Thanks.

I'm glad I've explained the "why" of the choices behind it. Just the method of detecting SHA-1 collisions alone is complex and took me a lot of work. Finally, there is THE PROBLEM, namely performance. Reading the list of files in a zpaq archive (the whole list, i.e., with the -all switch) can be long, very long, taking even minutes for very large archives. So checking that the foo.txt file, from 3 years ago, had the same SHA-1 fragments as the pluto.txt file, from today, is not at all easy, if you want it done in seconds. The first implementation of the detector, in fact, operated only on the latest version, not on all of them. The collision command, on the other hand, reads everything, but is clearly slower

fcorbelli commented 11 months ago

Short version: try this one

zpaqfranz a test.zpaq messageA messageB

(for now you have to work out manually that messageB is the wrong one) Re-add it to the archive, ordered (-stdout), forced (-force), not fragmented (-nodedup)

zpaqfranz a test.zpaq messageB -stdout -force -nodedup

Now extract messageB to messageB_ok (zpaqfranz) and _715 (standard zpaq) and check all SHA-256

zpaqfranz x test.zpaq messageB -to messageB_ok -space
zpaq64 x test.zpaq messageB -to messageB_715
zpaqfranz sum message* -sha256

I think I will try to implement it over the weekend. However, there remains the problem, which I don't know how to overcome, of the (at least theoretical) need to re-read all dt

ghost commented 11 months ago

So your method would not work.

You are right, it would not, if an "attacker" produced a new collision specially crafted to confuse zpaq: making sure that, even once the file has these 32 bytes added, zpaq produces the same later fragments and the collision happens only there. It seems to me that planning the "default protection" against such an attacker is then a waste of time, and is one more argument to keep detection only, without any automatic store.

fcorbelli commented 11 months ago

Please try my previous post

ruptotus commented 11 months ago

Hmm... does your discussion mean that I should stop using zpaq to back up my data? Or are collisions really rare in the real world?

fcorbelli commented 11 months ago

Hmm... does your discussion mean that I should stop using zpaq to back up my data? Or are collisions really rare in the real world?

They are much more than rare for normal files. They are purpose-built. Recall that zpaqfranz, with the t (test) command, already performs the SHA-1 collision test. So, if you want to be on the safe side, you can just run it

ghost commented 11 months ago

So I tried a simpler example than yours and I see that

 zpaqfranz a test baad/messageA baad/messageB  -nodedup

packed both files, in a way that now

 del /q baad\*
 zpaq x test

Unpacks the correct messageA and messageB. So in this case your -nodedup was enough, and it was compatible with zpaq? Can you explain how?

fcorbelli commented 11 months ago

Because you really want the deduplicator on. And you will not know that messageA and messageB collide until you do a full run (add)

Please check the attached (very rough) pre-release 58_11p.zip

zpaqfranz a test.zpaq messageA messageB
zpaqfranz x test.zpaq -to good\
zpaq64 x test.zpaq -to goodtoo\

Versus

zpaq64 a wrong.zpaq messageA messageB
zpaq64 x wrong.zpaq -to notgood\

fcorbelli commented 11 months ago

If you want other test files... collisions.zip

ghost commented 11 months ago

I think that as long as you do collision detection not at the level of a single fragment but at the level of the file, it would still be possible to confuse the detection you do now (over the whole file) with the "attacks through the fragments" that you also mentioned. Imagine:

fileX content:   longXonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentA
fileY content:   longYonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentB

Now the two files have different file-level checksums from the start anyway. From the file-level comparisons nothing suspicious can be detected. The collision and the data loss happen only at the fragment level. That's why I don't think it makes sense to plan automatic storage as the response to collisions when not all fragment collisions are detected anyway.
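The scenario can be simulated with a deliberately weak stand-in hash (a toy, since real SHA-1 collisions cannot be generated here): the whole-file contents differ, yet the tail fragments collide, so a fragment-level dedup silently merges them and one file is corrupted.

```python
def weak_hash(b: bytes) -> int:
    # Deliberately collidable 8-bit "hash", standing in for SHA-1.
    return sum(b) % 256

fragA, fragB = b"\x01\x02", b"\x02\x01"     # collide under weak_hash
fileX = [b"longXonlyFragment", fragA]
fileY = [b"longYonlyFragment", fragB]

store = {}

def add(frags):
    # Fragment-level dedup: an already-seen hash reuses the old bytes.
    return [store.setdefault(weak_hash(f), f) for f in frags]

restoredX = b"".join(add(fileX))
restoredY = b"".join(add(fileY))
# fileY's tail fragment is silently replaced by fileX's, even though
# the two files' whole-file checksums were different all along.
```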

fcorbelli commented 11 months ago

I think that as long as you don't do collision detection on the level of a single fragment but on the level of the file, it would be still possible to confuse your detection that you do now (over the whole file) with these "attacks through the fragments" that you also mentioned? Imagine:

fileX content:   longXonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentA
fileY content:   longYonlyFragment - ThePointWhereFragmentsAreSplit - theCollisionFragmentB

Now the two files have different file-level checksums from the start anyway. From the file-level comparisons nothing suspicious can be detected. The collision and the data loss happen only at the fragment level. That's why I don't think it makes sense to plan automatic storage as the response to collisions when not all fragment collisions are detected anyway.

I'm not very worried about this. If I have a SUSPECTED SHA-1 collision, via CRC-32, I store the file again, in one piece. If the file is very large, I will waste some space. Since I actually expect collisions to exist only in specially prepared files, I don't think it will happen in practice. Therefore the resulting archive will be larger than it would be with collision management at fragment level, but certainly every kind of "trick" will be bypassed

The problem would arise in the case of a double collision: same SHA-1 on the fragments, same CRC-32 on the entire file. In that case I could extend the check to the true hash, i.e. XXHASH (or SHA-256 or whatever you want). But it really seems like using an atomic bomb to kill a mosquito

ghost commented 11 months ago

If I have a SUSPECTED SHA-1 collision, via CRC-32, I store the file again, in one piece

My argument is that as soon as you assume that the attacker is attacking zpaq or zpaqfranz as a program, you can't then argue that your file-level check will work. You have to use a cryptographically correct solution, simply because of your attack model.

On the other side, the checks you have already implemented reduce the chance of accidental data loss (that's a different attack model, of the kind "I have everything in the archive -- oops, I don't, and there was no message about that"), which we know, by the very existence of these already constructed files, is clearly possible.

fcorbelli commented 11 months ago

zpaqfranz is not some kind of security software. "Attacking" zpaqfranz... why?

It is a program that is used locally, like 7z. What's the point of corrupting an archive? Could I upload multiple files with SHA-1 collisions to a server that uses zpaq for backups? Yes, and then? Nothing special happens. The backup is done; AT MOST the "rogue files" cannot be restored. The "regular" ones will not get "infected" (better: affected). You cannot stop zpaq's restore via SHA-1 collisions. That's all

I am therefore more concerned about a possible "real" collision, i.e. one relating to real-world data, than about an attack. This is why I implemented SHA-1 collision DETECTION a long time ago.
I do not really care about "attacks"

As mentioned, it is not possible to correct the collision issue except by losing backwards compatibility. If that were tolerated, as I demonstrated above with EXP01, I could do it in a few minutes. But it's not worth it

ghost commented 11 months ago

I think we can agree: my guess is that the file-level checks are enough to allow one not to lose a file even if it's intentionally constructed to have the same sha1 as another. We already have such files.

My argument was that as soon as one brings into the discussion a "second or later fragment" which would, by the zpaq fragment rules, have the same sha1, then it's about a different goal: not about handling what we already have -- files with that property (in collisions.zip) -- but about an imagined special attack that depends specifically on the zpaq fragment rules. And based on your description of how you do the checks, such an attack would still succeed. But I also say it's not worth spending time on it, as you say: ""Attacking" zpaqfranz... why?"

I also don't think it should be about more than files constructed "only" to have the same sha1. For such scenarios, my "fix" with a "prefix" would solve the storage. And I also suggested that any attempt to automatically store the second file (without the user doing anything) is, in that context, probably already too much.

I think we are almost agreeing.

fcorbelli commented 11 months ago

58_11q.zip The attached pre-release gets:

the -all switch extends detection to all versions (slower, but sure)

Adding -collision to the add command allows files to be extracted correctly even with zpaq, and not just zpaqfranz (at least in theory, I haven't done very extensive testing)

ghost commented 11 months ago

new command collision (aka: zpaqfranz collision 1.zpaq)

I assume the new command just tests for collisions without having to list all the files? Is it equivalent to l -collision -all ?

Is the collision test also off by default in "t"? I thought it's "cheap enough" compared to all the other "test" operations to always be on there (and doing "-all")?

Thanks.

fcorbelli commented 11 months ago

1) Yes. It is just a bit faster (not by much)
2) Yes, because it is already done there. It would be a duplicate

tansy commented 8 months ago

If it's almost broken, wouldn't it be better to use SHA-2 (256)? It's more unique as it's longer, and potentially faster, as Intel and AMD (and potentially other processor platforms) included it in hardware. The hash itself is bigger, true, but the test would be equivalent to comparing 4 64-bit integers instead of 3 for SHA-1, and even hashing isn't as slow as the compression itself. From my test it's as fast as gzip decompression.

I know it would be new format but maybe that's the way to go - to make a new format - better designed and possibly simpler where it can be simpler, and with better features and so on.

fcorbelli commented 8 months ago

zpaqfranz already uses HW-accelerated SHA-256 (if available)
The hash is longer, therefore it would tear the whole zpaq file format to pieces. So no, I'll keep SHA-1 with CRC-32 detection, more than enough

tansy commented 8 months ago

So you use two different (sha1 and sha2) hashes?

fcorbelli commented 8 months ago

For data deduplication SHA-1 is always used.
For SHA-1 collision detection CRC-32 is used too (on by default in zpaqfranz).
You can additionally store a HW-accelerated (if available) full-file SHA-256 (or even SHA-3):

zpaqfranz a z:\pippero *.exe -sha256
zpaqfranz l z:\pippero -checksum

but this will not "save" you against a SHA-1 collision.
In the next release I am implementing the ability to store arbitrary data, which would essentially also allow a global collision-checking system. But, frankly, it seems unnecessary to me, so I think I will abandon it. Too much effort, too slow, too much complexity for a problem that I don't have
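The SHA-1 + CRC-32 detection described above can be sketched like this (a hypothetical helper, not zpaqfranz code):

```python
import hashlib
import zlib

def sha1_collision_suspect(a: bytes, b: bytes) -> bool:
    # Sketch of the secondary check: if SHA-1 says the contents are "equal"
    # but a cheap CRC-32 over the same bytes disagrees, we have caught a
    # SHA-1 collision (as with messageA/messageB in this issue).
    same_sha1 = hashlib.sha1(a).digest() == hashlib.sha1(b).digest()
    same_crc = zlib.crc32(a) == zlib.crc32(b)
    return same_sha1 and not same_crc

print(sha1_collision_suspect(b"x", b"x"))  # identical data -> False
```

With the messageA/messageB pair from this issue it should return True: their SHA-1s match while their CRC-32s differ, which is exactly the property zpaqfranz exploits.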

tansy commented 8 months ago

Well, 160 bits of SHA-1 plus 32 bits of CRC-32 is 192 bits. Using SHA-256 only makes sense if it replaces SHA-1, as it's not faster (even accelerated it's just as fast) and formally* bigger than SHA-1. It could even be better and faster, as it would eliminate the additional CRC-32 computation.
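The bit count can be made concrete: concatenating the two checksums gives a 192-bit combined identity (a sketch; `dedup_key` is a hypothetical name, not zpaqfranz's layout):

```python
import hashlib
import struct
import zlib

def dedup_key(data: bytes) -> bytes:
    # 160-bit SHA-1 plus 32-bit CRC-32 = 192 bits of combined identity;
    # a crafted input must collide on BOTH at once to slip through.
    return hashlib.sha1(data).digest() + struct.pack("<I", zlib.crc32(data))

print(len(dedup_key(b"hello")) * 8)  # 192
```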

fcorbelli commented 8 months ago

It is impossible, because the 20-byte SHA-1 digests are stored inside a specific block type (the h-block) for every fragment, while CRC-32 and XXHASH/SHA-256/BLAKE3 (or whatever) are stored together with the file name (that is, full-file) inside the i-block type

You can disable CRC-32 computation (and hashing too) with the -nochecksum switch (or -715). The speed difference is almost zero, and then no SHA-1 collision detection is possible

The real limitation is the single-threaded deduplicator: a possible multithreaded development would make it faster, but less efficient

You can see here how hard it is to "hack" the file format: https://encode.su/threads/4178-hacking-zpaq-file-format-(!) And here, on truncating SHA-256 to 20 bytes: https://encode.su/threads/3605-Are-the-first-20-bytes-of-a-SHA256-safer-cryptographically-than-SHA1
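A rough data-model sketch of the h-block/i-block split described above (field names are mine; the real zpaq format is binary, see the encode.su thread):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HBlock:
    # Per-fragment 20-byte SHA-1 digests: baked into the zpaq format,
    # which is why the digest cannot simply grow to 32 bytes.
    fragment_sha1: List[bytes] = field(default_factory=list)

@dataclass
class IBlock:
    # Per-file metadata travels here, so zpaqfranz can append extras:
    filename: str = ""
    crc32: int = 0              # full-file CRC-32 (zpaqfranz extension)
    fullfile_hash: bytes = b""  # optional SHA-256/BLAKE3/etc. (-sha256)
```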

ghost commented 7 months ago

a problem with holes, aka sequence of zeros not stored by zpaq, that rely on fseek to zero-fill output file

I've just recently seen why the mentioned feature exists: extract the attached zpaq on Windows NTFS and look at "Properties / Size on disk" of the resulting file in Explorer. Really nice to know.

The mentioned zpaq file is inside this zip; it seems that GitHub rejects "unsupported" formats.
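The hole behavior being discussed can be reproduced in a few lines (a sketch; on NTFS or ext4 the skipped range typically becomes a real hole, so "Size on disk" ends up smaller than the logical size):

```python
import os
import tempfile

def extract_with_hole(path: str, zeros: int, tail: bytes = b"end") -> None:
    # Emulate zpaq-style extraction of a run of zeros: instead of writing
    # `zeros` zero bytes, seek past them and let the filesystem zero-fill.
    with open(path, "wb") as f:
        f.seek(zeros)   # nothing is written for the skipped range
        f.write(tail)   # logical size becomes zeros + len(tail)

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")
extract_with_hole(path, 1_000_000)
print(os.path.getsize(path))  # 1000003
```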

fcorbelli commented 7 months ago

There are other "hidden gems" as well.
For example, files that are difficult to compress (* according to the estimator implemented in zpaq) are not compressed at all for the first 4 methods. Method 5, however, still attempts to compress them.
This makes it possible to raise the compression level even for a mix of already-compressed and uncompressed files (e.g. JPGs mixed with text source code), etc.
There is an (initial) study in zpaqfranz on grouping files by type with different compression levels. But it is really complex, as there is a block "chopper" => suspended for now
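The "store incompressible data verbatim" idea can be sketched with zlib (the threshold and names are mine; zpaq's real estimator predicts compressibility up front rather than compressing and checking afterwards):

```python
import os
import zlib

def store_or_compress(data: bytes, threshold: float = 0.95):
    # Sketch of the "skip compression when it won't pay" idea: try to
    # compress; if the result barely shrinks (or grows), store verbatim.
    packed = zlib.compress(data, 6)
    if len(packed) >= len(data) * threshold:
        return "stored", data
    return "deflate", packed

print(store_or_compress(b"A" * 10_000)[0])       # redundant -> deflate
print(store_or_compress(os.urandom(10_000))[0])  # random -> stored
```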

Lennart00 commented 2 months ago

Hello,

The files from the collision example can be run through deduplication without data loss with the -fragment 0 parameter:

PS C:\Users\Lennart\Downloads\baad> zpaqfranzhw.exe -ssd -verbose -sha1 -hw -longpath -verify -m4 -fragment 0 a "./test.zpaq" messageA messageB
zpaqfranz v60.4c-JIT-GUI-L,HW BLAKE3,SHA1,SFX64 v55.1,(2024-07-13)
franz:-method                                   4
franz:-fragment                                 0
franz:-sha1 -hw -longpath -ssd -verbose -verify
WARNING converted RELATIVE to FULL path |messageA| => C:/Users/Lennart/Downloads/baad/messageA
WARNING converted RELATIVE to FULL path |messageB| => C:/Users/Lennart/Downloads/baad/messageB
INFO: getting Windows' long filenames
Integrity check type: SHA-1+CRC-32 + CRC-32 by fragments
Creating ./test.zpaq at offset 0 + 0
Add 2024-07-15 20:31:41         2              1.280 (   1.25 KB) 16T (0 dirs): -m46
MAX_FRAGMENT 8.128 (7.88 KB)
2 +added, 0 -removed.
                    0 starting size
                1.280 data to be added
                1.280 after deduplication
                2.231 after compression
                2.231 total size
Total speed 26.60 KB/s
IO buffer 1.048.576
====================================================================================================================
Do a verify()

./test.zpaq:
1 vers, 2 files, 3 frags, 3 blks, 2.231 bytes (2.18 KB)

Verify hashes of one version vs filesystem (multithreaded)
Total files 2 -> in 002 threads -> 2 to be checked
Scan done, preparing report...
--------------------------------------------------------------------------------------------------------------------
OK          SHA-1 : 00000002 of 00000002 (     1.25 KB hash check against file on disk)
--------------------------------------------------------------------------------------------------------------------
Total hashed bytes 1.280 @ negative B/s
no file errors tracked
Files  added +2
0.063 seconds (00:00:00) (all OK)

This works, I guess, by forcing a smaller (and therefore different) fragment size on the files for the zpaq algorithm, so it does not run into the prepared collision.
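That guess can be illustrated with a toy fragment-level dedup store (fixed-size fragments here for simplicity; zpaq uses content-defined boundaries, and all names are hypothetical):

```python
import hashlib

def add_file(store: dict, data: bytes, frag_size: int) -> list:
    # Split into fragments and deduplicate each by its SHA-1, zpaq-style.
    # Changing frag_size moves the boundaries, so the crafted colliding
    # blocks of messageA/messageB no longer hash as whole matching units.
    refs = []
    for i in range(0, len(data), frag_size):
        frag = data[i:i + frag_size]
        digest = hashlib.sha1(frag).digest()
        store.setdefault(digest, frag)  # later identical fragments are free
        refs.append(digest)
    return refs

store = {}
refs_a = add_file(store, b"hello world" * 100, 64)
refs_b = add_file(store, b"hello world" * 100, 64)
print(refs_a == refs_b)  # identical content -> same fragment references
```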

Lennart00 commented 2 months ago

test.zip contains the zpaq archive with both files, created with the -fragment 0 setting.