OV2 / RapidCRC-Unicode

Windows tool to quickly create and verify hash checksums
https://www.ov2.eu/programs/rapidcrc-unicode
GNU General Public License v2.0

Advise: support incremental hashing #176

Open ug802 opened 2 years ago

ug802 commented 2 years ago

I have a 4 TB disk with many small files.

When hashing, I wish the tool would compare against the old SFV file: check whether each file is already listed and its hash is correct, and append the hashes of new files to the SFV.

ug802 commented 2 years ago

My English is poor, so in short: when hashing old files, check whether they exist in the old SFV file and whether their hashes are correct; for new files, add their hashes to the SFV. That way, one hashing pass accomplishes both goals.
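The requested workflow can be sketched in a few lines. This is only an illustration of the idea, not part of RapidCRC; the function names are hypothetical, and the sketch uses CRC32 with the plain `name CRC32HEX` line format common to .sfv files.

```python
# Hypothetical sketch of the requested incremental workflow:
# verify files already listed in an old .sfv, and append CRC32
# entries for files that are new. Not part of RapidCRC.
import os
import zlib

def crc32_of(path, chunk=1 << 20):
    """CRC32 of a file, read in chunks so large files fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def incremental_sfv(root, sfv_path):
    # Load existing entries, one "relative/path CRC32HEX" per line.
    known = {}
    if os.path.exists(sfv_path):
        with open(sfv_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith(";"):  # ';' starts an SFV comment
                    name, _, crc = line.rpartition(" ")
                    known[name] = int(crc, 16)

    mismatches, added = [], []
    with open(sfv_path, "a", encoding="utf-8") as out:
        for dirpath, _, files in os.walk(root):
            for fn in files:
                full = os.path.join(dirpath, fn)
                if os.path.abspath(full) == os.path.abspath(sfv_path):
                    continue  # never hash the checksum file itself
                rel = os.path.relpath(full, root)
                crc = crc32_of(full)
                if rel in known:          # old file: verify against stored CRC
                    if crc != known[rel]:
                        mismatches.append(rel)
                else:                     # new file: append its checksum
                    out.write(f"{rel} {crc:08X}\n")
                    added.append(rel)
    return mismatches, added
```

Each run then verifies everything already listed and extends the SFV with whatever is new, which is exactly the "two goals in one pass" described above.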

Thunderbolt32 commented 1 year ago

See "Workaround" below, if you don't care about details.

  1. A checksum/hash file like *.sfv stores the file path and the checksum, but usually not the folder from which the checksums were calculated, nor whether specific files in that folder were cherry-picked. Users who cherry-pick the files included in a checksum file would be annoyed if RapidCRC calculated checksums for far more files than needed.
  2. Is a file "new" because its path was not seen before, or because its checksum was not seen before?
    • You can use the file path for identification and then detect checksum changes (e.g. with CRC32). (This is the usual approach.)
    • You can use a strong hash (e.g. BLAKE3) for identification and then detect file movements. (Identification by hash is used in content-addressable storage, as implemented by restic or kopia, but because of the birthday paradox it is only reasonable with strong hash algorithms, and it is still avoided on enterprise systems such as IBM's.)
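The two identification strategies can be contrasted with a small sketch. The data below is purely illustrative (made-up paths and hash strings); the point is only how "changed/deleted/added" versus "moved" fall out of the chosen key.

```python
# Illustrative data: each map goes from file path to (strong) hash.
old = {"docs/a.txt": "h1", "docs/b.txt": "h2"}
new = {"docs/a.txt": "h1", "archive/b.txt": "h2", "docs/c.txt": "h3"}

# 1) Identify by path: a different hash means content changed;
#    a vanished path looks like a deletion (moves show up as
#    delete + add).
changed = {p for p in old.keys() & new.keys() if old[p] != new[p]}
deleted = old.keys() - new.keys()
added = new.keys() - old.keys()

# 2) Identify by hash: the same hash under another path reveals a
#    move/rename. Only safe with strong hashes, since CRC32
#    collisions are likely at scale (birthday paradox).
old_by_hash = {h: p for p, h in old.items()}
moved = {old_by_hash[h]: p for p, h in new.items()
         if h in old_by_hash and old_by_hash[h] != p}
```

With this data, path-based identification reports `docs/b.txt` as deleted and `archive/b.txt` as added, while hash-based identification recognizes it as one moved file.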

Since "checksum plus file change-detection" can become complex, I suspect neither variant will be implemented. The only program I know of that searches for additional files (MultiPar, for the Parchive format) has problems dealing with lots of small files or with huge amounts of data.


| Checksum storage | Detects checksum mismatch? | Detects missing files? | Tells you which files lack a recognized checksum? | Still works with randomly renamed files? | Comment |
|---|---|---|---|---|---|
| central checksum file | ✔️ | ✔️ | | | usual way |
| decentral, in the file name (e.g. you always check all files of a folder) | ✔️ | | ✔️ | | also a common way / works as long as you preserve the checksum on rename (so avoid renaming with automatic tools) |
| decentral and sticky: NTFS streams (e.g. you always check all files of a folder) | ✔️ | | ✔️ | ✔️ | NTFS streams only survive as long as the files are moved/stored within NTFS volumes |

Note: The latter two decentral storage options are automatically recognized and checked whenever RapidCRC is not verifying a checksum file (i.e. when it is only calculating "new" checksums for files).

Workaround

You can calculate a new checksum file. Since it is only a text file, make sure the lines of the new and old checksum files are in the same order (if not, sort both files' lines alphabetically with a tool), and then a text-comparison program of your choice will show you the added, removed, and differing lines, i.e. the new, deleted, and mismatching files between the two.
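The sort-and-compare step can also be done programmatically instead of with a diff tool. This is a minimal sketch under the assumption that both files use the plain `name CRC32HEX` SFV line format; the function name is illustrative.

```python
# Minimal sketch of the workaround: parse two checksum files'
# lines and report new, deleted, and mismatching files. Keying
# by file name makes explicit sorting unnecessary.
def compare_sfv(old_lines, new_lines):
    def parse(lines):
        entries = {}
        for line in lines:
            line = line.strip()
            if line and not line.startswith(";"):  # ';' starts an SFV comment
                name, _, crc = line.rpartition(" ")
                entries[name] = crc
        return entries

    old, new = parse(old_lines), parse(new_lines)
    return {
        "new files": sorted(new.keys() - old.keys()),
        "deleted files": sorted(old.keys() - new.keys()),
        "mismatching": sorted(p for p in old.keys() & new.keys()
                              if old[p] != new[p]),
    }
```

This reports the same three categories a text-comparison program would show on the sorted files: added lines, removed lines, and lines whose checksum column changed.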

OV2 commented 1 year ago

Incremental hashing doesn't really fit that well into the concept of RapidCRC. I will most likely not add this.