Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Working on major Par2 changes. Name? #130

Open mdnahas opened 5 years ago

mdnahas commented 5 years ago

Hi everyone,

I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:

  1. Reed-Solomon encoding with James S. Plank's correction
  2. Tornado Codes by Luby

I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?

I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.

The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use a version of Linear Congruential Generator.
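
For illustration only, here's a minimal C++ sketch of what a spec-defined LCG could look like (the constants are the well-known MMIX ones, not anything chosen for Par yet; the spec would pin down the exact constants and output rule):

```cpp
#include <cstdint>

// Hypothetical sketch of a spec-defined LCG: the state advances as
// state = state * A + C (mod 2^64) and the high 32 bits are returned.
// The constants are Knuth's MMIX values, used here only as an example.
class SpecLcg {
public:
    explicit SpecLcg(uint64_t seed) : state_(seed) {}
    uint32_t next() {
        state_ = state_ * 6364136223846793005ULL + 1442695040888963407ULL;
        return static_cast<uint32_t>(state_ >> 32);  // high bits are better distributed
    }
private:
    uint64_t state_;
};
```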

The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?

When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.

Mike

mdnahas commented 4 years ago

Actually, I just remembered another use case: the GF2P8MULB instruction has a defined polynomial (performs GF8 multiplication with polynomial 0x11b).

I don't know whether GF8 multiplication can be used in GF16, but if there's ever any GF16 multiply instruction with a defined polynomial, it could be used.

Holy F**k! I just searched "instruction galois field" and found not only GF2P8MULB, but PCLMULQDQ. The second instruction does galois-field multiplication of two 64-bit values to get a 128-bit value.

This whitepaper tells how to use the instructions to do operations on GF(128) with the generator 0x10000111B. (Pages 12 thru 16.) It looks like it takes 21 operations to do the multiply. (4 PCLMULQDQ, 6 shifts, 11 XORs.)
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/carry-less-multiplication-instruction-in-gcm-mode-paper.pdf
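
For anyone who wants to experiment, the building block is a single intrinsic; here's a minimal sketch of the 64x64 -> 128-bit carry-less multiply (the reduction modulo the field polynomial, as described in the whitepaper, is omitted). Compile with -mpclmul or equivalent.

```cpp
#include <immintrin.h>  // _mm_clmulepi64_si128 (PCLMULQDQ)
#include <cstdint>

// Carry-less multiply of two 64-bit polynomials over GF(2), giving a
// 128-bit product. A real GF multiply would still need to reduce this
// result modulo the chosen field polynomial (see the Intel whitepaper).
static inline __m128i clmul_64x64(uint64_t a, uint64_t b) {
    __m128i va = _mm_set_epi64x(0, static_cast<int64_t>(a));
    __m128i vb = _mm_set_epi64x(0, static_cast<int64_t>(b));
    return _mm_clmulepi64_si128(va, vb, 0x00);  // multiply the low 64-bit halves
}
```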

So, I definitely think we should allow variously sized Galois Fields. I think clients will only be required to support a handful of (size,generator) pairs. Maybe (8, 0x11011B), (16, 0x1100B), and (128, 0x10000111B).

Also note that x86 has an instruction for computing CRC32C (though it's named CRC32), so there's another benefit of CRC32C over CRC32.

Agreed. CRC32C it is.

but I need the recovery data to be 64 byte aligned for efficient 512-bit SIMD.

Understood. I'll see what I can do. Maybe we can put the coefficient at the end of the packet. (NOTE: It looks like SSE instructions should be 16-byte aligned anyway, so this won't completely get rid of the problem.)

hash functions

Okay. I'll look deeper into BLAKE3 and KangarooTwelve. E.g., libraries for various languages, who is using them, licenses, size of output, etc. My default will be to go with KangarooTwelve, since it is derived from SHA3 and is likely to get more attention.

animetosho commented 4 years ago

ARM also has an 8-bit CLMUL instruction. The problem with the CLMUL instructions is that there's no particularly efficient way to do reduction - CLMUL can be used for the purpose, but I think it requires a minimum of 2 multiplies for well-chosen polynomials. This makes them somewhat unattractive from a performance standpoint, unless you must use 64/128-bit GF.
It may be possible to (ab)use CRC32 instructions for a (bit-reversed) GF32 reduction, but you'd lose efficiency with the multiply.

Alternatively, you could delay the reduction if you don't mind recovery slices consuming double the amount of RAM.

I'm not sure whether GF64 or GF128 is more efficient with a 64-bit CLMUL instruction. I think the former requires 6 multiplies per 128 bits, whilst the latter can be done in 5(?), but there's probably more other operations required for the latter.

Maybe we can put the coefficient at the end of the packet.

It makes much more logical sense at the beginning, I'd think, since if you're processing incrementally you'd need to know it before processing the data. But if the aim is to support multiple GF widths, the exponent may need to be variable width, which could make any alignment attempt moot? (assuming you're looking to support up to GF128)
Personally, I wouldn't put much weight on it, but mentioned it in case you're interested in trying...

NOTE: It looks like SSE instructions should be 16-byte aligned anyway, so this won't completely get rid of the problem

I'm probably misunderstanding you, but 64-byte aligned would also be 16-byte aligned?

mdnahas commented 4 years ago

Blake3 vs. KangarooTwelve. Code: Blake3, KangarooTwelve. Papers: Blake3, KangarooTwelve.

Single-threaded, it looks like their speed is about the same - both are 8 to 10 times faster than MD5 when using AVX-512 (the latest x86 SIMD extension). When run without SIMD on a Raspberry Pi, Blake3 is faster. Blake3 can use multiple threads more easily and gets crazy speeds when it does.

Both are in the public domain. (Parts of KangarooTwelve's code are licensed under a BSD or GPL license.)

Blake3's default implementation is Rust. There is official C code, but it is single-threaded. Its C code compiles with GCC or MSVC. Python's hashlib is adding Blake3. KangarooTwelve's reference implementation is Python, but it also comes in Rust and C. The C code compiles with GCC; MSVC support is experimental.

I didn't find any tools that were using Blake3 or KangarooTwelve. They're both pretty new.

Blake3's output is 32 bytes, twice the size of MD5's. KangarooTwelve's output can be any size.

I haven't yet tried to compile and run the code.

Conclusions: MD5 is "severely compromised", but I don't think our users relied on Par2 for security. We could keep MD5. Another option is a non-cryptographic hash; those are definitely faster, but their uniqueness properties are not well tested, so I think we can rule those out. Blake3 and KangarooTwelve both provide the uniqueness we need and are very fast.

I think either will work. Even the single-threaded non-SIMD implementations will be at least as fast as MD5. I'm mostly worried about correctness of their implementations and making sure they're available for a lot of languages / compilers. Blake3 isn't yet available in python; KangarooTwelve's MSVC support is experimental. I don't think either is a clear winner.

Does anyone else have a comment about keeping MD5 or trying one of these new faster hashes?

mdnahas commented 4 years ago

Sorry, did I write 64-byte alignment? I meant, 64-bit alignment. 8-byte. Strings would be padded with 0 to 7 bytes of '\0'.
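
In other words, a field of length len would be padded out to the next multiple of 8 bytes. A tiny sketch of that rule, as I understand it:

```cpp
#include <cstddef>

// Round a length up to the next 8-byte (64-bit) boundary and report how
// many '\0' pad bytes (0 through 7) that requires.
static size_t padded_length(size_t len) {
    return (len + 7) & ~static_cast<size_t>(7);
}

static size_t pad_bytes(size_t len) {
    return padded_length(len) - len;
}
```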

I like the idea of putting the coefficient at the end of the packet. You have to read the whole packet before operating on it anyway.

As for GF8 vs. GF16 vs. GF64 vs. GF128, I think we can experiment and see what we like. Perhaps we can choose factors to make things faster. E.g., only use factors with a single bit set to 1.

I couldn't find the ARM instruction, but I did find this library for galois field arithmetic. It includes special cases for the ARM NEON instruction set for GF8 and GF16. GFErasure on BitBucket

mdnahas commented 4 years ago

I got a message from the Blake3 creator. He said that Blake3 was only released a few weeks ago!

He also said there isn't a python version yet. There's "very preliminary" messages about getting it into python's hashlib.

Lastly, I asked him what the suggested method was for making a smaller-than-32-byte hash from the Blake3 output. He recommended truncating the result.

I like Blake3, but there currently isn't a python version. I think we should assume we'll use KangarooTwelve and, if Blake3 shows up with a python version, we'll reconsider.

(I also wrote the KangarooTwelve team and will write when I hear from them.)

animetosho commented 4 years ago

Cryptographic hashes provide security as well as integrity checking, so they're generally more useful than non-cryptographic hashes and hence have wider adoption. The latter are only useful if speed is a primary concern, but cryptographic hashes are often fast enough that this isn't much of a problem.

The new crypto hashes don't have wide adoption yet, but, to be honest, by the time anyone tries to implement a new PAR client, that'll probably change. Hence I'd recommend these more since they're likely to gain adoption, whilst non-crypto hashes will likely stay a niche.

SHA1 and SHA256 acceleration exists on some x86 and ARM CPUs. ARM has a proposed SHA3 extension, I believe. SHA2/3 without hardware acceleration is fairly slow however.
I've always been against keeping MD5 as it's old, slow, insecure, and likely will never get widespread acceleration or attract much future development interest.

You have to read the whole packet before operating on it anyway.

If you're willing to skip the hash check, you could in theory operate on parts of the packet, though I'm not sure any implementation would work that way.

I couldn't find the ARM instruction

VMULL.P8 (ARMv7) or PMULL (ARMv8). PMULL can also support 64-bit multiplies with some cryptographic extension, but I couldn't easily find it in ARM's documentation.

I did find this library for galois field arithmetic. It includes special cases for the ARM NEON instruction set for GF8 and GF16.

That's an interesting library - may be an alternative to GF-Complete.

The GF8/16 NEON implementations are pretty much a straight port of the SSSE3 code (including AND operations that aren't needed on NEON). It looks like they have ARM V/PMULL implementations for all field sizes.

mdnahas commented 4 years ago

The KangarooTwelve ("K12") author got back to me.

It is being used by a cryptocurrency, Aeon.

It's available in C, C++, Go, Ruby, Rust and Python.

It seems better tested and more mature than Blake3. I have heard more from the Blake3 author and he's working on a python package. Still, K12 seems like a better way to go. I'll try to download and compile their code soon.

mdnahas commented 4 years ago

I've got a rough design! I'll try to work out the details and post a draft in the next few weeks.

The idea is to have basically 3 layers. The first layer provides redundancy for the data. The second layer does data integrity: checksums for each input file's data and then a checksum-of-checksums for the whole set of input files. The third layer does metadata integrity: checksums for each file's metadata (filenames, directory names, (optional) permissions and (optional) symbolic links) and a checksum-of-checksums for all metadata and data.

The redundant data layer is based around a single virtual file. Each input data file has a mapping of its blocks onto the virtual file. The code matrix says how to compute parity blocks for the virtual file.

I haven't worked out all the details yet, but the packets will go something like:

  1. FundamentalPacket - It holds the Galois Field and block size.
  2. SinglePassHintPacket - It is used to support single-pass processing. It holds the name of an input file and the mapping of the input file's blocks to their location in the single virtual file.
  3. MatrixPacket - It determines the value of elements in the code matrix. There will be multiple ways to set the values. To support LDPC, the values can be explicitly listed in the packet. For the Cauchy Matrix and random sparse matrix, the packet will say how to calculate the values.
  4. VirtualFileBlockPacket - It holds a block of data in the single virtual file. (This packet is used when storing data in the Par3 file.)
  5. VirtualFileBlockChecksumPacket - It holds checksums for blocks in the single virtual file. (This packet is used when storing data outside the Par3 file.)
  6. FileDataChecksumPacket - It holds a checksum for the input file's data. It also holds a copy of the mapping of the input file's blocks to their location in the single virtual file.
  7. FileDataIntegrityPacket - It holds the checksum-of-checksums for all input files' data.
  8. FileMetadataPacket - It holds metadata, like if it is a file or directory, name, (optional) permissions, and (optional) if it is a symbolic link.
  9. FileMetadataIntegrityPacket - It holds a checksum-of-checksums for all input files' data and for all metadata.
  10. RecoveryBlockPacket - holds a block's worth of redundant data and what row of what code matrix was used to calculate it.
  11. CreatorPacket - same as in Par 2.0
  12. CommentPacket - same as in Par 2.0

I'm still working out the details, but I think this design will work.

I chose to have a single virtual file because it allows deduplication. If two input files share a lot of blocks in common, they can be mapped to the same place in the single virtual file. This approach also allows the user to protect only a portion of an input file, which is important if we want to stick Par 3 recovery data inside a ZIP file.
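
To illustrate the deduplication idea (this is not from any spec draft, just a sketch of what a creating client might do), identical blocks can share a slot in the single virtual file by keying on a block fingerprint:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch: assign each input block an index in the single
// virtual file, reusing the index when the same fingerprint is seen again.
// "BlockHash" stands in for whatever fingerprint hash the spec picks.
using BlockHash = std::string;

struct VirtualFileMapper {
    std::unordered_map<BlockHash, uint64_t> seen;  // fingerprint -> virtual index
    uint64_t next_index = 0;

    uint64_t map_block(const BlockHash& h) {
        auto it = seen.find(h);
        if (it != seen.end())
            return it->second;        // duplicate block: reuse its slot
        seen.emplace(h, next_index);
        return next_index++;          // new block: append to the virtual file
    }
};
```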

If a user wants to change a file's name, the only things that have to change are the FileMetadataPacket and the FileMetadataIntegrityPacket.

If a user wants to create a new archive that is a superset of an existing archive (for example, do an incremental backup), they should be able to do it by appending the new file data to the end of the single virtual file. It will require a second MatrixPacket. There are a few details I need to work out, but I'm pretty sure I can make it work. This functionality is also necessary for using Par 3 as a streaming protocol for forward error correction.

If we want to support compression, it isn't hard to add it to VirtualFileBlockPackets. There are some widely portable libraries, such as zlib, that we could use.

As far as metadata support, we should discuss that. I think we should definitely require support for empty directories. I think we should probably require support for read/write/executable bits and, maybe, symbolic links. Those basic permissions are common to all file systems now and they are commonly used in package distributions. I think we should not require anything that involves who owns the file - I'd be afraid of a security nightmare. That said, it would be nice to add an option to support all the metadata stored by the outdated "tar" file format.

I don't have all the details worked out yet, but I think this rough design should work. I like the single virtual file - it feels like the right direction. I'm currently looking at random number generators to use for the sparse random matrix. (PCG with a 128-bit state is the current leading candidate.) I also need to specify the algorithm for generating the sparse random matrix. I'd love people's input on:

animetosho commented 4 years ago

Thanks for keeping up with this!

I can't remember if I've asked this before, but I don't quite understand the two *IntegrityPacket packets. In PAR2, each packet has an embedded checksum, so wouldn't this already handle that use case?

I'm not sure how much you want to get into supporting archiving features. The main reason is that no-one has ever put input slice data into PAR2, even after all these years, so one does question whether anyone would bother implementing such features for future PAR. I don't mind it being in the standard as optionals, to enable it to compete more with formats like RAR I suppose.

Those basic permissions are common to all file systems now and they are commonly used in package distributions

...except anything FAT based, ISO9660 and a bunch of virtual file systems (e.g. s3fs). NTFS uses ACLs which is somewhat different from POSIX permissions, though you could perhaps map them in an odd way.
Still, many protocols/standards support either POSIX or Windows style permissions, so I don't see any problem with picking one, though keeping it optional is probably a good idea.
For the purposes of backups, I think supporting both makes sense (Windows user get ACLs, POSIX user gets POSIX permissions (and ACLs if their system is configured that way)), if you want to go that direction.

I think we should not require anything that involves who owns the file - I'd be afraid of a security nightmare

To me, if you're supporting permissions, you need to support ownership info as well. Permissions don't make sense without ownership info (and even less so with ACLs).
Ultimately, it's the client's job to do what it thinks best about permissions/ownership - it can always choose to ignore it if it wants to.

compression/encryption

If you want to potentially include these capabilities, you could include some support for 'data encoding' or 'filters' that get applied when storing/extracting data, and occur before redundancy is computed.

I'm not too sure supporting all archive features is necessarily a worthy goal. Just because it's supported doesn't mean it's supported well, e.g. I doubt future PAR will ever have good support for solid compression.

I also don't really like the idea of encryption as I don't think the format is well suited for security, for example, the use of hashes (potentially non-cryptographic) vs MACs. I suppose you could have it for some weak level of security if anyone sees any point in such.

Personally I feel compression is also out of scope (then again, I feel including a lot of the archiving features is being a little too ambitious), but if you have the itch, adding optional data encoding filters could enable support for it eventually.

Yutaka-Sawada commented 4 years ago

Hello, Mr. Nahas. I'm the developer of MultiPar. I will help you and support the PAR3 plan.

I chose to have a single virtual file

This is good. It's possible to cover many files using only a few blocks. In PAR2, relatively small files (smaller than the block size) caused poor efficiency.

But I'm afraid the mapping method needs to be reasonably smart. I remember a problem in ICE ECC a while ago: a missing small file caused the loss of 2 blocks when the file straddled the boundary between 2 blocks. That problem happened because all input files were simply appended to construct a virtual source file. If possible, it's good to avoid mapping small files across block boundaries.

Also, searching for input file slices (in a damaged file) may become slightly more difficult when slice sizes vary. In PAR2, a file is split into "block size" and "remainder size" pieces. The last bytes of a file were hard to search for. (It was possible, but slow, because the remainder size differs per file.) Bad mapping may split a file into "start size", "block size", and "remainder size" pieces. More small slices are harder and slower to search for. You may have considered such problems already.

mdnahas commented 4 years ago

@animetosho The Integrity packets are there to (1) make sure you're not missing any FileIDs and (2) combine everything into a single checksum. Those roles were both done by the MainPacket in Par2.

As for supporting all archiving features, I think Par3 should do ECC and splitting well. Beyond those, it should have basic support for the other features. I don't think we can do compression or encryption as well as other specialized programs. I don't think we want to support metadata for every file system. I'm an old Unix hacker, so I expect to use pipes: "tar | compress | par3".

As for basic metadata support, I don't think I want to get into ownership - there are too many security issues when changing ownership of a file. I think supporting read, write, executable and a last-modified time/date is appropriate. I'd like people to use Par3 for distributing packages and I've seen package directories use those features. Yes, some use ownership or groups, but I think those are a small minority.

"no-one has ever put input slice data into PAR2, even after all these years" I don't think anyone has used any of the optional features of the PAR2 standard. I think most people are too scared to use something when you're not sure if the client on the other end will understand it. It is a simple packet type and not hard to implement. I think it will be good to make it required.

@Yutaka-Sawada I will think about how tightly we want to pack input files into the single virtual file. I had assumed that every file's start would be aligned on a block boundary in the single virtual file. However, I can see the benefit of packing them tighter. I'll have to think about it more.

animetosho commented 4 years ago

The Integrity packets are there to (1) make sure you're not missing any FileIDs and (2) combine everything into a single checksum.

Ah, I see, there's multiple non-integrity packets, so that just joins them all together in a way.

I'd like people to use Par3 for distributing packages

I haven't seen anyone do this, and I'm not sure if there's much merit in it either. On the internet, typically people just re-download a package if, in the very rare case, the download becomes corrupt (and the transport protocol didn't automatically fix it).

I currently see only two primary use cases for PAR:

In the first case, you really don't care about permissions at all. In the second, you'd probably want ownership info along with permissions, if you have either.

For package distribution, I agree that having permissions is useful, where ownership info isn't necessary. I just don't know if that'll ever occur...

As for security concerns, I don't see this as a problem with the standard - it's the client's job to take care of that.

I don't think anyone has used any of the optional features of the PAR2 standard. I think most people are too scared to use something when you're not sure if the client on the other end will understand it

Actually, no PAR2 client supports it (just did a search across GitHub code for "0FileSlic", "FileSlic" and "0RFSC", but found no mention of them in anything other than spec documents; correct me if you know of any client which supports them though). So it's not a case of users being too scared, they don't even have the ability to use it.

As an author of a PAR2 client, my thoughts on the Input File Slice and Recovery File Slice Checksum packets are that the specification designers perhaps wanted such capabilities, in case they ever proved useful, but couldn't think of any actual use case for it, hence making it optional.
Many years later, I still can't think of any realistic use case for it, hence I never bothered adding support. I don't think any other PAR2 client author disagrees, considering its universal lack of support.

I think it will be good to make it required.

Take note that what is written in a specification document may be different to what is actually implemented in applications. In many cases, it's the latter which dictates the true specification.

For example, the PAR2 specification states that part files should have the extension ".vol{START}-{END}", however, PAR2 clients don't do this, instead preferring ".vol{START}+{BLOCKS}". Effectively, the latter is the PAR2 standard, regardless of what the specification says.

If no PAR3 create client supports embedding input data into the Parchive, effectively the feature won't exist even if the specification requires recovery clients to support it.

mdnahas commented 4 years ago

@animetosho If something is missing a feature you need, you don't use it. Par1 was not well suited for backups --- Par2 was and people used it.

animetosho commented 4 years ago

Sure, I don't mind the features if you feel inclined to add them, I was just trying to point out that I don't see it being used.

I definitely don't think supporting embedded files should be mandatory however. Optional features do get used if they're useful.

Yutaka-Sawada commented 4 years ago

The idea of a "single virtual file" may be adaptable to the FileDataChecksumPacket, too. In PAR2, 1000 input files make 1000 "Input File Slice Checksum packets" of varied sizes. A big input file makes a big checksum packet, and its chance of being damaged is higher than for a small packet. A packet of 100 KB has a higher damage risk than a packet of 1 KB. To cope with the loss of a packet, PAR2 duplicates packets many (~15) times. I felt that such packet repetition might be overkill. When someone creates a PAR2 file with 10% redundancy, the expected damage rate of the input files is 10%, so 1500% redundancy (duplicating 15 times) for the PAR2 packets is too much. (15 copies of 1000 packets becomes 15,000 packets.)

When a single virtual checksum block is made by packing the checksums of all input files, you can split that virtual data into pieces of whatever size you like and put them in a smaller number of packets. Packets of the same size have the same damage probability, and it's possible to make parity packets for them. For example, you split the original virtual data into 32 pieces and create 224 parity pieces (700% redundancy) with an 8-bit Reed-Solomon code. When a PAR3 file contains 256 checksum packets (32 original packets and 224 parity packets), a PAR3 client can reconstruct the original virtual checksum data from any 32 packets available in a damaged PAR3 file. Less redundancy, such as 64 original packets and 192 parity packets, is possible too. If the virtual checksum data becomes very large because of big input files, it's good to split it into more pieces with a 16-bit Reed-Solomon code. This idea of parity packets would beat simple duplicated packets in efficiency and damage protection. (However, it may be worse in speed and requires a more complex initial process.)

Yutaka-Sawada commented 4 years ago

Recently I tried to implement single-pass creation for PAR2, as a user with a slow HDD requested it. Single-pass means calculating the hashes of files/blocks and creating recovery blocks at the same time. When RAM is large enough to hold all recovery blocks, I could implement the single-pass system. When the recovery data is larger than RAM, it seems to be impossible. While block hashes can be calculated by keeping each hash state in RAM, the file hash is difficult. Because a file hash requires sequential file access, processing bytes from start to end, I could not implement file hashing in a skipping-read mode.

So PAR3 should not contain a file hash if it is to support single-pass creation. When there are hashes of all blocks, an additional hash of the whole file isn't so important. For example, current PAR2 contains MD5s of both blocks and files. When the MD5 of a block fails to detect an error in the first block, the file's MD5 fails to detect the error, too. Using the same hash algorithm for both block and file is useless.

Alternatively, the file hash could use a special hash algorithm that supports parallel computation, such as MD6. Or the file hash could be a simple checksum (such as CRC-32 or CRC-64) whose pieces can be joined later. For example, if the block hash uses a set of CRC-32 plus some strong hash, a file hash isn't required, or a CRC-64 is enough. The file hash would act as a last check, for when the block hashes fail to detect an error. But the hash algorithm should support parallel processing.
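
For what it's worth, CRC-32 does have that "join later" property: zlib's crc32_combine() can merge the CRCs of consecutive pieces, so per-block CRCs could be folded into a whole-file CRC without re-reading the data. A small sketch, assuming zlib is available:

```cpp
#include <zlib.h>

// Combine per-piece CRC-32s into the CRC-32 of the concatenated data.
// crcs[i] and lens[i] describe piece i, in file order.
static uLong combine_crcs(const uLong* crcs, const z_off_t* lens, int n) {
    if (n == 0) return crc32(0L, Z_NULL, 0);   // CRC of empty input
    uLong total = crcs[0];
    for (int i = 1; i < n; ++i)
        total = crc32_combine(total, crcs[i], lens[i]);
    return total;
}
```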

animetosho commented 4 years ago

Well, I think single-pass creation is only possible if all the recovery data can fit in memory. Otherwise, with or without a file hash, you can't do a single-pass create.

Having said that, I do think the file hash is rather redundant, and am not sure what the purpose of it was (maybe just as a second thing to check?). I suppose it could be useful for other purposes (like if you want to quickly know the MD5 of a file), but ultimately doesn't seem worth it.

Another alternative may be to concatenate all the slice MD5s, and compute the MD5 of that. I still prefer something like CRC32 though.

When recovery data is larger than RAM size, it seems to be impossible

The most you can do is merge the hash pass with the first create pass. Doesn't help too much, particularly if you need many passes, but can help a little.

mdnahas commented 3 years ago

Sorry for the delay. I've been volunteering on other things since March 2020. :wink:

I've started working on Par3 again. And it's probably worth a recap, for everyone to read and for me to refresh my thoughts.

The big goals are:

Minor goals include:

I looked at the whole problem of backing up files. There seemed to be these layers/functions:

I feel that Par3 should consciously avoid encryption. It's hard to do properly. That's best left to an external program.

I also feel that Par3 should try to avoid (serious) compression. Different users will make different speed/storage tradeoffs and may even use specific compression programs (e.g., text compression). I'm not against including ZLIB, but I think it is better to leave compression to an external program.

If you accept that encryption and compression are best left to external programs, we end up with 2 different groups of functions. One is the archiver, checksummer, and grouper. The other is the "ecc-er" and "splitter". So I'm strongly considering splitting Parchive into 2 different file formats, one to handle each group of functions.

I was/am pretty close to a design for the "ecc-er"+"splitter".

I was hoping that I could reuse an existing file format to handle "archiver"+"checksummer"+"grouper", but none that I looked at was any good. And I looked at close to 20 of them! "tar" is crufty and it is missing checksums! "ZIP" doesn't do file permissions well and doesn't have a checksum for all the files together. I even looked into using a filesystem format, but none was good. (But ReiserFS was close!) So, I've started playing with ideas for a very very simple program that does these tasks.

I have run into a difficulty. Par's classic usage is to send the input blocks in their original files ("input1.txt", "input2.txt", ...) and pack the recovery data into ".par2" files. If Par3 is split into two different file formats, the lower-level format ("ecc-er"+"splitter") will not have the list of files. The list of files will be in the upper-level format ("archiver"+"checksummer"+"grouper"). So, if a user runs "par3 repair *.par3", how will the client know which files to look in for the input data?

There are a few solutions. One is to have the user specify the files or a directory with the files. The Par3 client can then search the files for blocks. This might work. (It might also run for a very long time if the user screws up.)

Another solution is to allow some transparency, where the lower-level can "peek" at the list of files in the upper-level. This would only work if the user used our upper-level file format and didn't use encryption or compression. That's pretty restrictive.

The last solution is to write the filenames into the lower-level format. But then we're storing the name of the file twice (that is, once in each layer). I generally try to avoid duplicating data in a file format, because the values could become inconsistent.

Thoughts?

Yutaka-Sawada commented 3 years ago

Welcome back, Michael Nahas. I'm glad to hear that you're making progress on the PAR3 project. Though mathematical theory is hard for me to understand, I'll assist as much as I can.

So, if a user runs "par3 repair *.par3", how will the client know which files to look in for the input data?

There must be a file list, which tells which file each block belongs to. You posted 3 plans for the file list: (A) a user-specified file list, (B) a file list in the upper-level format, (C) a duplicated file list in the lower-level format.

Another solution is to allow some transparency, where the lower-level can "peek" at the list of files in the upper-level.

This plan (B) would be simple and easy to implement. Even though there are two sets of functionality, the lower-level format ("ecc-er" and "splitter") cannot repair files without a file list. So it's natural to use the list of files in the upper-level format (archiver, checksummer, and grouper).

Plan (A) creates another problem: how to specify the files. Plan (C) is basically the same as partially combining the lower-level and upper-level formats.

animetosho commented 3 years ago

Thanks for the update/summary.

The Par3 client can then search the files for blocks. This might work.

I think some applications already do this. I know some Usenet posters exploit this by randomly renaming files, then relying on the PAR2 to rename them back.
I don't particularly like relying on this behaviour though.

Another solution is to allow some transparency, where the lower-level can "peek" at the list of files in the upper-level

If all files have to be concatenated (or grouped, as you put it) before ECC is applied, this sounds sensible. It's similar to how torrents deal with pieces, since all source files are effectively concatenated before being broken into pieces.

This would only work if the user used our upper-level file format and didn't use encryption or compression. That's pretty restrictive.

It's one reason why I don't fully subscribe to the notion of strict layering. But if you have to, you could treat the metadata as a separate data stream that doesn't go through the same processes as the file data.

But then we're storing the name of the file twice

This sounds bad to me. At least use some shared data, which I guess makes it basically the same as the suggestion above.

LunsTee commented 3 years ago

I'm happy to see renewed interest in par2[3]. I've been using par2 for some time, and while it's met the core of my needs, there are a few details on the fringes that I've been working around. These are a product of my current usage being different from what I think the tool was originally imagined for. I don't know how my suggestions might nest into the current goals list, which seems to be at a different level, but I hope this is a good place for me to make them.

I use par2 to guard against bitrot on USB sticks. In particular, I have directories filled with JPGs or MP3s, where I keep par2 files of the content. I imagine this is a common use nowadays. The data remains accessible as-is without having extra steps like extracting from a .tar file that things have been saved into. Every now and then, I verify with the par2 files, and sometimes find errors to repair. If the errors aren't too extensive, I'll leave things where they are after repair rather than move to new media, but I find myself cleaning up a few things manually.

First, I like my files to maintain their original time/date stamp, but any recovered files reflect when they were created for recovery. I take care of this manually with a powershell script to copy time/date stamps, but it'd be nice to have an option on the recovery tool to do this when creating recovered files.

The next step after recovery is usually to delete the corrupted files, and then move repaired files out of and back into the directory to restore (FAT) file order. I almost want to suggest that repairs be done directly to the corrupted files rather than creating new recovered files, but I recognize the risk in that if things go awry and the .exe crashes before restoring time/date stamps, the stamps would be lost. So I'm unsure if I would actually use things this way, but it might be a nice option.

Neither of the above is a big deal and I can continue with my workarounds if need be. My last suggestion however would be more appreciated.

The tool as it exists doesn't repair corruption of the parity files, only the data files. This was reasonable for what I understand the original use case to be, of delivering the data over a channel where some blocks may get lost/corrupted along the way. After reconstructing the data, the parity data is considered to have served its purpose and is discarded. However, for storage bitrot, the 'received' (corrected) data is in turn 'transmitted' (stored) to a future user, again needing protection. This would then require generating new parity data again if the existing .par2 files were discarded/corrupted. In the case of really flaky media, we're now vulnerable to any corruption of the data that happens between repair and creation of new parity data.

It would be nice if phpar2 could have an option to repair any errors found not just in the payload but also in the par2 files, rather than having to basically generate the par2 files again from scratch.

What I sometimes do is make parity files of parity files. This allows for direct repairs of the primary parity data, and re-creating the secondary parity data goes faster than re-creating primary parity. This isn't as robust as having an equivalent total amount of just primary parity data, and feels a little silly, but seems to work.

TL/DR: Suggest three options: 1) Preserve time/date stamps on repaired files 2) Repair damaged files in-place instead of copy/re-create (and again preserve time/date stamps) 3) Repair damaged recovery blocks in addition to repairing data payload

animetosho commented 3 years ago

Thanks for the suggestions. Note that this topic is mostly about the design of a new PAR format, and not focused on actual PAR2 clients or implementations.

Preserve time/date stamps on repaired files
I like my files to maintain their original time/date stamp, but any recovered files reflect when they were created for recovery

Are you suggesting the modification date of the damaged file be copied across to the repaired file, or the modification date be saved to the PAR file during creation, which can later be used when repairing the file?
The former is a change to the client, whilst the latter does require a format change.

Repair damaged files in-place instead of copy/re-create (and again preserve time/date stamps)

I personally think this is a good idea, but this is an implementation detail of the client, not the format.

Repair damaged recovery blocks in addition to repairing data payload

As you've pointed out, the only real way to do this is to create a PAR2 of the recovery files. PAR2s don't really contain any mechanism to repair themselves (other than critical packet duplication), and I don't think it makes much sense to have such in the format.

You could, of course, have a client that automatically creates PAR2s of the PAR2s (or just write your own script to do this) if you don't want to have to re-create the PAR2s from the source files.
A client could also choose to copy good recovery slices from an existing PAR2 whilst re-computing a new PAR2 to save some computation time. I think it's a good idea, but again, it's an implementation detail, not an issue with the format itself.

It would be nice if phpar2 could have an option for also repairing any errors found in not just the payload

I should probably mention that phpar2 is a fork of the old version of par2cmdline, and hasn't been updated to newer versions for a long time. I can't speak of the author's intentions, but even if the features you request land in par2cmdline, they may never turn up in phpar2.

As for requesting features, I'd suggest creating issues for them instead of appending to this thread as it's not about the PAR format itself.

mdnahas commented 3 years ago

I think I have a rough design.

I've decided that we want to list the filenames in the lower-level format. It keeps the design simple and keeps the design similar to that of Par2.

The rough design:

A Par3 file contains one or more Par3 streams. Each Par3 stream will be identified by a Stream Identifier. The Stream Identifier is part of every packet and is similar to the "Recovery Set ID" of Par2. The Stream Identifier is any globally unique 16-byte value. (We can generate globally unique 16-byte identifiers by using a hash of the computer identifier (e.g., IP address), process identifier, and a high-resolution timestamp.)
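
A sketch of how a client might generate such an identifier (the hashing here is only a placeholder; nothing about it is specified yet, and a real client would likely use the spec's fingerprint hash instead of std::hash):

```cpp
#include <array>
#include <chrono>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <unistd.h>   // gethostname(), getpid() on POSIX

// Hypothetical sketch: derive a 16-byte Stream Identifier from the host
// name, process id and a high-resolution timestamp. std::hash is only a
// stand-in for whatever hash the specification ends up naming.
static std::array<uint8_t, 16> make_stream_id() {
    char host[256] = {0};
    gethostname(host, sizeof(host) - 1);

    std::ostringstream seed;
    seed << host << ':' << getpid() << ':'
         << std::chrono::high_resolution_clock::now().time_since_epoch().count();

    std::array<uint8_t, 16> id{};
    const std::string s = seed.str();
    for (size_t i = 0; i < id.size(); ++i)
        id[i] = static_cast<uint8_t>(std::hash<std::string>{}(s + std::to_string(i)));
    return id;
}
```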

A Par3 stream can be used in either a file-recovery context or a streaming context. When used for file recovery, it contains File packets, Directory packets, Symbolic Link packets and a single Root packet. When used for streaming, the stream will contain Data packets and a single End-of-Stream packet.

Par2 kept vital data safe by repeating packets. Par3 will keep that data safe by having a Par file contain a second stream. The second stream can be used to repair the data from the first stream. This is Par-inside-a-Par-file, just like I hope to support Par-inside-a-ZIP-file, to have archives that can repair themselves. Note: Some very small amount of data will have to be repeated, but probably less than 4kB.

Par3, when used in a file-recovery context, supports file permissions, symbolic links, and hard links. Rather than store filenames in each File Descriptor Packet, there are new Directory packets that store filenames and a Root packet, which identifies the root of the directory tree. For file permissions, we'll store a set of generic permissions that are supported by all filesystems/OSes and one set of filesystem/OS-specific permissions, which will be used if the data is recovered on the same system that the Par3 stream was generated on.

Par3 supports appending. In a streaming context, it means the client can start a new stream that is appended to the end of an existing stream. In a file-recovery context, it is essentially an incremental backup. How this works will be much clearer after I describe the packets. I'll include a discussion at the end of this post.

Par3 will support any linear code, which means it supports any matrix for generating the recovery data. All input blocks will be mapped to a "single virtual file". The blocks of the single virtual file are multiplied by the matrix to generate the recovery data. The matrix needs a column for each block of the single virtual file and one row for each block of recovery data.
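
To make the linear-code view concrete: each recovery block is one matrix row dotted with the virtual file's blocks under Galois-field arithmetic, where addition is XOR. A minimal sketch (the GF(2^8) multiply and its 0x11D polynomial are just examples; the real field width and generator would come from the Start packet, and a real client would use tables or SIMD):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Example GF(2^8) multiply (shift-and-add, reducing by x^8+x^4+x^3+x^2+1).
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        bool carry = (a & 0x80) != 0;
        a <<= 1;
        if (carry) a ^= 0x1D;
        b >>= 1;
    }
    return p;
}

// One recovery block: out[k] = XOR over j of coeff[j] * block[j][k],
// i.e. one row of the code matrix applied to the virtual file's blocks.
static std::vector<uint8_t> compute_recovery_block(
    const std::vector<uint8_t>& row_coeffs,
    const std::vector<std::vector<uint8_t>>& input_blocks,
    size_t block_size) {
    std::vector<uint8_t> out(block_size, 0);
    for (size_t j = 0; j < input_blocks.size(); ++j)
        for (size_t k = 0; k < block_size; ++k)
            out[k] ^= gf_mul(row_coeffs[j], input_blocks[j][k]);
    return out;
}
```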

The set of packets would be:

Creator packet This is similar to the creator packet in Par2. It identifies which client created the file. I think, for Par3, we should also require: the version of the client, the command-line options used when invoking the client, and how to contact the client's author if there's a problem.

Start packet Has the block size, Galois field size, and the Galois field generator. If this Par3 stream is meant to append to the end of a previous Par3 stream, this packet includes the Stream Identifier of the preceding stream.

Matrix packet A matrix packet specifies the coefficients in one or more rows of the matrix used to generate the recovery data. There will be multiple types of matrix packets, one for each way of generating the coefficients. So, for Reed-Solomon encoding, there will be a Cauchy Matrix packet. For LDPC, there will be a matrix packet that explicitly lists each coefficient and its location. For sparse random matrices, there will be a random matrix packet. (That will require also defining a random number generator.)

File packet This packet represents a file. It does not contain the filename, which is stored elsewhere. This packet maps the blocks of the file to blocks of the "single virtual file". If two files have overlapping content, their File packets can map their contents to the same blocks in the "single virtual file". This packet also contains some general permissions for the file (read-only bit, executable bit, creation timestamp, and last modification timestamp). It also has space for OS-specific file permissions (owner, xattr, etc.).

Symbolic Link packet This packet represents a symbolic link. It contains a string representing the path to the file/directory. It may hold general file permissions and OS-specific file permissions.

Directory packet This packet represents a directory. It maps strings (a.k.a. file names or directory names) to files, symbolic links, or other directories. This packet also holds general and OS-specific permissions for the directory. Files might appear in multiple directories, to represent hard links.

Root packet This packet identifies the root directory. If the recovery set is only a single file, this packet will point to the file. This packet contains a bit saying if the directory is an absolute path or a relative path. This packet essentially identifies the recovery set for a Par3 stream.

Checksum packets This packet contains checksums for blocks of the "single virtual file". The checksums include a rolling checksum (e.g., CRC32) and a fingerprint checksum (e.g., cryptographic hash). It performs a role similar to the Input File Slice Checksum packet of Par2.

Recovery packet This packet is similar to the Recovery Slice Packet of Par2. It contains one block of recovery data and the index of the row of the matrix used to generate the data.

End-of-stream packet If Par3 is used in a streaming context, rather than file recovery, this packet is used instead of the File, Symbolic Link, Directory, and Root packets. This packet contains the length of the stream and its checksum. If the stream's length does not align on a block boundary, the fraction of a block left over is included in this packet. This packet represents the end of the stream. If the client wants to append to the stream, it must send a new Start packet and include this stream's Stream Identifier in it.

Data packet If Par3 is used in a streaming context, this packet contains a block of data from the "single virtual file".

Appending

I've discussed how to append in a streaming context above. It is a useful concept when streaming so that a sender can flush the stream and continue sending.

In a file-recovery context, you can also append and it acts like an incremental backup. You append by sending a new Start packet that refers to the preceding stream and then you send more File/Symbolic Link/Directory packets and a final Root packet. The interesting part is that the new Directory packets can point to the File, Symbolic Link, and Directory packets of the preceding stream. And the new File packets can map to blocks in the preceding "single virtual file". That's how it acts like an incremental backup.

So, if a user only changes "foo/bar.txt", the new stream only has to contain the changed blocks, a new File packet for "bar.txt", a new Directory packet for "foo", and a new Root packet. Plus any Matrix packets and Recovery packets that protect the new changes. So, incremental changes only take a small amount of storage.

Par inside Par / Par inside another file

Doing Par-inside-Par isn't very complicated. We only need to (1) allow a File packet to refer to the Par3 file itself and (2) allow a File packet to not map all of its contents to the "single virtual file". That way, we can have a second stream inside a Par3 file protect the first stream inside the Par3 file. (We cannot easily have the second stream protect itself!)

Conclusion

This design is not a very large departure from Par2. It allows any Galois Field and any Matrix. It makes the file/directories explicit and supports file permissions. It does create two separate usages (streaming and file sets) but I think there is a good overlap of those two usages.

Thoughts?

Yutaka-Sawada commented 3 years ago

About "command-line options" in Creator packet;

we should also require: the version of the client, the command-line options used when invoking the client, and how to contact the client's author if there's a problem.

A PAR client may not have command-line options. For example, QuickPar doesn't use a command line to make PAR2 files. I'm not sure the options are worth storing in PAR3 files. Though it would help debugging, it's useless for normal users.

mdnahas commented 3 years ago

The Creator packet is only there for debugging. Its sole purpose is to be able to track down any client that is producing erroneous files.

True, GUI clients do not have command-line options. I wrote "command-line options" just as a quick generalization, so that I didn't have to explain all the options that a GUI might want to include in the Creator packet. But, basically, the idea would be to include all the options that were used so that if a particular option was causing a bug, we could find it.

animetosho commented 3 years ago

Thanks for writing all that up - it sounds like a nice spec.

Some thoughts I had:

I think, for Par3, we should also require: the version of the client, the command-line options used when invoking the client, and how to contact the client's author if there's a problem.

All clients already include the version in the creator string. I suppose you could create a separate string field to perhaps make it easier to parse, but that's all it would really do...
I'm not sure of the value of some options string either - I don't really see how it'd help much with debugging or tracking down a problem (particularly since most things can be seen elsewhere in the Parchive itself), but I don't see harm in having a field that the creator could choose to ignore.

That will require also defining a random number generator

Spec defined, or the creator actually embeds code?

This packet contains the length of the stream and its checksum.

Is there a need for a second checksum if we already have a checksum packet?

If the stream's length does not align on a block boundary, the fraction of a block left over is included in this packet.

I don't quite understand this bit. If by "fraction of a block left over" you mean the number of bytes left over, couldn't that be figured out from the length specified above?

If the client wants to append to the stream, it must send a new Start packet and include this stream's Stream Identifier in it.

From what I understand, appending effectively creates an entirely new Parchive, with its own independent recovery blocks? The only difference from PAR2 here being that the two Parchives can be put in the same file instead of separate files?

Since there's another Start packet, does that mean this second stream can use a different block size, GF field width etc?

a new Start packet that refers to the preceding stream and then you send more File/Symbolic Link/Directory packets and a final Root packet

Does this mean the implementation could also just delete the root packet of the preceding stream, since it's no longer relevant?

Suppose we build a Parchive A, and append Parchive B, where B links to A. If you append again, with Parchive C, I presume this links to B so that a client can find the correct Root packet (which will be the one furthest along this singly linked chain)?

The interesting part is that the new Directory packets can point to the File, Symbolic Link, and Directory packets of the preceding stream. And the new File packets can map to blocks in the preceding "single virtual file". That's how it acts like an incremental backup.

If the appended Parchive allows an entirely different block size, GF field width etc, this sounds like it could be a nightmare to support if the two data sets aren't fully independent.
If they have to have the same properties, I don't see much merit in this append idea since you can just tack the new data on the end of this single virtual file, and rewrite the matrix/file/directory/root packets as needed.

Doing Par-inside-Par isn't very complicated. We only need to (1) allow a File packet to refer to the Par3 file itself and (2) allow a File packet to not map all of its contents to the "single virtual file". That way, we can have a second stream inside a Par3 file protect the first stream inside the Par3 file. (We cannot easily have the second stream protect itself!)

Does the File packet allow for holes, or should this secondary Parchive cover everything (recovery packets included) of the first Parchive, with no intermingling of packets from both streams allowed?
With appended Parchives, does this require updating this existing secondary Parchive, or do we create an entirely new secondary Parchive to just protect the appended one?

It also sounds like this second stream's file packet is quite critical, because if it becomes corrupt, you've lost all ability to deal with corruption in the first stream.

I'm presuming this secondary stream uses a different recovery set ID from the main set, which means the repairing client will need to be able to identify what this is from the main set.

Par inside another file

Was there any details on this?

It does create two separate usages (streaming and file sets) but I think there is a good overlap of those two usages.

I'm guessing this means that a client can choose to support only one case?

mdnahas commented 3 years ago

That will require also defining a random number generator

Spec defined, or the creator actually embeds code?

In the specification. It could be as simple as: the first random number is the hash of "0", the second random number is the hash of "1", etc.
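
As a sketch of that counter-based idea (std::hash here is only a placeholder; the spec would fix a concrete hash so every client produces the same sequence):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical counter-based generator: the i-th random value is the hash
// of the decimal string of i. std::hash is implementation-defined and used
// only as a stand-in for whatever hash the specification names.
static uint64_t spec_random(uint64_t i) {
    return std::hash<std::string>{}(std::to_string(i));
}
```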

This packet contains the length of the stream and its checksum.

Is there a need for a second checksum if we already have a checksum packet?

Yes. We need a checksum for the entire thing, so that we have a single point of failure and so the final step is the checksum. It is not sufficient to have a checksum for every block, because something could go wrong in the in-between steps of stitching the blocks together.

Many questions about appending: I'll address these in my next post.

Suppose we build a Parchive A, and append Parchive B, where B links to A. If you append again, with Parchive C, I presume this links to B so that a client can find the correct Root packet (which will be the one furthest along this singly linked chain)?

Yes.

The interesting part is that the new Directory packets can point to the File, Symbolic Link, and Directory packets of the preceding stream. And the new File packets can map to blocks in the preceding "single virtual file". That's how it acts like an incremental backup.

If the appended Parchive allows an entirely different block size, GF field width etc, this sounds like it could be a nightmare to support if the two data sets aren't fully independent.

In version 3.0, there would be only one block size and GF for a stream and any appended streams. The spec will be written so that a client might support multiple block sizes in the future.

Doing Par-inside-Par isn't very complicated. We only need to (1) allow a File packet to refer to the Par3 file itself and (2) allow a File packet to not map all of its contents to the "single virtual file". That way, we can have a second stream inside a Par3 file protect the first stream inside the Par3 file. (We cannot easily have the second stream protect itself!)

Does the File packet allow for holes, or should this secondary Parchive cover everything (recovery packets included) of the first Parchive, with no intermingling of packets from both streams allowed?

The File packet would allow for holes.

It does create two separate usages (streaming and file sets) but I think there is a good overlap of those two usages.

I'm guessing this means that a client can choose to support only one case?

Yes. I think some clients will only care about streaming. Other clients will only care about the file-based recovery. But if you implement the file-based recovery, it will be very little work to support the streaming use case.

mdnahas commented 3 years ago

So, appending.

The appending concept is really trying to solve two different problems. Maybe we want two different solutions; I'm not sure. Let me explain the problems, and maybe the design choices will be clearer and maybe I will answer your questions.

So, the first problem we face is when a sender "flushes" a stream. So, imagine you're sending updates to many downstream receivers and, after each update, you want to flush the stream and send the data and send recovery data. The problem is that each update may not fill a block and we do recovery on blocks.

This is easier to understand with an example. So, let's assume the block size is 100 bytes. The sender's first update is 950 bytes long, their second update is 150 bytes long, and the sender issues a flush between the updates. It is clear that the first update completely fills 9 blocks, but there are still 50 more bytes. What do we do with the last 50 bytes? AND, what do we do when the second update is sent?

One option is to put the last 50 bytes of the first update into the 10th block and, when the flush occurs, compute recovery data over the 10 blocks. That makes sense up to this point. When the program sends the second update of 150 bytes, we put 50 more bytes into the 10th block and the last 100 bytes into the 11th block, and calculate more recovery data over the 11 blocks. The difficulty with this approach is that we have two different 10th blocks --- the first batch of recovery data was calculated with 50 bytes of data in it and the second batch of recovery data was calculated with 100 bytes of data in it. That might lead to confusion. I'm sure some client writers can handle it, but I prefer to make it as easy as possible for the client writers.

The second option is to put the last 50 bytes of the first update into the 10th block and then say "there is no more data in the 10th block". The second update would then put 100 bytes into the 11th block and 50 bytes into the 12th block (and say "there is no more data in the 12th block"). In this scenario, the 10th block is the same when calculating recovery data for the first and second updates. The difficulty is that clients now need to keep track of the amount of data in each block. And that's ugly and complicated.

The third option is to never compute redundant data on partially filled blocks. So, the first update completely fills 9 blocks and the redundant data is only computed on those 9 blocks, and not the 50 bytes in the 10th block. That data has to be sent multiple times for it to be reliably received. The second update will fill the 10th block and 11th block, so its redundant data can be computed on all 11 blocks. Using this approach, any recovery is done on blocks that never change and we don't need to keep track of any partially filled blocks, except the last.
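To make the third option concrete, here is a rough sketch of what a sender might do. All names here are hypothetical and plain XOR stands in for the real recovery code: complete blocks feed the recovery computation, and the partial tail is held back and simply re-sent as plain data on each flush.

#include <cstdint>
#include <vector>

// Sketch of flush option 3: recovery is only ever computed over complete
// blocks; a partial tail block is carried forward and re-sent as plain
// data until later updates fill it.
struct StreamEncoder {
    size_t block_size;
    std::vector<std::vector<uint8_t>> complete_blocks;  // never change once filled
    std::vector<uint8_t> tail;                          // partial last block
    void append(const uint8_t* data, size_t len) {
        while (len > 0) {
            size_t want = block_size - tail.size();
            size_t take = len < want ? len : want;
            tail.insert(tail.end(), data, data + take);
            data += take;
            len -= take;
            if (tail.size() == block_size) {             // block is now complete
                complete_blocks.push_back(tail);
                tail.clear();
            }
        }
    }
    void flush() {
        std::vector<uint8_t> parity(block_size, 0);      // XOR stand-in for real recovery
        for (const auto& blk : complete_blocks)
            for (size_t i = 0; i < block_size; i++)
                parity[i] ^= blk[i];
        send_recovery(parity);
        if (!tail.empty())
            send_plain(tail);   // repeated on every flush until the block fills
    }
    void send_recovery(const std::vector<uint8_t>&) {}   // placeholder transport
    void send_plain(const std::vector<uint8_t>&) {}      // placeholder transport
};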

I like the last approach, but I'm not sold on it yet.

I said that appending was trying to solve two different problems. The other problem was incremental backups.

Again, it is easier to explain with an example. The example is that someone has a large home directory, say 400GB of data, and wants to protect it. So they run Par3 once and generate a PAR file. They then modify a few files in the directory. They then generate a new PAR file, but this second file mostly holds redundant data for the few files that changed. The second PAR file is much smaller than the first and very fast to generate. Then the data is damaged and the user wants to recover the data at the time of the second PAR file.

I'm going to refer to the first PAR file as the "full backup" and the second PAR file as the "incremental backup".

Each PAR file needs its own list of files and directories, but the incremental backup could reuse some of the descriptions of files and directories in the full backup. Ideally, we only record the changed files in the incremental backup.

The second PAR file could also reuse any data that was protected by the full backup. So, if the only change to a file is that a few bytes are modified, we want the incremental backup to protect those new bytes of data and reuse the blocks of data that were protected in the full backup.

So, in the design, the incremental backup is virtually appended to the full backup. That way, it can reuse the data blocks from the full backup and reuse the file/directory descriptions from the full backup. Each backup would have its own Root packet, because each backup represents a different snapshot of the directory tree.

When recovery is done using the incremental backup, we can use the recovery blocks of the full backup. BUT, we do not need to recover all the blocks from the full backup --- we only have to recover the blocks used by the incremental backup. That won't make a difference if Reed-Solomon was used for the full backup, but it will make a big difference if the full backup was done with LDPC or some of the other linear codes.

So that's what I was trying to do. I hope the explanation answered more of your questions.

animetosho commented 3 years ago

Thanks for the explanation.

It is not sufficient to have a checksum for every block, because something could go wrong in the inbetween steps of stitching the blocks together.

Can you think up any scenario where that is even possible?

Assuming that each packet has a checksum, the Checksum packet's checksum effectively is a checksum of the whole stream, acting like a two level hash tree. Hash trees are already well proven to work without the need for any other hash, and have been used in a fair number of places.

With the stream having a second checksum, presumably this also means that each File will also have one?

Mentioned before, but I'm with @Yutaka-Sawada in really not liking the file hash in PAR2 - it puts a hard limit on achievable performance, due to not being parallelisable (you can process multiple files in parallel, but balancing that out with I/O can be a little tricky).
This is less of an issue if MD5 is replaced with a faster hash, but it still seems like an unnecessary limiter.

If the file checksum was CRC32, that would be easier, since CRC can be computed in parallel. Including it in PAR2 would be pointless, since you can calculate the file CRC32 by stringing together the block CRCs, but it'd make a little more sense in this proposal as files don't have to align to blocks.
Of course, the Checksum packet's checksum is just as good, but if you must, an additional CRC32 per file doesn't really hurt I guess.
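As a small aside, the "stringing together" is something zlib already exposes, so a per-file CRC32 can be derived from the per-block CRC32s without re-reading the data. A minimal sketch, assuming zlib is linked and the per-block CRCs and lengths are already known:

#include <zlib.h>
#include <utility>
#include <vector>

// Combine per-block CRC32s into a whole-file CRC32. Each entry is
// (crc_of_block, block_length_in_bytes), in file order.
uLong file_crc32(const std::vector<std::pair<uLong, size_t>>& blocks) {
    uLong crc = crc32(0L, Z_NULL, 0);            // CRC32 of the empty string
    for (const auto& b : blocks)
        crc = crc32_combine(crc, b.first, b.second);
    return crc;
}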

When the program sends the second update of 150 bytes, we put 50 more bytes into the 10th block

I just realised that this isn't possible unless the checksum supports appending (probably not likely for a cryptographic hash), or the checksum can be amended in the subsequent send. For the latter case, it does feel kinda pointless to force the last block to be resent/amended.

That data has to be sent multiple times for it to be reliably received.

That feels somewhat counter-intuitive to the point of Parchive. I think a flush generally means that the receiver gets the full picture, and shouldn't require some different renegotiation mechanism (assuming that's even possible).

The difficulty is that clients now need to keep track of the amount of data in each block. And that's ugly and complicated.

Actually that doesn't sound complicated - PAR2 clients already zero pad files to fit blocks, so I'd imagine it's just the same thing. You don't really need to know the amount of data for every block, since the end-of-stream packet includes the total length.

They then generate a new PAR file, but this second file mostly holds redundant data for the few files that changed. The second PAR file is much smaller than the first and very fast to generate

I think I get your idea now - append in streaming mode is a completely separate Parchive which covers the new data only (and maybe unaligned part from the previously flushed block) - no back references, but concatenated in the same stream.
Append in file mode is also a completely separate Parchive, in a separate file - you're not trying to mix two PARs in a file (ignoring the Parchive metadata protection). In terms of recovery, there's no cross referencing between the Parchive sets, only the File/Directory packets can reference the linked Parchive.

I haven't thought much about how to implement this, but trying to repair from two or more recovery sets at the same time does add some complexity.

Each PAR file needs its own list of files and directories, but the incremental backup could reuse some of the descriptions of files and directories in the full backup

Since appending always sends an updated Root packet, the Root packet of the previous Parchive is effectively unneeded. It, by itself, can be used for repair purposes, but doing so would likely be a mistake. Correct?

So, if the only change to a file is that a few bytes are modified, we want the incremental backup to protect those new bytes of data

Presumably the appender does a full verify on the existing Parchive, then finds the updated blocks as 'damaged'. Instead of repairing these though, it just considers these to be the input blocks for the 'appended' Parchive and generates recovery that way.
So in most cases, it's more a full block than just the bytes that were changed. Does that sound right?

I can see this "updating causing blocks to be damaged" could get problematic over time, since the original Parchive will just keep losing more valid blocks as updates come through. I can't think of a better way to do it though, so perhaps the user will need to occasionally recompute the entire Parchive or have some merge/compaction routine.

mdnahas commented 3 years ago

Thanks for the explanation.

It is not sufficient to have a checksum for every block, because something could go wrong in the inbetween steps of stitching the blocks together.

Can you think up any scenario where that is even possible? Pretty much any bug. And the bug could be on the side of the sender, as well as the receiver. And any random cosmic ray that interferes.

For any archiving program --- or any program doing compression, encrypting, or sending --- the very first thing that the writing process should do is compute a checksum of all the data. And the very last thing the reading process should do is verify that the data has arrived correctly. For the streaming context, we need that checksum.

Assuming that each packet has a checksum, the Checksum packet's checksum effectively is a checksum of the whole stream, acting like a two level hash tree. Hash trees are already well proven to work without the need for any other hash, and have been used in a fair number of places.

So, what I think you're asking is: In the file-recovery context, do we need the Checksum packet? Because we already have a Root packet, which has a checksum that contains (directly or indirectly) all the File, Directory, and Symbolic Link packets and the File packets contain the checksum of the data in the files.

And my answer is that, for the file recovery context, we do not need the Checksum packet. The Root packet will serve as the checksum-of-all-the-data for the file recovery context.

With the stream having a second checksum, presumably this also means that each File will also have one? Yes. The File packet will have a checksum for all the data in the file.

Mentioned before, but I'm with @Yutaka-Sawada in really not liking the file hash in PAR2 - it puts a hard limit on achievable performance, due to not being parallelisable (you can process multiple files in parallel, but balancing that out with I/O can be a little tricky). This is less of an issue if MD5 is replaced with a faster hash, but it still seems like an unnecessary limiter.

If the file checksum was CRC32, that would be easier, since CRC can be computed in parallel. Including it in PAR2 would be pointless, since you can calculate the file CRC32 by stringing together the block CRCs, but it'd make a little more sense in this proposal as files don't have to align to blocks. Of course, the Checksum packet's checksum is just as good, but if you must, an additional CRC32 per file doesn't really hurt I guess.

As I said above, the first thing any archiving program should do is compute a checksum of all the data being sent. That has to include a checksum of every file's contents. That checksum has to be calculated before the program does anything where a bug or random event might modify the data. I cannot imagine a design that does not have a hash for every file. I am even annoyed that any hash of the file's meta data (file permissions, creation time, etc.) has to be after that data has been transformed by the program into a byte sequence and stuffed in a packet, because there is no "raw" form of on-disk meta data.

I consider per-file hashes a necessity. I am very open to discussing which hash function to use. There is a large variety out there and we can always develop our own, or ask a researcher to develop a new one with our requirements.

I chose MD5 for Par2 because it was large, well-known, and there were good libraries for it. Par2 didn't rely on it being cryptographically secure. It was only important that each block and file hash was unique.

When the program sends the second update of 150 bytes, we put 50 more bytes into the 10th block

I just realised that this isn't possible unless the checksum supports appending (probably not likely for a cryptographic hash), or the checksum can be amended in the subsequent send. For the latter case, it does feel kinda pointless to force the last block to be resent/amended.

That data has to be sent multiple times for it to be reliably received.

That feels somewhat counter-intuitive to the point of Parchive. I think a flush generally means that the receiver gets the full picture, and shouldn't require some different renegotiation mechanism (assuming that's even possible).

Remember, we're not sure if people will use Par3 in a streaming context. And, even if they do, flush is a rare operation. And, even if they do flushes, most users will try to send data in complete blocks because that's easy to recover. Because of all this, I'm fine if the support for flushes is inefficient. And, because it is a rare event, I'd prefer the client's author only has to worry about it when it happens and not have code running all the time to handle it.

The difficulty is that clients now need to keep track of the amount of data in each block. And that's ugly and complicated.

Actually that doesn't sound complicated - PAR2 clients already zero pad files to fit blocks, so I'd imagine it's just the same thing. You don't really need to know the amount of data for every block, since the end-of-stream packet includes the total length.

I think it would be complicated because it would be in the middle of everything else. I definitely think the client's authors are up to the task, if we chose that design. But, as I said above, I think flushes will be a very rare event and I'd like to make the core recovery code as clean and simple as possible.

They then generate a new PAR file, but this second file mostly holds redundant data for the few files that changed. The second PAR file is much smaller than the first and very fast to generate

I think I get your idea now - append in streaming mode is a completely separate Parchive which covers the new data only (and maybe unaligned part from the previously flushed block) - no back references, but concatenated in the same stream. Append in file mode is also a completely separate Parchive, in a separate file - you're not trying to mix two PARs in a file (ignoring the Parchive metadata protection). In terms of recovery, there's no cross referencing between the Parchive sets, only the File/Directory packets can reference the linked Parchive. I haven't thought much about how to implement this, but trying to repair from two or more recovery sets at the same time does add some complexity.

I think you have the idea. You'll only be recovering 1 set of files at any time. (The set reachable from one Root packet.) I don't think it will be very different from the one-file case.

Each PAR file needs its own list of files and directories, but the incremental backup could reuse some of the descriptions of files and directories in the full backup

Since appending always sends an updated Root packet, the Root packet of the previous Parchive is effectively unneeded. It, by itself, can be used for repair purposes, but doing so would likely be a mistake. Correct?

Yes. Exactly. The only files/directories you care about are those reachable by the second Root packet.

So, if the only change to a file is that a few bytes are modified, we want the incremental backup to protect those new bytes of data

Presumably the appender does a full verify on the existing Parchive, then finds the updated blocks as 'damaged'. Instead of repairing these though, it just considers these to be the input blocks for the 'appended' Parchive and generates recovery that way. So in most cases, it's more a full block than just the bytes that were changed. Does that sound right?

My guess is that the appender would compare the length and last modification timestamp of the files. And, if very picky, could verify checksums. Any new or modified files would go into the new archive.

The appender would then compute checksums for the blocks of those files. If any of those checksums are already in the original archive, the appender could consider those blocks already protected. The blocks that hadn't been seen before would go into the appended Parchive.
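A rough sketch of that appender step (the names and the hash function are hypothetical, not from the spec): hash each block of a changed file and only queue the blocks whose hashes the original archive doesn't already know.

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Keep only the blocks that the original archive has not seen before;
// everything else is considered already protected.
std::vector<std::vector<uint8_t>> blocks_to_append(
        const std::vector<std::vector<uint8_t>>& file_blocks,
        const std::unordered_set<std::string>& original_block_hashes,
        std::string (*block_hash)(const std::vector<uint8_t>&)) {
    std::vector<std::vector<uint8_t>> fresh;
    for (const auto& blk : file_blocks)
        if (original_block_hashes.count(block_hash(blk)) == 0)
            fresh.push_back(blk);   // unseen block: protect it in the appended Parchive
    return fresh;
}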

I can see this "updating causing blocks to be damaged" could get problematic over time, since the original Parchive will just keep losing more valid blocks as updates come through. I can't think of a better way to do it though, so perhaps the user will need to occasionally recompute the entire Parchive or have some merge/compaction routine.

Yes, that's true. As more changes are made to a set of files, the original backup becomes less and less useful. At some point, it is easier to just generate a new complete backup.

mdnahas commented 3 years ago

So, per-file hash functions.

A year ago, I looked at some. The fast cryptographic hashes that stood out were KangarooTwelve (K12) and Blake2. The developers of Blake2 were eager to get it used, but it didn't have good libraries at the time.

A hash is "cryptographic", if it is difficult for an adversary to guess the input given the output. But we don't care about that. We care that the output is unique. Wikipedia says those hash functions are fingerprints and we could use Rabin's fingerprint.

Wikipedia says CRCs are a checksum, which means they are used for detecting certain types of transmission errors. I'm not sure how that is different from other hashes.

Besides speed, there is a requirement on size. My current idea for Par3 is to support filesystems up to 2^128 in size. (We are expected to exceed 2^64 by 2040.) We can support 2^128 using up to 2^64 blocks with a blocksize of up to 2^64. Thanks to the birthday problem, with 2^64 block hashes, the hash needs to have far more than 0.5*(2^64)^2 = 2^127 possible values to avoid collisions, so something larger than 128 bits = 16 bytes. We're probably fine with a 16-byte value, but I'd be happier with 20 bytes or larger.
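To spell out the back-of-the-envelope math: with n blocks and a b-bit hash, the birthday approximation gives P(collision) ≈ n^2 / (2 * 2^b). With n = 2^64 and b = 128, that is roughly 2^128 / 2^129 = 1/2, while a 160-bit (20-byte) hash brings it down to about 2^-33.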

XXHash claims to be fast. XXH128 has a 128-bit output and is about 50 times faster than MD5.

I'm not sure how fast Rabin's fingerprint is. This website says it is fast and works as a "rolling hash". This paper seems to cover implementing it with a GPU for high-speed.

Does anyone know of other options we can look at?

Yutaka-Sawada commented 3 years ago

The appender would then compute checksums for the blocks of those files. If any of those checksums are already in the original archive, the appender could consider those blocks already protected. The blocks that hadn't been seen before would go into the appended Parchive.

This appending feature is interesting. The system is similar to "differential backup". Instead of calculating "recovery blocks" from whole source blocks, it stores "different block data" only. While the differences (modified blocks) are few, the process is very fast and the additional file size is small.

For example, when there are 3 source files and a recovery file for them;
FileDataA
FileDataB
FileDataC
FullRecoveryData = FileDataA + FileDataB + FileDataC

When a user updates FileDataA, it stores the differing blocks only. Because it doesn't read nor modify other files, appending is very fast;
UpdatedFileDataA
FileDataB
FileDataC
FullRecoveryData
DiffDataA = UpdatedFileDataA - FileDataA

To recover UpdatedFileDataA from other files, it temporarily restores FileDataA at first. Then, it reconstructs UpdatedFileDataA from the FileDataA and DiffDataA;
FileDataA = FullRecoveryData - FileDataB - FileDataC
UpdatedFileDataA = FileDataA + DiffDataA

To recover FileDataB or FileDataC, it partially restores FileDataA at first. Because the differing blocks in FileDataA are lost, it requires more redundancy to recover the other files. When there are enough recovery blocks, it's possible to recover all lost blocks;
PartialFileDataA = UpdatedFileDataA - DiffDataA
FileDataB = FullRecoveryData - PartialFileDataA - FileDataC
FileDataC = FullRecoveryData - PartialFileDataA - FileDataB

As more changes are made to a set of files, the original backup becomes less and less useful.

If it stores "differential parity blocks", the problem may be solved.

For example, in the above case, there are 3 blocks in FileA; FileDataA = [111][222][333]

I modify 1 block as a test case; UpdatedFileDataA = [111][222][444]

Then, the "differential backup" is the modified 1 block; DiffDataA = [444]

Though I can reconstruct UpdatedFileDataA from FileDataA and DiffDataA, I cannot recover FileDataA without FullRecoveryData and the other files;
UpdatedFileDataA = FileDataA and DiffDataA = [111][222][333] and [444] = [111][222][444]
PartialFileDataA = UpdatedFileDataA's same blocks = [111][222][ lost block ]

If I make "differential parity blocks" at their incremental backup time, it can recover FileDataA, too; DiffParityDataA = FileDataA + UpdatedFileDataA = [333] + [444] = [777]

Now, it's possible to convert between FileDataA and UpdatedFileDataA in either direction;
FileDataA = DiffParityDataA - UpdatedFileDataA = [777] - [111][222][444] = [111][222][333]
UpdatedFileDataA = DiffParityDataA - FileDataA = [777] - [111][222][333] = [111][222][444]

Because DiffParityDataA can restore FileDataA from UpdatedFileDataA, it doesn't need additional recovery blocks. But this method requires both FileDataA and UpdatedFileDataA to calculate the parity blocks, such as just before overwriting the old backup data in an incremental backup.

Does anyone know of other options we can look at?

Though I cannot understand the theory, Rabin fingerprint seems to be similar to CRC. I found an article on the internet; Do Rabin Fingerprints have any advantages over CRC? So, I feel that faster (simpler) is good.

I prefer CRC-32C or CRC-64-ISO for the "rolling hash". If SSE4.2 is available, CRC-32C is the fastest. But a 32-bit hash may cause collision problems with so many possible blocks (up to 2^64 blocks!).

When the number of blocks is expected to be tremendous, or a software implementation is required, CRC-64-ISO would be the fastest. It's possible to calculate CRC-64-ISO quickly without table lookups on a 64-bit CPU. Fast CRCs by Gam D. Nguyen
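For reference, a simple bit-at-a-time CRC-64-ISO looks like the code below (using the reflected polynomial 0xD800000000000000 and all-ones init/xorout, as in the common CRC-64/GO-ISO parameters). The point of the paper is that the sparse polynomial also allows a much faster shift/XOR formulation without tables, which this sketch does not attempt.

#include <cstddef>
#include <cstdint>

// Reference bitwise CRC-64-ISO (polynomial x^64 + x^4 + x^3 + x + 1).
// Pass the previous return value as `crc` to continue a running CRC.
uint64_t crc64_iso(const uint8_t* data, size_t len, uint64_t crc = 0) {
    crc = ~crc;                                   // all-ones initial value
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xD800000000000000ULL : 0);
    }
    return ~crc;                                  // all-ones final XOR
}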

animetosho commented 3 years ago

So, what I think you're asking is: In the file-recovery context, do we need the Checksum packet? Because we already have a Root packet, which has a checksum that contains (directly or indirectly) all the File, Directory, and Symbolic Link packets and the File packets contain the checksum of the data in the files.

Actually the opposite. You need the Checksum packet because you otherwise wouldn't be able to identify corrupt blocks. I'm questioning the need for a checksum in the File packet.

I cannot imagine a design that does not have a hash for every file.

I'm not disagreeing with you there, but consider that there are multiple ways you could approach computing this hash.

If we assume MD5, some ideas could be:

  1. just pass the file straight through MD5, and use the output as the hash
  2. compute the MD5 of the first 16KB, then the MD5 of the next 16KB, and continue on in 16KB chunks. Concatenate all these MD5s into a long string, and compute the MD5 of this, the output of which serves as the file hash
  3. as above, breaking the file into 16KB chunks, but instead of concatenating all hashes together, concatenate pairs of hashes, computing the MD5 of each, which halves the amount of hash data, then keep repeating the process until you're left with one hash. This is otherwise known as Merkle-tree hashing

It seems like you're a strong advocate for (1), however I'm suggesting that 2 and 3 are also valid choices, and are widely used. For example, the Bittorrent info hash (a torrent's unique identifier) is computed similar to (2), whilst the Bittorrent V2 specification works similarly to (3).
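To make option 2 concrete, here is a sketch of the structure with a generic stand-in hash function (nothing here is tied to MD5 specifically, and the names are made up):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Option 2: hash each 16 KiB chunk, concatenate the chunk digests, and
// hash that concatenation to get the file hash. HashFn stands in for
// MD5/BLAKE3/K12/whatever gets chosen.
using HashFn = std::string (*)(const uint8_t* data, size_t len);
std::string chunked_file_hash(const std::vector<uint8_t>& file, HashFn hash,
                              size_t chunk = 16 * 1024) {
    std::string digests;                          // concatenated chunk digests
    for (size_t off = 0; off < file.size(); off += chunk) {
        size_t len = std::min(chunk, file.size() - off);
        digests += hash(file.data() + off, len);
    }
    return hash(reinterpret_cast<const uint8_t*>(digests.data()), digests.size());
}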

Also keep in mind that the concept of stringing together blocks exists everywhere, even if you don't introduce the notion yourself. Disks store things as sectors (or flash drives use "pages" but emulate sectors, meaning they effectively have two layers of blocks). RAID arrays often have a notion of a stripe size (just another form of "block"). File systems are made up of blocks. Files are read off disk and cached by the OS in pages (i.e. blocks of memory) before being transferred to the application requesting the read. If this is a remotely mounted filesystem, data will be broken into (likely TCP) packets, sent over the wire, then re-assembled at the other end.
Even MD5 works by splitting the input into 64 byte blocks.

A bit of an aside, but no performance conscious client will literally hash the whole thing before doing anything else (if that's what you're trying to imply should be done) - hashes are typically computed alongside processing, and even par2cmdline does this (if sufficient memory is available (despite the confusingly misnamed deferhashcomputation variable)).

I am even annoyed that any hash of the file's meta data (file permissions, creation time, etc.) has to be after that data has been transformed by the program into a byte sequence and stuffed in a packet, because there is no "raw" form of on-disk meta data.

I'm a bit lost on you with this one.
Putting aside compatibility issues (which alone would make the concept infeasible), my guess is that there's some fear that any form of transformation imposes serious risks. I wouldn't say the thought is completely unfounded, but rather, ignorant of the larger picture - from a software perspective, everything is the result of many layers of transformation, whether it be probing arrangements of magnets or a decryption performed by dm-crypt.

Converting metadata to/from a byte representation is likely a few lines of simple code. If you believe that reading data off disks generally works, you should have nothing to fear from this.

I think it would be complicated because it would be in the middle of everything else. [...] I think flushes will be a very rare event and I'd like to make the core recovery code as clean and simple as possible.

I don't get the "the middle of everything else" bit, but zero-padding blocks sounds like, by far, the simplest and easiest to code solution.

To re-use your 950 bytes with 100 byte block size example, and assume we have a single recovery block (which is effectively an XOR sum of all other blocks):

  1. sender sends 9x 100 byte blocks, plus 1x 50 byte block
  2. sender computes the XOR sum of all these blocks, zero-padding the last block so that it aligns correctly, and sends this 100 byte recovery block
  3. receiver gets the 9x 100 byte blocks as usual. The next data block has a smaller size as declared by the packet header, but this is trivial to handle as the receiver already needs to deal with differently sized blocks. The received End-of-stream packet confirms the total length of data is as received
  4. if corruption is detected in a received block, recovery is computed by computing an XOR sum across valid blocks, zero-padding the last block if necessary, alongside the 100 byte recovery block

The steps needed to add padding or deal with smaller blocks are probably less than 5 lines of code (an if condition, maybe a memcpy/memset, maybe a modulo somewhere).
On the other hand, inventing a retransmit protocol and implementing that is likely much more than 5 lines of code.
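For what it's worth, here's roughly what that padding logic looks like, with a plain XOR sum standing in for the real recovery computation (so this is only a sketch of the padding handling, not of the actual code):

#include <cstdint>
#include <vector>

// XOR all data blocks into one recovery block. The final block may be
// shorter than block_size; bytes past its end are treated as zero, which
// is exactly the zero-padding rule.
std::vector<uint8_t> xor_recovery(const std::vector<std::vector<uint8_t>>& blocks,
                                  size_t block_size) {
    std::vector<uint8_t> parity(block_size, 0);
    for (const auto& blk : blocks)
        for (size_t i = 0; i < blk.size(); i++)   // blk.size() <= block_size
            parity[i] ^= blk[i];
    return parity;
}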

A hash is "cryptographic", if it is difficult for an adversary to guess the input given the output. But we don't care about that. We care that the output is unique

Cryptographic hashes have a certain guarantee of uniqueness. The same cannot be said for non-cryptographic hashes.
As such, using cryptographic hashes makes designs simpler, because the client author doesn't have to be concerned about handling different blocks with the same hash.

I personally have a preference for cryptographic hashes, as you don't necessarily know what context users will use the format in, and most will be unaware of how it works under the hood. Parchive may not be "secure" in all possible contexts, but eliminating one source of problem can sometimes be very helpful.
Having said that though, I don't otherwise have any strong attachment to a crypto hash either, but if you're looking for the simplest solution, hashes with stronger guarantees can make a design easier.

it didn't have good libraries at the time

I'm personally not too focused on availability of libraries - by the time anyone tries to implement the next Parchive, the situation may change. If the hash isn't too complex, one can just implement it oneself if necessary.
Note that both MultiPar and ParPar use a custom MD5 implementation, despite the wide availability of MD5 implementations. par2cmdline doesn't use a library implementation of MD5 either - I don't know if the code was taken from an existing implementation, but the code doesn't attribute it to anything else, so I assume it was also specifically implemented for par2cmdline.

But popular hashes do have their advantages, including being more vetted/understood, and there are some PAR2 parsers which likely use library implementations. I just don't consider the lack of libraries at present to be a huge concern.

My current idea for Par3 is to support filesystems up to 2^128 in size. (We are expected to exceed 2^64 by 2040.)

Exceeding 2^64 bytes (16EiB) by 2040 for typical servers or consumer systems sounds unlikely. Perhaps possible for some large-ish storage system with hundreds/thousands of disks.

I'm not sure about 2^128 though (the number is so large it's considered cryptographically secure) - I'd expect 2^80 bytes (= 1048576 EiB) to not be passable for a very very long time, and that includes large-ish storage servers.

mdnahas commented 3 years ago

This appending feature is interesting. The system is similar to "differential backup". ...

@Yutaka-Sawada, I didn't get the point of your long example. Are you saying that if we have two versions of a file, we should store the difference between them and protect that? That is, create a file that is essentially bidirectional, allowing us to compute the second version given the first version and the first version given the second version?

So far, I've been assuming that we could do recovery on the second version directly. That is, without ever restoring the first version. Of course, given the equations, we might be able to recover more data for the second version by also recovering the data for the first version.

Let me explain with an example.
FileA V1 has 3 blocks: [1][2][3].
After an update, FileA V2 has a block changed: [1][2][4].
For FileA V1, we created 3 recovery blocks: [1+2][2+3][1+3]
For FileA V2, we created 2 recovery blocks: [2+4][1+4]

So, my idea is that when we do recovery on V2, we can use the [1+2] block from the original set of recovery blocks and the [2+4] and [1+4] blocks from the new set. But it's clear that if we lost all data blocks and the recovery block [2+4], we wouldn't be able to recover V2 with just [1+2] and [1+4]. BUT, we could recover all the blocks for V1 and V2 if we had [1+2], [2+3], [1+3] and [1+4]. (BTW, I am assuming recovery blocks are not just XORs of each input, but each input is multiplied by a different constant like in Reed-Solomon.)

Would this change if the recovery blocks for FileA V2 were bidirectional? For example, those blocks were [2+4-3] and [1+4-3]? Well, in this case, knowing block 3 is vital to any recovery. And, I think, in most cases, we won't have block 3 around. That is, the user will have edited the file on the drive and the file will hold only block 4 and not block 3. We will have more recovery blocks around that contain 3, but I'm not sure if that helps us or hurts us.

Incremental backups only help us if the client doesn't have to read every input file. That is, the client only reads the input files where the length or the last-modification timestamp have changed. Further, I think they only make sense with LDPC or some other sparse code matrix algorithm. I need to figure out when people might use this feature and now seems like a good time to calculate the numbers.

For the sparse random matrix that I'm thinking of using, each of the N input blocks appears log(N) times in the recovery data. (That is, the code matrix has log(N) non-zero values in each row.) If there are R recovery blocks, then an individual block appears in a log(N)/R portion of the recovery blocks. When the user changes K input blocks, only (1 - log(N)/R)^K of the recovery blocks will remain valid. That value is equal to ((R - log(N))^K)/(R^K). If we have a fixed percentage of recovery blocks, so R = N*5% or R = N*10% or, generically, R = N*P, then making an incremental backup only makes sense when N*P >> log(N). That means this incremental approach only works when N is very large or K is very small.

Let me do two concrete examples. In the first, we have N=1,000 blocks of input data and R=50 recovery blocks (or 5%). Each input block would be part of log(N)=10 recovery blocks. That's harsh, since a change to any input block will invalidate 20% of the recovery blocks. After just 3 input blocks changed, half the recovery blocks would not be useful. After 10 input blocks changed, nearly all of them would be useless.

But with N=1,000,000 and R=50,000 (still 5%), things change. Each input block is part of log(N)=20 recovery blocks. That number is bigger, but it is a much smaller portion of all the recovery blocks. One change to an input block only invalidates 0.04% of recovery blocks. It takes 1,700 changes to input blocks before half the recovery blocks would not be useful. After 11,500 changes to input blocks, 1% of the recovery data is still useful.
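Here's a little throwaway program to reproduce those numbers (log is base 2, matching the log(N) above; the exact figures depend on rounding, so treat them as estimates):

#include <cmath>
#include <cstdio>

// Fraction of recovery blocks still valid after K changed input blocks,
// when each input block appears in log2(N) of the R recovery blocks.
double valid_fraction(double N, double R, double K) {
    return std::pow(1.0 - std::log2(N) / R, K);
}
int main() {
    std::printf("N=1e3, R=50,    K=3:     %.2f\n", valid_fraction(1e3, 50, 3));        // ~0.51
    std::printf("N=1e6, R=50000, K=1700:  %.2f\n", valid_fraction(1e6, 5e4, 1700));    // ~0.50
    std::printf("N=1e6, R=50000, K=11500: %.3f\n", valid_fraction(1e6, 5e4, 11500));   // ~0.010
}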

Having looked at the numbers, recovering the second version of a file without also recovering the first version only makes sense for really small changes. Even in the N=1,000,000 case, each changed input block invalidates 20 recovery blocks, and there are fewer recovery blocks than input blocks. If more than 0.1% of the input blocks change, you would probably want to recompute all the recovery data. BUT, if we assume we are recovering both the first and second versions of files, the numbers work much better.

Does anyone know of other options we can look at?

Though I cannot understand the theory, Rabin fingerprint seems to be similar to CRC. I found an article on the internet; Do Rabin Fingerprints have any advantages over CRC? So, I feel that faster (simpler) is good.

Yes, Rabin Fingerprints seem really close to CRC. I'm not sure what their difference is yet.

I prefer CRC-32C or CRC-64-ISO for the "rolling hash". If SSE4.2 is available, CRC-32C is the fastest. But a 32-bit hash may cause collision problems with so many possible blocks (up to 2^64 blocks!).

The rolling hash is just used to prevent running the unique hash on every window. So think of it as a 2^32 time speed up. ;)
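To illustrate that speed-up, here is a Rabin-Karp-style rolling window sketch (the actual spec would presumably use CRC-32C or CRC-64-ISO as the rolling function; this is only to show the O(1) slide, and the names are made up):

#include <cstddef>
#include <cstdint>

// Rolling hash over a fixed-size window: sliding by one byte is O(1),
// so every byte offset can be scanned cheaply and only windows whose
// rolling value matches a known block get the expensive unique hash.
struct RollingHash {
    static constexpr uint64_t B = 0x100000001B3ULL;  // arbitrary odd multiplier
    uint64_t value = 0;
    uint64_t b_pow = 1;        // B^(window-1), used to remove the oldest byte
    size_t window;
    explicit RollingHash(size_t w) : window(w) {
        for (size_t i = 1; i < w; i++) b_pow *= B;   // arithmetic is mod 2^64
    }
    void init(const uint8_t* p) {                    // hash the first full window
        value = 0;
        for (size_t i = 0; i < window; i++) value = value * B + p[i];
    }
    void roll(uint8_t oldest, uint8_t incoming) {    // slide the window one byte
        value = (value - oldest * b_pow) * B + incoming;
    }
};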

When the number of blocks is expected to be tremendous, or a software implementation is required, CRC-64-ISO would be the fastest. It's possible to calculate CRC-64-ISO quickly without table lookups on a 64-bit CPU. Fast CRCs by Gam D. Nguyen

That's good to know. The paper mentions a 128-bit CRC. Is that standardized?

mdnahas commented 3 years ago

So, what I think you're asking is: In the file-recovery context, do we need the Checksum packet? Because we already have a Root packet, which has a checksum that contains (directly or indirectly) all the File, Directory, and Symbolic Link packets and the File packets contain the checksum of the data in the files.

Actually the opposite. You need the Checksum packet because you otherwise wouldn't be able to identify corrupt blocks. I'm questioning the need for a checksum in the File packet.

I cannot imagine a design that does not have a hash for every file.

I'm not disagreeing with you there, but consider that there are multiple ways you could approach computing this hash.

Okay. I'm following you now.

In Par2, the block checksums were stored in Input File Slice Checksum packets. So, they were directly associated with the file. You're saying that we can just use the checksum of that packet as the file's checksum.

In the Par3 design that I'm thinking of, the Checksum packets contain checksums for blocks of the "single virtual file" (a.k.a., the stream) and the File packet contains a mapping from the file's contents to the "single virtual file". The reason for this is that I can imagine two files having similar contents. So, the common sections of the two files may share blocks in the "single virtual file". Moreover, if we allow multiple small files' contents to be mapped contiguously to a single block, the block checksums will not have a relationship to any single file's contents.

This is why I had a hard time understanding what you were saying. The Checksum packets in this design have changed and the Checksum packet won't fulfill the role of the old Input File Slice Checksum packet.

Still, we could compute the file checksum using a checksum-of-block-checksums, like you're talking about. The data just wouldn't be inside a packet like in Par2.
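Purely as an illustration of the mapping idea (none of these field names or sizes are from a spec draft), a File packet's contents might conceptually look like:

#include <cstdint>
#include <vector>

// Illustrative only: a File packet maps ranges of a file's contents onto
// ranges of the "single virtual file" (the stream). Two files with common
// content can map onto the same virtual-file blocks, and several small
// files can map contiguously into a single block.
struct FileMapping {
    uint64_t file_offset;      // where this range starts within the file
    uint64_t virtual_offset;   // where it lands in the single virtual file
    uint64_t length;           // bytes covered by this range
};
struct FilePacketSketch {
    std::vector<uint8_t> file_hash;      // whole-file checksum (hash function TBD)
    std::vector<FileMapping> mappings;   // may contain holes, per the Par-inside-Par idea
};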

If we assume MD5, some ideas could be:

1. just pass the file straight through MD5, and use the output as the hash

2. compute the MD5 of the first 16KB, then the MD5 of the next 16KB, and continue on in 16KB chunks. Concatenate all these MD5s into a long string, and compute the MD5 of this, the output of which serves as the file hash

3. as above, breaking the file into 16KB chunks, but instead of concatenating all hashes together, concatenate pairs of hashes, computing the MD5 of each, which halves the amount of hash data, then keep repeating the process until you're left with one hash. This is otherwise known as Merkle-tree hashing

It seems like you're a strong advocate for (1), however I'm suggesting that 2 and 3 are also valid choices, and are widely used. For example, the Bittorrent info hash (a torrent's unique identifier) is computed similar to (2), whilst the Bittorrent V2 specification works similarly to (3).

Also keep in mind that the concept of stringing together blocks exists everywhere, even if you don't introduce the notion yourself. Disks store things as sectors (or flash drives use "pages" but emulate sectors, meaning they effectively have two layers of blocks). RAID arrays often have a notion of a stripe size (just another form of "block"). File systems are made up of blocks. Files are read off disk and cached by the OS in pages (i.e. blocks of memory) before being transferred to the application requesting the read. If this is a remotely mounted filesystem, data will be broken into (likely TCP) packets, sent over the wire, then re-assembled at the other end. Even MD5 works by splitting the input into 64 byte blocks.

A bit of an aside, but no performance conscious client will literally hash the whole thing before doing anything else (if that's what you're trying to imply should be done) - hashes are typically computed alongside processing, and even par2cmdline does this (if sufficient memory is available (despite the confusingly misnamed deferhashcomputation variable)).

Yes, a performance conscious client will. But what about an assurance-conscious client? Even with a performance-conscious client, if there were multiple operations done on each block of data, I would make sure that calculating its checksum was the first operation done.

We can use a checksum-of-checksums for the file checksum. Or a checksum that can be composed. We just need to pick one.

FYI, at the moment, the blocksize is in the Start packet and the File packet doesn't currently depend on the Start packet. This is because I could imagine doing repairs on multiple blocksizes. Multiple blocksizes certainly won't be part of Par 3.0, but I'm leaving it open for the future. I'm only saying this because we might not want to have the file checksum depend on the blocksize.

I am even annoyed that any hash of the file's meta data (file permissions, creation time, etc.) has to be after that data has been transformed by the program into a byte sequence and stuffed in a packet, because there is no "raw" form of on-disk meta data.

I'm a bit lost on you with this one. Putting aside compatibility issues (which alone would make the concept infeasible), my guess is that there's some fear that any form of transformation imposes serious risks. I wouldn't say the thought is completely unfounded, but rather, ignorant of the larger picture - from a software perspective, everything is the result of many layers of transformation, whether it be probing arrangements of magnets or a decryption performed by dm-crypt.

Converting metadata to/from a byte representation is likely a few lines of simple code. If you believe that reading data off disks generally works, you should have nothing to fear from this.

It is a simple transformation, but it is a transformation. And when I have a dozen client authors each writing their own transformation code for any number of OS and filesystem combinations that each may have their own corner cases, it makes me worry. It may seem like paranoia, but the checksum is how we know data is sent correctly. One reason I have to worry is HFS+, the default filesystem on Macs from 1998 to 2017, which changes Unicode filenames. It does "normalization" on Unicode. So, on that filesystem, metadata will change without the client author doing anything. So, the checksum may match when writing the data but not match when reading it back in. I may sound paranoid about some things, but the paranoia is not irrational.

The steps needed to add padding or deal with smaller blocks are probably less than 5 lines of code (an if condition, maybe a memcpy/memset, maybe a modulo somewhere). On the other hand, inventing a retransmit protocol and implementing that is likely much more than 5 lines of code.

It is not a retransmit protocol. It is repetition of a packet. We currently repeat vital packets. So, in the rare case that someone does flush a stream and the flush doesn't align on a block boundary, the extra fraction of a block gets put in a packet that we would repeat anyway. Yes, it is inefficient in terms of data usage.

I don't think the implementation you talked about is just "5 lines of code". It is 5 lines of code in one place. Data structures have to change, and more code has to test for partially filled blocks and decide, when you refer to block X, whether you mean the partially filled block or the completely filled block.

I will keep thinking about it.

A hash is "cryptographic", if it is difficult for an adversary to guess the input given the output. But we don't care about that. We care that the output is unique

Cryptographic hashes have a certain guarantee of uniqueness. The same cannot be said for non-cryptographic hashes. As such, using cryptographic hashes makes designs simpler, because the client author doesn't have to be concerned about handling different blocks with the same hash.

There are many kinds of hashes. There are non-cryptographic hashes that provide the uniqueness we want. When I say "unique", I mean it is hard for a random error to generate a duplicate hash. When I say "cryptographic", I mean that it is hard for an adversary, when given a hash value, to create a duplicate for that hash. Every cryptographic hash is also unique, but some hashes are unique without being cryptographic. For example, Rabin's fingerprint is unique without being cryptographic.

I personally have a preference for cryptographic hashes, as you don't necessarily know what context users will use the format in, and most will be unaware of how it works under the hood. Parchive may not be "secure" in all possible contexts, but eliminating one source of problem can sometimes be very helpful. Having said that though, I don't otherwise have any strong attachment to a crypto hash either, but if you're looking for the simplest solution, hashes with stronger guarantees can make a design easier.

I too prefer a cryptographic hash. But the non-cryptographic ones are probably much faster. Do we slow everyone else down for the users that want a cryptographic hash?

it didn't have good libraries at the time

I'm personally not too focused on availability of libraries - by the time anyone tries to implement the next Parchive, the situation may change. If the hash isn't too complex, one can just implement it oneself if necessary. Note that both MultiPar and ParPar use a custom MD5 implementation, despite the wide availability of MD5 implementations. par2cmdline doesn't use a library implementation of MD5 either - I don't know if the code was taken from an existing implementation, but the code doesn't attribute it to anything else, so I assume it was also specifically implemented for par2cmdline.

But popular hashes do have their advantages, including being more vetted/understood, and there are some PAR2 parsers which likely use library implementations. I just don't consider the lack of libraries at present to be a huge concern.

It's not so much about the libraries themselves, but that having many libraries is a sign that a hash is being used, tested, and will stick around. Client authors may or may not use the libraries, but they need to find out about how to implement the hash. That is, we don't want to pick an obscure hash that isn't documented well and doesn't have useful implementations.

My current idea for Par3 is to support filesystems up to 2^128 in size. (We are expected to exceed 2^64 by 2040.)

Exceeding 2^64 bytes (16EiB) by 2040 for typical servers or consumer systems sounds unlikely. Perhaps possible for some large-ish storage system with hundreds/thousands of disks.

I'm not sure about 2^128 though (the number is so large it's considered cryptographically secure) - I'd expect 2^80 bytes (= 1048576 EiB) to not be passable for a very very long time, and that includes large-ish storage servers.

I just extrapolated the size of a harddrive over time. Here's one graph of it, from Wikipedia.

I don't expect users to exceed 2^64 any time soon. In fact, I won't require clients to support it. But I figure it is good to design the spec around it, since it has been 20 years since I published the Par2 spec.

The support for 2^128 will be by supporting 2^64 blocks, with blocksizes up to 2^64. That seemed enough space in the number of blocks and blocksizes. I wasn't sure that 2^32 would be sufficient for either. And since the spec now requires values to be 8-byte aligned, it seemed like the right scale for the values.

mdnahas commented 3 years ago

I'm looking at what file attributes and file system features to support.

As far as file attributes, I thought of supporting: creation time, modification time (64-bit nanoseconds-since-Epoch), a "read-only" bit, and an "executable" bit.

I found out that UNIX file systems do not have a file creation time. What I thought was the creation time is actually the time that the metadata was modified. So, if you change the owner of a file, that time gets updated.

FAT file systems do not have an "executable bit". (It has "read-only", "archive", "hidden" and "system" bits.) Cygwin simulated the bit when "the filename ends with .bat, .com or .exe, or if its content starts with #!".

I also thought of supporting hard links and soft links (a.k.a., "symbolic links").

Hard links are supported by UNIX file systems and NTFS, but not other systems like FAT and exFAT.

Soft links are ... well, it seems like there are many flavors of soft links. UNIX file systems support one version. MacOS supports UNIX soft links and aliases, where the link is preserved even if the target file moves. NTFS has symbolic links, but it is very locked down and a link to a directory is different from a link to a file. Also, Windows has "shortcuts", which are stored in ".lnk" files. Cygwin simulates UNIX's soft links using these shortcut files. Lastly, there's the difference between relative and absolute paths for the soft links --- Windows' "shortcuts" only use absolute paths.

I think the simplest approach to all this is to just copy what ZIP does.

It looks like ZIP's only default file attribute is the last modification time. (It also has a bit for whether the file is text or binary, but I don't think that's important.)

ZIP has an optional set of attributes for most file systems, so FAT, NTFS, and Unix. E.g., for UNIX it stores: permissions (read/write/execute, but not setuid), file type (regular/symlink/device), last access time, last modification time (in UNIX time format), owner, and group. If the file type is a symlink or device, it also store the path or device number.

ZIP stores the UNIX permissions in the file, but for unzip to set the permissions, it needs to run as root (superuser). I'm not sure what happens when you unZIP a file with symlinks on a FAT file system. I'm also not sure how it translates permissions from one file system to another.

Sooooo.... what do we do?

I would like that if a user creates a PAR file on a file system and tries recovery on the same system, the metadata is preserved. I mean, most metadata is as important as the data itself. But I also want some basic support for permissions across different file systems. I could imagine someone distributing software or a code directory in a PAR file and it should work on multiple file systems. (BTW, the only file attribute supported by "git" is the executable bit, and it is horrible.)

My current thought is:

  1. support the default attributes I mentioned above (creation time, modification time, read-only, executable).
  2. support hard links

File systems that do not support the default attributes can find a way to approximate them. Systems that do not support hard links can just create duplicates of the files.

We add packets for file-system-specific attributes. E.g., UNIX file attributes, NTFS file attributes, etc.

For soft links, the link is either to another file inside the PAR file or to an external file. If it is to another file inside the PAR file, we replace the soft link with a hard link and we set a bit in the file-specific attributes indicating that it is really a soft link. If the linked-to file is not inside the PAR file, we replace the soft link with a file containing the path to the external file and set a bit in the file-system-specific attributes indicating that it is really a soft link. Because the path is file-system specific, we do not need to translate it to a machine-independent format. (That is, replace "\" with "/", etc.)

Does that make sense?

It does mean the per-file overhead might be pretty large. Packet headers are 64 bytes. The File packet's content is 88 bytes. A packet containing Unix file system attributes might have a content of 32 bytes. So, each file takes up 64+88 + 64+32 = 248 bytes ... plus the length of its filename. That's a lot. If people want to use this with lots of tiny files, they better ZIP or TAR them first.

animetosho commented 3 years ago

Regarding checksums, I think I see where you're coming from better now, though if all checksums in the Checksum packet are valid, and all mapping entries in the File packet are valid, I don't see any way to construct a file that wouldn't match a file checksum.

I guess I'm not really against having a file checksum per se - I mostly don't like how it was done in PAR2, where it seems rather pointless.

With a file hash though, for file updates, I presume the whole file needs to be rehashed, regardless of how small the change may be? I suppose that'll probably need to be done anyway, to find what parts of the file have changed (and probably more than just that file by itself, since a block hash can contain more than one file).

But what about an assurance-conscious client?

Well I'd hope a performance conscious client is also conscious about being accurate, so I'm going to assume you meant an application that focuses on being as safe as possible with performance not being a priority.

It's perhaps an interesting idea, but I can't really see such a thing ever being particularly practical.
Entertaining the thought though, I see two primary things you could try to assure against:

So I can only see it possibly helping with external problems (e.g. memory corruption). In the case of hashing, it'd only be able to detect if the data differed between the two read passes (hashing pass + recovery compute pass).
The possibility of reading the same data giving you something different is non-zero, but then, the possibility of the following code failing is also non-zero:

variable = 5
# double check that it really is 5
if variable != 5
    error: we've got corruption
# okay, I'm fairly certain the variable is now 5, let's continue

(and if doing two passes over the data doesn't give enough assurance, you could decide to do three, five or more, but it starts getting really silly beyond a point)

Considering that Parchive is designed to detect problems, I personally think it makes more sense to let the user decide how protected they want to be against external problems. In other words, build a performance-focused client, and let them do a verification pass after Parchive creation, if they want to be extra certain.

This is because I could imagine doing repairs on multiple blocksizes

If that's the plan, then introducing some block size for hashes makes things more complicated, though if the client has to deal with multiple block sizes, maybe not so much.

I may sound paranoid about some things, but the paranoia is not irrational.

I think I understand where you're coming from better now. Yeah, each system has their own corner cases and quirks, but I don't think the design should be built around the idea of trying to strictly match what's on the file system.
For example, many systems don't support changing a file's creation date, so if you store it expecting to always match, it likely won't happen.

Personally, I'm much more fearful of getting other aspects of the spec correct (such as handling file updates correctly) and the future possibility of dealing with multiple block sizes (and maybe GF sizes, even if the initial spec doesn't permit it). Metadata conversion is relatively quite low on the scary list, and can be covered easily with some test cases.

It is not a retransmit protocol. It is repetition of a packet.

Ah I see. So repeating packets in the next Parchive is still acceptable? Meaning that one could ignore the idea of Parchive protecting a Parchive?

Actually thinking about it, you'd probably have to rely on critical packet duplication, as trying to protect Parchive's own data seems tricky in a streaming context.

Data structures have to change

Depends on implementation I guess, but a client could always just do whatever internal translation it desires to not require any data structure changes.

Regardless, ignoring the need to duplicate packets, I'm not sure it's particularly simpler this way, as you'd specifically need logic to exclude the last block in recovery computation in both sender and receiver, as well as logic to determine whether a data block needs to be repeated.

whether you mean the partially filled block or the completely filled block.

It's an interesting idea to fall back on duplication, but I'm not sure you can avoid the notion of partially filled blocks in the specification, particularly considering the File context.

If the virtual file's size is not a multiple of the block size, you've got a partially filled block (and since data isn't embedded, you don't have the option of duplicating the data block - the partially filled block must be protected).
If a protected file is modified by only a few bytes, the changes may occupy a partially filled block.

Overall, I don't particularly mind partial blocks or duplicating the last data block - the former feels a bit more natural to me though.

Every cryptographic hash is also unique, but some hashes are unique without being cryptographic.

The point I was trying to make is that crypto hashes are a stronger degree of "unique" because no-one can demonstrate a case where they aren't, whilst the same cannot be said for a non-crypto hash. A non-crypto hash may be probabilistically unique, given random data, however real data generally isn't random.

Do we slow everyone else down for the users that want a cryptographic hash?

Depends on where you want to be on the performance vs assurance line perhaps? =)

Both BLAKE3 and K12 are claiming gigabytes per second, per core, on modern CPUs. I think the non-crypto hashes are at best, an order of magnitude faster.
For present day scenarios, BLAKE3/K12 seem fast enough in general, though I'd expect storage speeds to increase faster than CPU speeds, so can see value in a faster hash for the future.

Personally I'm still slanted towards a crypto hash, as long as its design isn't a performance limiter (like it is with PAR2's MD5). I'm not sure how fast LDPC can be in practice, but hashing is only part of the computation needed to be done, so the additional speed of a non-crypto hash won't have as much of an overall impact as the hash-only benchmarks would indicate.

I just extrapolated the size of a harddrive over time. Here's one graph of it, from Wikipedia.

Yeah, that graph definitely didn't extrapolate well given diminishing returns.
Taking the last 10 years shown (2000-2010), there's roughly a 100x density improvement from 10GB to 1TB. Extrapolating that out to 2010-2020, we should be at 100TB hard drives now.
The largest 3.5" hard drive I see for sale is 18TB, though the idea of 100TB in a drive today isn't absurd - it's just not something for the average consumer.

The support for 2^128 will be by supporting 2^64 blocks, with blocksizes up to 2^64. That seemed enough space in the number of blocks and blocksizes. I wasn't sure that 2^32 would be sufficient for either. And since the spec now requires values to be 8-byte aligned, it seemed like the right scale for the values.

If you're requiring 8 byte alignment, yeah, there's little reason to go below that.

But I'd imagine number of blocks to primarily be limited by the GF size. Presumably, this design will be limited to GF64?


modification time (64-bit nanoseconds-since-Epoch)

Nanoseconds seems oddly precise. Is there any filesystem which goes below microseconds?
Regardless, assuming the number is signed, that covers the years 1677-2262 which should be plenty, so eh.

I assume the times to be UTC normalised?

FAT file systems do not have an "executable bit".

Windows, in general, doesn't use the notion of an executable bit. Yeah, on NTFS/ReFS you can set an executable ACL, which I suppose acts like the execute bit, but it's not something that's typically used.
Generally, extensions determine whether a file is executable or not, as cygwin emulates.

I would like that if a user creates a PAR file on a file system and tries recovery on the same system, the metadata is preserved. I mean, most metadata is as important as the data itself

It's a good ideal, just don't get too stuck on the idea of preserving everything.
There's all sorts of complicated features that file systems support, from sparse files and reflinks to forks, and you can bet there'll be more to come.

I like to think there's a reason why most archive formats have limits on how much metadata they try to preserve, particularly with going across platforms.

If it is to another file inside the PAR file, we replace the soft link with a hard link and set a bit in the file-specific attributes indicating that it is really a soft link. If the linked-to file is not inside the PAR file, we replace the soft link with a file containing the path to the external file and set a bit in the file-system-specific attributes indicating that it is really a soft link.

What's the idea behind differentiating the two? I would've thought that you'd handle both cases like you would for an external file.

Because the path is file-system specific, we do not need to translate it to a machine-independent format

Did you mean "do need to translate" there?

The File packet's content is 88 bytes. A packet containing Unix file system attributes might have a content of 32 bytes. So, each file takes up 64+88 + 64+32 = 248 bytes

So there's two packet headers here? I'd have thought you could just stick to one packet.
You may also want to consider shortening the packet header if it's expected there'll be a lot of packets. For example, current PAR2 spec has 16 bytes just to signal the packet type, which doesn't seem entirely necessary.

That's a lot. If people want to use this with lots of tiny files, they better ZIP or TAR them first.

...except those formats still have to store the metadata somewhere - you've just moved it from Parchive to ZIP/TAR.

Yutaka-Sawada commented 3 years ago

Are you saying that if we have two versions of a file, we should store the difference between them and protect that? That is, create a file that is essentially bidirectional, allowing us to compute the second version given the first version and the first version given the second version?

It was just an experimental thought about a possible implementation. I didn't know how your appending system worked, so I thought of two simple ways: "storing a copy of updated blocks" or "storing the parity of differing blocks". If this idea isn't good (or is bad), please ignore it.

The paper mentions a 128-bit CRC. Is that standardized?

No, it's not standard currently, but it is a possible example. Though a 128-bit CRC could be made fast by using SSE's XMM registers, it may not be worth it. (A complex implementation may erase the advantage of a simple CRC.) Because modern CPUs support 64-bit integers, I thought a 64-bit CRC was good. But a maximum of 2^32 blocks (and a 32-bit CRC) will be enough for PAR3, while PAR2 still rarely fills even 2^16 blocks.

mdnahas commented 3 years ago

With a file hash though, for file updates, I presume the whole file needs to be rehashed, regardless of how small the change may be? I suppose that'll probably need to be done anyway, to find what parts of the file have changed (and probably more than just that file by itself, since a block hash can contain more than one file). Yes.

But what about an assurance-conscious client?

Well I'd hope a performance conscious client is also conscious about being accurate, so I'm going to assume you meant an application that focuses on being as safe as possible with performance not being a priority. Yes, that's what I meant.

Considering that Parchive is designed to detect problems, I personally think it makes more sense to let the user decide how protected they want to be against external problems. In other words, build a performance-focused client, and let them do a verification pass after Parchive creation, if they want to be extra certain. Given that Parchive is meant to repair problems, it seems odd to do another step to detect them.

This is because I could imagine doing repairs on multiple blocksizes

If that's the plan, then introducing some block size for hashes makes things more complicated, though if the client has to deal with multiple block sizes, maybe not so much. I believe CRC hashes can be computed on any block scale and stitched together.
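As a rough sketch of that stitching, here is how two independently computed block CRCs could be combined, using zlib's crc32/crc32_combine as a stand-in for CRC32C (an assumption for illustration; the combine trick is the same for any CRC, only the polynomial differs):

```cpp
#include <cstdio>
#include <cstring>
#include <zlib.h>

int main() {
    const char a[] = "first block ";
    const char b[] = "second block";
    const uInt la = static_cast<uInt>(std::strlen(a));
    const uInt lb = static_cast<uInt>(std::strlen(b));

    // CRCs of each block, computed independently.
    uLong crc_a = crc32(0L, reinterpret_cast<const Bytef*>(a), la);
    uLong crc_b = crc32(0L, reinterpret_cast<const Bytef*>(b), lb);

    // Stitch the two block CRCs together...
    uLong stitched = crc32_combine(crc_a, crc_b, lb);

    // ...and compare against continuing the CRC over the concatenated data.
    uLong crc_ab = crc32(crc_a, reinterpret_cast<const Bytef*>(b), lb);
    std::printf("stitched=%08lx direct=%08lx\n", stitched, crc_ab);
    return 0;
}
```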

I may sound paranoid about some things, but the paranoia is not irrational.

I think I understand where you're coming from better now. Yeah, each system has their own corner cases and quirks, but I don't think the design should be built around the idea of trying to strictly match what's on the file system. For example, many systems don't support changing a file's creation date, so if you store it expecting to always match, it likely won't happen.

Personally, I'm much more fearful of getting other aspects of the spec correct (such as handling file updates correctly) and the future possibility of dealing with multiple block sizes (and maybe GF sizes, even if the initial spec doesn't permit it). Metadata conversion is relatively quite low on the scary list, and can be covered easily with some test cases. OK.

It is not a retransmit protocol. It is repetition of a packet.

Ah I see. So repeating packets in the next Parchive is still acceptable? Meaning that one could ignore the idea of Parchive protecting a Parchive? A client could do either approach.

Actually thinking about it, you'd probably have to rely on critical packet duplication, as trying to protect Parchive's own data seems tricky in a streaming context. Even with Par-inside-Par, the inside Par would have critical packets that are not protected by Par: BlockGF, Matrix, Checksum, Root and File. The best policy is to duplicate these. Luckily, for the Par-inside case, I think these packets will be tiny. They probably all fit inside 4kB.

Every cryptographic hash is also unique, but some hashes are unique without being cryptographic.

The point I was trying to make is that crypto hashes are a stronger degree of "unique" because no-one can demonstrate a case where they aren't, whilst the same cannot be said for a non-crypto hash. A non-crypto hash may be probabilistically unique, given random data, however real data generally isn't random. I don't know the details of the Rabin Fingerprint yet. It is a non-crypto hash and might provide the guarantees you're talking about.

Do we slow everyone else down for the users that want a cryptographic hash?

Depends on where you want to be on the performance vs assurance line perhaps? =)

Both BLAKE3 and K12 are claiming gigabytes per second, per core, on modern CPUs. I think the non-crypto hashes are at best, an order of magnitude faster. For present day scenarios, BLAKE3/K12 seem fast enough in general, though I'd expect storage speeds to increase faster than CPU speeds, so can see value in a faster hash for the future.

Personally I'm still slanted towards a crypto hash, as long as its design isn't a performance limiter (like it is with PAR2's MD5). I'm not sure how fast LDPC can be in practice, but hashing is only part of the computation needed to be done, so the additional speed of a non-crypto hash won't have as much of an overall impact as the hash-only benchmarks would indicate. OK.

I have a worry about this, but I'll write it up below.

The support for 2^128 will be by supporting 2^64 blocks, with blocksizes up to 2^64. That seemed enough space in the number of blocks and blocksizes. I wasn't sure that 2^32 would be sufficient for either. And since the spec now requires values to be 8-byte aligned, it seemed like the right scale for the values.

If you're requiring 8 byte alignment, yeah, there's little reason to go below that.

But I'd imagine number of blocks to primarily be limited by the GF size. Presumably, this design will be limited to GF64?

I think you could have any GF size. The block size has to be a multiple of the GF size, but there is no restriction in the other direction. LDPC works using XOR, so technically you could do it with GF(2), which fits in a single bit! But, in practice, yeah, I imagine that if you're going to use large numbers of blocks, you probably want to use a larger GF. For Reed-Solomon it is required --- the GF has to have more elements than the number of blocks.
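To make the XOR point concrete, here is a minimal, purely illustrative sketch of GF(2) parity over equal-sized blocks (not the LDPC construction itself):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// GF(2) parity over equal-sized blocks: the parity block is the XOR of all
// data blocks.  Recovering one missing data block is the same operation:
// XOR the parity block with every surviving data block.
using Block = std::vector<uint8_t>;

Block xor_parity(const std::vector<Block>& blocks) {
    Block parity(blocks.at(0).size(), 0);
    for (const Block& b : blocks)
        for (size_t i = 0; i < parity.size(); ++i)
            parity[i] ^= b[i];
    return parity;
}
```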

The File packet's content is 88 bytes. A packet containing Unix file system attributes might have a content of 32 bytes. So, each file takes up 64+88 + 64+32 = 248 bytes

So there's two packet headers here? I'd have thought you could just stick to one packet. You may also want to consider shortening the packet header if it's expected there'll be a lot of packets. For example, current PAR2 spec has 16 bytes just to signal the packet type, which doesn't seem entirely necessary. I was thinking one packet for the generic File information. Then a second packet for the UNIX-specific file information.

Yeah, the packet header could be smaller. 64 bytes seems huge. The packet length has to be 8 bytes and the checksum has to be 16 bytes, but the other fields could shrink. If we make the Stream ID 8 bytes and combine the Magic sequence and packet type, we can probably get the header down to 40 bytes.
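One hypothetical arrangement of that 40-byte header (field order and sizes are my illustration of the idea above, not the spec):

```cpp
#include <cstdint>

// Hypothetical 40-byte packet header sketch:
// combined magic+type (8) + packet length (8) + Stream ID (8) + checksum (16).
#pragma pack(push, 1)
struct Par3PacketHeader {
    uint64_t magic_and_type;   // magic sequence combined with the packet type
    uint64_t packet_length;    // total packet length in bytes, including this header
    uint64_t stream_id;        // shrunk from 16 bytes to 8
    uint8_t  checksum[16];     // 16-byte hash of the packet
};
#pragma pack(pop)
static_assert(sizeof(Par3PacketHeader) == 40, "header should be 40 bytes");
```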

That's a lot. If people want to use this with lots of tiny files, they better ZIP or TAR them first.

...except those formats still have to store the metadata somewhere - you've just moved it from Parchive to ZIP/TAR.

Yes, but then we aren't to blame. hahahaha

mdnahas commented 3 years ago

I'm now thinking that file attributes are so annoying that Par's default should be to ignore them. Ignore hard links, soft links, and all that. Just store regular file data and filenames.

We can still support each filesystem's features using optional packets. Maybe include in the standard: "EXT4 filesystem attributes" packet, "FAT filesystem attributes", and "NTFS filesystem attributes". Those packets can handle the special features of each file system.

A related problem I'm thinking about is: how do I know I received all the optional packets that were sent by the sender?

I had originally thought that in the file-recovery context, the Root packet would represent "all the data". The Root would have a hash of the top-level directory, each directory contain a hash of all its contents, and each file would have a hash of its data. So, if you got the Root packet, you could verify that you got all the data.

But with the optional file attribute packets, how do you know you got all of them? Do they have a similar tree structure and, if you get the file attribute Root, then you have all of them? We cannot require that there is one for every File and Directory packet, because there may be more: hard links and soft links would not be associated with a File packet or Directory packet. I think we'll need to duplicate the entire tree structure.

mdnahas commented 3 years ago

So, I read about Rabin's fingerprint and it will not help us. Rabin's fingerprint is a CRC.

The problem Rabin was trying to solve is to find a substring. So, given a string S, does the string T appear in it and, if so, where? Rabin's algorithm creates a random rolling hash function, computes the hash of T, and then passes the rolling hash over the string S to find out where T is located in it. Because the rolling hash function is random, the algorithm runs in O(N) time with very high probability.

Rabin's fingerprint is the random rolling hash function. It is a CRC, created from a random irreducible polynomial.

Choosing a random CRC for each Par file doesn't gain us anything.
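For reference, a simplified Rabin-Karp-style search looks like the sketch below; a true Rabin fingerprint would replace the fixed base/modulus here with a randomly chosen irreducible polynomial over GF(2), i.e., a random CRC:

```cpp
#include <cstdint>
#include <string>

// Slide a rolling hash of |t| characters over s and compare hashes
// (then bytes) to find t.  Returns the first match offset, or -1.
long find_substring(const std::string& s, const std::string& t) {
    const uint64_t B = 257, M = 1000000007ULL;   // fixed base and modulus, for brevity
    const size_t n = s.size(), m = t.size();
    if (m == 0 || m > n) return m == 0 ? 0 : -1;

    uint64_t pow = 1, ht = 0, hs = 0;
    for (size_t i = 0; i < m; ++i) {
        if (i) pow = (pow * B) % M;              // B^(m-1) mod M, used to drop the old char
        ht = (ht * B + (uint8_t)t[i]) % M;       // hash of the pattern
        hs = (hs * B + (uint8_t)s[i]) % M;       // hash of the first window
    }
    for (size_t i = 0; ; ++i) {
        if (hs == ht && s.compare(i, m, t) == 0) return (long)i;  // verify on hash match
        if (i + m == n) return -1;
        hs = (hs + M - ((uint8_t)s[i] * pow) % M) % M;            // roll: drop s[i]
        hs = (hs * B + (uint8_t)s[i + m]) % M;                    // roll: add s[i+m]
    }
}
```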

animetosho commented 3 years ago

Given that Parchive is meant to repair problems, it seems odd to do another step to detect them.

It is odd, but it's one method of additional assurance.
Your suggestion of doing a hash pass separate from all processing doesn't seem to be terribly different.

I believe CRC hashes can be computed on any block scale and stitched together.

With CRC, yes, but if we're including a non-CRC hash, you'd have to make sure that's supported by it.

the inside Par would have critical packets that are not protected by Par: BlockGF, Matrix, Checksum, Root and File. The best policy is to duplicate these.

That's interesting - so what's the outer PAR ultimately trying to protect?

I think you could have any GF size

Sorry, what I actually meant was the max GF size was 64-bits. If you can't have more than 2^64 blocks, then you wouldn't be supporting GF128, for example. Well I suppose you could, but there'd be little benefit. (GF128 is unlikely to be useful anyway)

I was thinking one packet for the generic File information. Then a second packet for the UNIX-specific file information.

What about combining the two into a single packet? That reduces packet overhead, and also solves your optional packets problem.

mdnahas commented 3 years ago

Given that Parchive is meant to repair problems, it seems odd to do another step to detect them.

It is odd, but it's one method of additional assurance. Your suggestion of doing a hash pass separate from all processing doesn't seem to be terribly different.

I did not suggest that. I suggested that some client author might want to do a separate pass. Not all clients have to be implemented the same way.

I believe CRC hashes can be computed on any block scale and stitched together.

With CRC, yes, but if we're including a non-CRC hash, you'd have to make sure that's supported by it.

Since Rabin's Fingerprint has been eliminated, the only non-CRC hashes that we're talking about are cryptographic ones. Those would need to be applied using a Merkle tree or similar approach, because cryptographic hashes are specifically designed to not have that property.
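A sketch of what that Merkle-style composition could look like, with the actual crypto hash (K12, BLAKE3, or whatever is chosen) abstracted behind a callback; the odd-node convention is just one of several common choices:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hash each block to get the leaves, then hash pairs of hashes up to a
// single root.  `node_hash` stands in for a real collision-resistant hash.
using Hash = std::string;

Hash merkle_root(const std::vector<Hash>& leaves,
                 const std::function<Hash(const Hash&, const Hash&)>& node_hash) {
    if (leaves.empty()) return Hash();
    std::vector<Hash> level = leaves;
    while (level.size() > 1) {
        std::vector<Hash> next;
        for (size_t i = 0; i < level.size(); i += 2) {
            // An unpaired node at the end of a level is promoted unchanged.
            next.push_back(i + 1 < level.size() ? node_hash(level[i], level[i + 1])
                                                : level[i]);
        }
        level = std::move(next);
    }
    return level[0];
}
```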

the inside Par would have critical packets that are not protected by Par: BlockGF, Matrix, Checksum, Root and File. The best policy is to duplicate these.

That's interesting - so what's the outer PAR ultimately trying to protect?

The list of files and directories. Also the Matrix packets, if we're using LDPC, since there would be a lot of them.

I think you could have any GF size

Sorry, what I actually meant was the max GF size was 64-bits. If you can't have more than 2^64 blocks, then you wouldn't be supporting GF128, for example. Well I suppose you could, but there'd be little benefit. (GF128 is unlikely to be useful anyway)

The only benefit would be if performance was faster with a larger GF, but I don't think that's currently the case.

I was thinking one packet for the generic File information. Then a second packet for the UNIX-specific file information.

What about combining the two into a single packet? That reduces packet overhead, and also solves your optional packets problem.

There are many reasons. The biggest is that we'd have to design all the optional file-system specific data right now. So adding a new file system in the future would be hard. Second, some of the UNIX-specific file information can't be done that way (e.g., hard and soft links). And the file format's optional packets are really the place for that information.

Yes, the overhead is large but, if overhead is too large, users have an out. So we're not straitjacketing users.

animetosho commented 3 years ago

The list of files and directories. Also the Matrix packets, if we're using LDPC, since there would be a lot of them.

Hmm, so it sounds like packet duplication is the intended way to deal with corruption in the Parchive itself, but larger packets can use redundancy to reduce the need to duplicate these?

The biggest is that we'd have to design all the optional file-system specific data right now. So adding a new file system in the future would be hard.

I'd have thought you'd just stick in a type+size field for the metadata into the File packet. Adding a new system is just defining a new type+size, and then the fields within, so shouldn't be particularly hard.

Second, some of the UNIX-specific file information can't be done that way (e.g., hard and soft links).

I thought symlinks would have their own packet type, so no special requirement other than that is really necessary? You'd of course still add the type+size data to the symlink packet as described above.

Hardlinks are probably a bit finicky to deal with, though I don't see how either choice really changes how you'd approach it?

mdnahas commented 3 years ago

The list of files and directories. Also the Matrix packets, if we're using LDPC, since there would be a lot of them.

Hmm, so it sounds like packet duplication is the intended way to deal with corruption in the Parchive itself, but larger packets can use redundancy to reduce the need to duplicate these?

Yes. If someone backs up a few files, duplication is probably fine. If someone backs up a huge directory, they can do par-inside-par.

The biggest is that we'd have to design all the optional file-system specific data right now. So adding a new file system in the future would be hard.

I'd have thought you'd just stick in a type+size field for the metadata into the File packet. Adding a new system is just defining a new type+size, and then the fields within, so shouldn't be particularly hard.

Well, it limits you to a single file system's attributes. (I can think of distributions that want to be able to be decoded on both EXT4 and NTFS.) And it means including type and length information in the packet ... which is what packets are for. So, it's simpler to just have more packets, even if there's slightly more overhead.

Second, some of the UNIX-specific file information can't be done that way (e.g., hard and soft links).

I thought symlinks would have their own packet type, so no special requirement other than that is really necessary? You'd of course still add the type+size data to the symlink packet as described above.

Symlinks and hard links would be UNIX-specific, so they have to go in a UNIX-specific packet.

animetosho commented 3 years ago

Well, it limits you to a single file system's attributes. (I can think of distributions that want to be able to be decoded on both EXT4 and NTFS.)

Fair point. I can't actually see this being a common case (particularly since it'd be awkward to integrate this into a create client), but we'll see I suppose.

By the way, do you really intend to be specific to the point of filesystems like "ext4", or just a general "Unix common properties" concept?

Symlinks and hard links would be UNIX-specific, so they have to go in a UNIX-specific packet.

Has this been changed? You mentioned earlier that symlinks would be their own packet without mentioning it'd be Unix specific (and Windows does support symlinks/hardlinks anyway).

mdnahas commented 3 years ago

By the way, do you really intend to be specific to the point of filesystems like "ext4", or just a general "Unix common properties" concept?

I wanted to do it as EXT4. But I read the EXT4 disk layout specification and there's a bunch of details that I'm sure to get wrong. So, I'm writing it up as UNIX common properties. I think it will store everything from EXT4, but I'm not positive.

Symlinks and hard links would be UNIX-specific, so they have to go in a UNIX-specific packet.

Has this been changed? You mentioned earlier that symlinks would be their own packet without mentioning it'd be Unix specific (and Windows does support symlinks/hardlinks anyway).

Each file system does symlinks very differently. (E.g., Macs have 2 different kinds of symlinks.) They cannot all go in the same packet.

animetosho commented 3 years ago

From what I can tell, there's only a minor difference between them, so it could be catered to with just a boolean flag (which could simplify handling for non-macOS platforms), though a different packet is also an option.

mdnahas commented 3 years ago

I've written a very rough draft. It is still missing a few parts (sparse random matrices, NTFS and FAT metadata). But it has enough of the pieces nailed down to see what it will look like. It is at a good enough point to ask for feedback.

Par3_spec.txt

There are 2 basic changes from my earlier description.

First, streams are broken up into segments. Each time you append, you append a segment. There is a Segment Start packet and a Segment End packet. I changed the StreamID in every packet into a StreamSegmentID. (This was the RecoverySetID in Par2.) Segments and StreamSegmentIDs were necessary to handle the case that the same stream was appended twice (in parallel, not sequentially).

The second big change is getting rid of the BlockGF packet. The block size and Galois field are specified in every packet. This doesn't take up much more space and simplified a few things.

Read it and let me know what you think. I did my own read --- some open questions and some explanations are below.

Mike


Remaining thoughts:

Do we keep Par2's 16kb hash that was used to identify renamed files? Or will the first block in a file be enough to do that?

Should the maximum file size be 2^64 long or 2^128 long? That is, should file offsets be 8 bytes or 16 bytes? I think we want to keep up to 2^64 blocks and a block size of up to 2^64 bytes.

Packet header overhead is large at 64 bytes. We could shrink it (e.g., toward the 40-byte header discussed above).

Are comment packets a good thing? Or just a place for spam?

I dropped the BlockGF packet as unnecessary, but was forced to add a SegmentStart packet. We could put the Galois field and the block size in the SegmentStart packet. Having multiple Galois fields doesn't make sense to me, but multiple block sizes do. I'm thinking of repairing on a per-byte basis and a per-block basis at the same time. But I'm not sure even how that would work, with checksums and aligning blocks. Should we lock in a single blocksize and GF and put it in the SegmentStart packet?

The Data packet can only send multiples of 8-bytes. This seems like a limitation. Should we include a byte length inside the packet? Should we allow all packets to be any length, rather than force it to be a multiple of 8? (If we do, we can require padding between packets so that packets always start on an 8-byte aligned address.)

Should we allow compression of data in Data packets? We could include one very simple form of compression, like ZLIB. It's worth asking, but compression will never be our strength; we could only include the simplest and most common algorithms. Also, compression works better for larger blocks, while redundancy works better for smaller ones. Users will get far better compression than anything we would include by running a proper compression algorithm before running PAR. But it is worth asking the question.

I chose the words "rolling hash" and "fingerprint hash" to match rsync's language.

I'm still not decided on which hashes to use. CRC32C seems fine for a rolling hash, if we don't use a CRC for the fingerprint hash. KangarooTwelve (K12) seems fine. But we might look at something more composable, like a 128-bit CRC or our own variation of a Merkle-tree hash.

External Data checksums are for blocks in the input stream rather than in individual input files. That is, it's the checksum after the file data has been mapped into the stream. This makes sense if you expect files to reuse blocks in the stream; we'd store fewer hashes. But the rolling hash is better associated with files. This is because if we pack the trailing ends of multiple files into a single stream block, the rolling hash of the stream block is not useful for finding the ends of files. We can pack the trailing edges of files into their own blocks, but then we need to account for zero-filling the rest of the block when we try to match the rolling hash. Should File packets contain the rolling hash for the last block? I thought that most of our "trailing ends" will come from files that are smaller than 1 block and, for those, we have a hash of the complete file.

The SegmentEnd packet's purpose is to hold a single hash for the entire stream. When appending to a stream, I am assuming that we can take the K12 hash of the previous segment and calculate what it will be with the new segment's data appended. If not, we need a composable hash.

The design supports a wide range of Galois fields, but not all. Shuhong Gao's Fast Fourier transform based Reed-Solomon codes use an incompatible Galois field. (It has a prime number of elements.) LDPC usually uses a 1-bit Galois field, that is, XOR. XOR can be encoded as a Galois field, but clients will have an easier time making it run fast if XOR is explicitly encoded in the PAR file. Rather than supporting every possible GF(2^n), do we want to hardcode some values and leave it open to extension in the future? Do we want to say that when the GF length is 0, the code is XOR?

I believe some Fast Fourier transform based Reed Solomon codes require certain Cauchy matrices or use certain recovery block exponents. Do we know what those are?

The write up of the Cauchy matrix packet mixes computer integer operations and Galois field operations. That could be problematic to understand. It could also be easier to implement. Thoughts?

For the sparse random matrix, what random number generator should we use? I have looked at PCG-XSL-RR, which has a 128-bit state and generates 64-bit pseudorandom numbers. Another approach is to use the K12 hash to generate random numbers: e.g., the hash of 0 is the first random number, the hash of 1 is the second, etc. The seed can be the number to start at. This would take a lot more computation to generate random numbers. Thoughts?
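A sketch of that hash-of-a-counter approach, with splitmix64 standing in for the real hash (K12 or otherwise) purely for illustration; it is not a proposal for the spec:

```cpp
#include <cstdint>

// Stand-in "hash" so the sketch is self-contained; a real design would use K12.
static uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Random number i is hash(seed + i); the seed is "the number to start at".
struct CounterRng {
    uint64_t seed;
    uint64_t counter = 0;
    uint64_t next() { return splitmix64(seed + counter++); }
};
```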

Do we want to include pseudocode to describe how to generate the sparse random matrix? It may be easier and clearer than explaining the random number generator and shuffle algorithm in text. Or, do we want the separate reference implementation to be the best form of pseudocode?

The Recovery Data packet and the Data packet are different lengths. Do we want them to be the same length? I considered sending input blocks inside a Recovery Data packet, but the Recovery Data packet has to contain the number of blocks in the stream and that may not be known when we're sending input blocks. (NOTE: It gets the number of blocks in the stream from the Segment End packet, whose checksum is included in the Recovery Data packet.)

Recovery Data is only calculated on complete input blocks. That makes things easier when dealing with appending. I'm happy with the current design there, but I'm sure many people will want to comment on that.

The File packet holds a checksum for each part of the file that is protected, rather than a single checksum for the whole file. I'm okay with that, since the checksum of the File packet itself can serve as a checksum for the whole file.

I chose to store the filename in the File packet, rather than the Directory packet. This was to serve my unsaid goal that packets can be kept small. I wanted packets to be small enough that most can be sent in a single Ethernet packet. (Most Ethernet can support 1500 bytes of payload; the IP+UDP headers take up 28 or 48 bytes of the payload.) Or, small enough to fit in a 4kB disk block (which is the size on many disks). I felt that putting a lot of filenames into a Directory packet could make a really huge packet. I could have made my design goal explicit by limiting the size of PAR packets to a constant above the block size, but I decided that would be too constrictive to the design.

I do worry that the overhead when sending a single Data packet in an Ethernet frame (80 bytes out of 1472) is more than 5%. Even using External Data packets, the overhead is 16 bytes of hash for every 1472-byte frame, or more than 1%. Any application streaming over the Internet that's trying to use data efficiently will use multiple IP packets to carry each PAR Data packet. And that makes recovery inefficient, because recovery is most efficient on the scale of blocks. The very same argument holds for 4kB disk blocks. Even using a smaller 40-byte PAR header, the overhead would be 56 bytes of a 4096-byte block, or 1.4%. That's still pretty high. Each input block needs at minimum an 8-byte offset in the stream, a 16-byte hash of the packet, a 6-byte packet length, and at least 4 bytes of StreamSegmentID, so I don't see how the overhead can be much less than 34 bytes. But that would be less than 1% of a 4096-byte block. Should we try to cap the overhead of Data packets to less than 40 bytes? Or should we abandon these small scales?

Another measure of overhead is looking at the extra bytes for each file. That overhead is 128 bytes beyond the filename, and that's without counting the metadata, which could be 136 bytes more. That's not great. But if users have a lot of small files, they can use ZIP or TAR and compression.

The Root packet has a single bit to represent if the directory path is relative or absolute. I feel like we want to support absolute paths, even if security requires that, by default, clients do not write absolute paths. I think this bit is the best way to encode them. I didn't like the idea of encoding the root directory in a Directory packet.

For most strings, I used a 2-byte length + UTF-8 string + 0-to-7 bytes of padding. That is more than sufficient for filenames, which are 255 bytes on most systems. Those could be done with a 1-byte length. The 2-byte length is insufficient for xattr on some systems, which could be 64kB or even larger. I'm not sure if 2-byte lengths everywhere is a good compromise or the worst kind of compromise.
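As an illustration of that encoding (the byte order and padding rule here are my assumptions for the sketch, not the spec):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode a string as: 2-byte little-endian length, UTF-8 bytes, then 0-7
// zero bytes of padding so the next field stays 8-byte aligned.
std::vector<uint8_t> encode_string(const std::string& utf8) {
    std::vector<uint8_t> out;
    uint16_t len = static_cast<uint16_t>(utf8.size());
    out.push_back(static_cast<uint8_t>(len & 0xFF));
    out.push_back(static_cast<uint8_t>(len >> 8));
    out.insert(out.end(), utf8.begin(), utf8.end());
    while (out.size() % 8 != 0)   // pad length prefix + data to a multiple of 8
        out.push_back(0);
    return out;
}
```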

EXT4's i_mode variable is complex. I probably need to read about each bit in it and determine if (1) sending clients can read the values in order to write the UNIX File packet or UNIX Directory packet and (2) receiving clients can ever set the bit.

The UNIX Directory packet duplicates almost everything in the Directory packets. That wastes space. But I didn't see a good alternative, if I wanted the tree-of-checksums that (1) provides a checksum for the whole directory tree in one place and (2) lets the receiving client know if they got every packet holding UNIX metadata.

I have the sending client write both the UID integer and username string into the UNIX File/Directory packets. I'm not sure if I could get rid of the UID integer. I feel like there will be strange situations where it will be useful to have. E.g., if one of the files being repaired is the /etc/passwd file where the username to UID mapping is stored. Perhaps we only want to store usernames and let the writing client have an option to put the UID in the string for the username? I feel like that would lead to problems. Storing both is probably cleaner.

Can you think of any other kinds of attacks, where a hacker might employ a PAR file?

Should we keep the ".volXX-YY" and ".partXX-YY" that go before ".par3"?

As far as I know, no client implemented PAR2's optional packets nor the application-specific packets. Should we drop them?

Do we want to make a File packet with an empty filename refer to the PAR file itself, for "PAR inside" usage?

Even with "PAR inside PAR", there are packets that need to be repeated. Is there a way to minimize these? E.g., using default parameters or special packets types?

Yutaka-Sawada commented 3 years ago

Though I have not read or understood the whole text, I have some thoughts. Because my English reading skill isn't so good, I might have missed something.

The oddest thing to me is the alignment:

Conventions: The data is 8-byte aligned. This is because some memory systems function faster if 64-bit quantities are 8-byte aligned.

I thought that the 4-byte alignment of PAR2 packets was worthless. Now, I think that the 8-byte alignment of PAR3 packets will be worthless, too. I feel that a smaller PAR3 file size from smaller packets is more important than a slight speed-up from alignment.

If you remove the alignment restriction, you don't need useless padding bytes in most cases. Also, you will be able to use smaller integers, like 2 or 4 bytes instead of 8 or 16 bytes.

Major differences from PAR 2.0 are: drops hash of first 16kb that was used to identify files

I want CRC32 for the case of a very large block size. A CRC32 of the first 16kb (or less) will be enough to exclude different files, and it's only 4 bytes per input file. This CRC32 will also be usable as the rolling hash of small files (less than the block size). But it's difficult to find very small files inside an archive.

Packet Header 16 byte[16] Type. Can be anything.

I feel that 8 bytes will be enough, because the prefix "PAR 3.0\0" is 8 bytes. You may omit the prefix.

External Data Packet 16 {rolling hash, 12-byte fingerprint hash} A rolling checksum and finger print for each input block

I feel that the fingerprint hash should be 16 bytes. Because the fingerprint hash is 16 bytes in its other uses, it's natural to use the same size for input blocks.

File Packet The first value is an offset in the file. The second value is the length of the section to be mapped. The fourth value is the K12 hash of the data in the section.

I feel that the first value will always be zero and the second value will be the file size. It's strange to protect a file partially. When the first value is non-zero, there is no hash of the entire file, so a client cannot determine whether the file itself is complete or broken. Though it's possible to recover the file's mapped area, the status of the file is unknown.

Can you think of any other kinds of attacks, where a hacker might employ a PAR file?

Though this isn't a problem with the PAR specification itself, a DoS attack is possible at verification time. A PAR client should avoid the huge slowdown. I don't know why par2cmdline became vulnerable again.

Should we keep the ".volXX-YY" and ".partXX-YY" that go before ".par3"?

For MultiPar, I use the ".volXX+YY" format, which is used in QuickPar. XX is the starting block number and YY is the number of included blocks. This format makes it easy to see how many blocks are in each PAR file. For example, when each PAR2 file contains 10 recovery blocks: something.vol00+10.par2, something.vol10+10.par2, something.vol20+10.par2

When a user doesn't want the number of blocks in the name, they may use another ".vol_ZZ" format, where ZZ is just an index of the PAR files. When the number of blocks is ignorable: something.vol_1.par2, something.vol_2.par2, something.vol_3.par2

As far as I know, no client implemented PAR2's optional packets nor the application-specific packets. Should we drop them?

I think so, too. Then it's possible to reduce the size of the "Packet type" field.