Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Working on major Par2 changes. Name? #130

Open mdnahas opened 5 years ago

mdnahas commented 5 years ago

Hi everyone,

I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:

  1. Reed-Solomon encoding with James S. Plank's correction
  2. Tornado Codes by Luby

I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?

I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.

The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use a version of a linear congruential generator.

The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?

When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.

Mike

Yutaka-Sawada commented 3 years ago

I forgot to mention an important matter. A 32-bit rolling hash becomes useless when the number of blocks exceeds 2^32.

The rolling hash is just used to prevent running the unique hash on every window. So think of it as a 2^32 time speed up. ;)

The 2^32 time speed-up might be wrong.

PAR2's typical number of blocks is around 2000~4000, roughly 2^12. Because CRC-32 is a 32-bit rolling hash, the collision rate during a sliding search is 1 / 2^(32-12) = 1 / 2^20. So, it's a 2^20-times speed-up. Wow, it's 1048576 times faster !

PAR2's max number of blocks is 32768, which is 2^15. Because the collision rate is 1 / 2^(32-15) = 1 / 2^17, it's a 2^17-times speed-up. OK, it's still 131072 times faster ! There is no problem in PAR2.

For example, suppose PAR3's number of blocks is 2^30. Because the collision rate is 1 / 2^(32-30) = 1 / 2^2, it's only a 4-times speed-up. Huh, it's only 4 times faster !? This may be a problem.

When PAR3's number of blocks exceeds 2^32, the collision rate is 1 / 2^(32-32) = 1, always. While sliding byte by byte, it needs to re-calculate the fingerprint hash over a full block at every position. There is no speed-up anymore, and a sliding search becomes impractical. This must be a problem.

If you use CRC-32 as the rolling hash, the number of blocks should be 2^32 or less. That is why I posted that a max of 2^32 blocks would be enough for PAR3. If you want to allow more blocks, like 2^64, you must use CRC-64 for the speed-up. Otherwise, the sliding search will appear to freeze.
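
(For illustration, a tiny back-of-the-envelope program for this arithmetic, assuming the rolling hash values are roughly uniform; the block counts are just the examples above.)

    #include <cmath>
    #include <cstdio>

    int main() {
        const double hash_bits = 32.0;                 // CRC-32 as the rolling hash
        const double block_counts[] = { 4096.0,        // typical PAR2 (~2^12)
                                        32768.0,       // PAR2 maximum (2^15)
                                        1073741824.0   // hypothetical PAR3 (2^30)
                                      };
        for (double blocks : block_counts) {
            // Chance that a random position matches some block's rolling hash.
            double false_match_rate = blocks / std::pow(2.0, hash_bits);
            // Speed-up versus running the fingerprint hash at every position.
            double speedup = 1.0 / false_match_rate;
            std::printf("%11.0f blocks: ~%.0fx speed-up over hashing every window\n",
                        blocks, speedup);
        }
        return 0;
    }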

animetosho commented 3 years ago

Nice work in getting so far! Getting the spec to this level is certainly no small feat, so appreciate the persistence.

I haven't really read it much yet, will get back when I do.

Do we keep Par2's 16kb hash that was used to identify renamed files? Or will the first block in a file be enough to do that?

My understanding is that the first block may not correspond to a specific file (the block could have multiple files, or it could be the end of a file, etc), so it doesn't sound like the initial file hash is replaceable that way.

Should the maximum file size be 2^64 long or 2^128 long?

2^64. I don't think any common operating system supports files larger than that, or even volumes that exceed that size. Even support for the idea of a 128-bit integer in programming languages is flaky at best.
So much would need to change in the tech space to support >2^64 files (OSes, filesystems, APIs, many protocols and formats, etc.), and I see little movement to adopt anything larger, so I don't think PAR should try to be a pioneer here.

Packet header overhead is large at 64 bytes. We could:

A 4-byte magic seems too small, since you want it to be unique and easily identifiable. Then again, you could possibly ditch it completely and rely on the StreamSegmentID for magic detection. If you don't like your chances with that, you could combine the StreamSegmentID with a small magic field, I suppose.

Type could probably be reduced to one byte if you wanted to squeeze it down, as that should be more than enough to cover all packet types. The downside might be with supporting custom types, but I can't see that being common, and you could come up with a special scheme just for custom packets.

I can't see length being anything but 8 bytes if you're looking to support block sizes up to 2^64 (unless you want to introduce variable-length integers).

Are comment packets a good thing? Or just a place for spam?

I don't think they're commonly used, but I also don't really see any harm in them.

I'm thinking of repairing on a per-byte basis and a per-block basis at the same time. But I'm not sure even how that would work, with checksums and aligning blocks

It's an interesting concept, but I can't see how anyone would realistically do this, and for what purpose it'd serve.

As you suggest, I'm not even sure how this would exactly be implemented. I'd imagine it'd be the same as all blocks effectively being zero-padded to the length of the longest block?
I think a fixed block size already introduces enough difficulties (most existing PAR2 clients don't handle all allowed scenarios) such that making it even more complex might not be beneficial.

Should we allow compression of data in Data packets? We could use one very simple form of compression, like ZLIB. It's worth asking, but compression will never be our strength

I don't see the point in half-assing it. As long as it's easy for users to insert their own compressor (which should be trivial), there's really no need.

If you want, you could reserve some field to represent 'data transformation' or the like so that a future amendment could add it if it becomes necessary. I personally feel it's being a little too ambitious though.

Should File packets contain the rolling hash for the last block?

If the idea is for any block that a file doesn't fully occupy, it might be handy (though I can't quite see where it'd be used).
I don't see much benefit if it's only for the last block, given that other blocks may also be only partially filled by a file.

I am assuming that we can take the K12 hash of the previous segment and calculate what it will be with the new segment's data appended

I'm not sure, but since SHA3 prevents length extension attacks (that SHA2 and earlier are susceptible to), I doubt that's possible.

Rather than supporting every GF(2^8^?), do we want to hardcode some values and leave it open to extension in the future?

I see little point in requiring decoders support GF(2^1000000), so I definitely think limits on that should be put in place.

Do we want to say when GF length is 0, the code is XOR?

Unless 0 could have some other meaning, it makes sense to me.

The Recovery Data packet and the Data packet are different lengths. Do we want them to be the same length?

I don't see much reason to make them different, though I don't think allowing data packets to have more than one block will be particularly challenging to deal with.

I'm guessing the Recovery Data packet would need to be a multiple of 8 in addition to a multiple of the GF field width, which means GF(2^24) would require block sizes to be a multiple of 24 bytes (unless you allow the last 24-bit integer in the block to be zero padded).
If the Data packet had to be the same size, it'd carry the same multiple requirement (responding to your earlier assertion: forcing Data packets to be a multiple of 8 might be limiting).

Should we try to cap the overhead of Data packets to less than 40 bytes? Or should we abandon these small scales?

I don't think PAR is efficient for small blocks. If you were targeting such scales, I think the design would be a fair bit different (e.g. smaller checksums, length specifiers etc).

I'm not sure if 2-byte lengths everywhere is a good compromise or the worst kind of compromise.

If you're forcing 8 byte alignment, having 2 byte lengths seems inconsistent.

However, I think I've mentioned before that I think forced alignment/padding is 95% useless (and in fact, almost certainly slows things down despite intentions), but if you're going to do it anyway, it makes sense to at least be consistent everywhere.

But ignoring 8 byte alignment requirements, I think 2 byte length fields are generally fine for all cases you'd use variable length strings.

I feel like there will be strange situations where it will be useful to have. E.g., if one of the files being repaired is the /etc/passwd file where the username to UID mapping is stored.

I think that's a very out-of-field problem to try to solve, and doubt anyone can realistically solve it well anyway.

My opinion is to store one or the other - it makes little sense to store both (and in fact, may be undesirable to those looking to maintain privacy with distributing Parchives). I believe TAR only stores one, and I'd imagine clients would have to pick one over the other when writing files, which can cause inconsistencies when clients choose different strategies.

Should we keep the ".volXX-YY" and ".partXX-YY" that go before ".par3"?

Funnily enough, I don't think any PAR2 client actually implemented the naming scheme recommended in the specification.
(actually, that's a lie - I initially implemented my client using the spec defined naming scheme, which just caused confusion because no-one else used it, so later switched the default to be consistent with other clients)

I don't think the naming scheme is bad, but the de-facto standard has shifted, so you may as well adopt that.

As far as I know, no client implemented PAR2's optional packets nor the application-specific packets. Should we drop them?

I think some used the comment and unicode packets, but the latter were only necessary because encoding of strings was otherwise unclear (an issue PAR3 doesn't have).

I don't think anyone found a use case for the others, so they remain unimplemented.

E.g., using default parameters or special packet types?

The notion of forced-default parameters is interesting. PAR-in-PAR sounds complex to me, but if metadata can be protected with a fixed GF/matrix/block-size, then no further metadata is required to represent metadata protection.

This means that if enough metadata-recovery packets are found, the PAR's metadata can be recovered without needing to worry about critical packets.

Whilst the flexibility in selecting how recovery computation is done is an interesting feature of PAR3, I'm not sure metadata protection needs the same level of flexibility.


In regards to the name, I feel that PAR3 could cause some confusion, given that it's listed on Wikipedia and there are pages like these. It might be better to pick a different name to avoid ambiguity.

mdnahas commented 3 years ago

The most odd thing I felt is alignment;

Conventions: The data is 8-byte aligned. This is because some memory systems function faster if 64-bit quantities are 8-byte aligned.

I thought that 4-byte alignment of PAR2 packets was worthless. Now, I think that 8-byte alignment of PAR3 packets will be worthless, too. I feel that a smaller PAR3 file size from smaller packets is more important than a slight speed-up from alignment.

I remember having a discussion about this before. Most protocols that I know have integers aligned. This is because most old processors could not handle unaligned accesses. I believe all recent CPUs can handle unaligned accesses. If I recall correctly, on a few systems, unaligned accesses are significantly slower, like a 2-times slow down. (Because the CPU performs 2 memory accesses to read or write the value.)

If an alignment issue is going to slow things, it really only matters on our tightest loops. These are things like hashing, inverting the matrix, generating the recovery blocks, and copying data to and from disk.

I don't think the hashing and copying matter. On systems with an alignment penalty, those functions are usually alignment aware. For example, the copy function will copy data byte-by-byte until it reaches an 8-byte aligned address and then copy 8-byte words.

The other bottlenecks happen on input blocks and recovery blocks and use Galois field operations. I think we should keep those aligned. So, let's keep the requirement that the block size must be a multiple of 8 bytes. (I doubt many people will use 24-bit Galois fields.) I will put a note in the specification on how to make sure the payloads of the Data packet and Recovery Data packets are aligned, if any client author wants to do that.

If you remove the data-alignment restriction, you don't need useless padding bytes in most cases. Also, you will be able to use smaller integers, like 2 bytes or 4 bytes instead of 8 bytes or 16 bytes.

The current draft uses 2-byte integers. Do you mean variably-sized integers? E.g., only use 1-byte to store the packet's length if it is fewer than 256 bytes?

Major differences from PAR 2.0 are: drops hash of first 16kb that was used to identify files

I want CRC32 for the case of a very large block size. CRC32 of the first 16kb (or less) will be enough to exclude different files. It's only 4 bytes per input file. This CRC32 will be usable as the rolling hash of small files (less than block size), too. But it's difficult to find very small files within an archive.

Ok. I'll add it back in.

16 byte[16] Type. Can be anything.

I feel that 8-bytes will be enough. This is because the prefix "PAR 3.0\0" is 8-bytes. You may omit the prefix.

Agreed.

File Packet The first value is an offset in the file. The second value is the length of the section to be mapped. The fourth value is the K12 hash of the data in the section.

I feel that the first value will always be zero and the second value will be the file size. It's strange to protect a file partially. When the first value is non-zero, there is no hash of the entire file, so a client cannot determine whether the file itself is complete or broken. Though it's possible to recover the file's mapped area, the status of the whole file is unknown.

Yes, in 99.999% of cases, the offset will be zero and the length will be the file size.

The only reason this exists is for the "PAR inside" feature. If we do "PAR inside ZIP", the protected regions are the compressed files at the front of the file and the directory structure at the end. It does not cover the space in the middle where we'll put the PAR packets. The PAR packets cannot protect themselves. We cannot put the hash of the PAR packets inside the packets themselves. That is impossible with a cryptographic hash.

Can you think of any other kinds of attacks, where a hacker might employ a PAR file?

Though this isn't a problem with the PAR specification itself, a DoS attack is possible at verification time. A PAR client should avoid the huge slow-down. I don't know why par2cmdline became vulnerable to it again.

Interesting. I wouldn't call it a "DoS attack", but closer to an infinite loop. The program just spins, making almost no progress towards completion.

It is definitely a problem. How can we avoid it? Or is this something that client authors need to be aware of?

It happens when a lot of data matches a rolling hash, but not the associated fingerprint hash(es). The most common version of that is going to be lots of repeated data. I don't think we can ban users from repeating data in a file. ;)

This may just be something that client authors need to be aware of. I'll add it to the spec.

Should we keep the ".volXX-YY" and ".partXX-YY" that go before ".par3"?

For MultiPar, I use the ".volXX+YY" format, which is used by QuickPar. XX is the starting value and YY is the number of included blocks. This format makes it easy to see how many blocks are in each PAR file. For example, when each PAR2 file contains 10 recovery blocks: something.vol00+10.par2, something.vol10+10.par2, something.vol20+10.par2

When a user doesn't want the number of blocks, a client may use another ".vol_ZZ" format, where ZZ is just an index of the PAR files. When the number of blocks can be ignored: something.vol_1.par2, something.vol_2.par2, something.vol_3.par2

I will change it to follow the ".volXX+YY" de facto standard.

mdnahas commented 3 years ago

I forgot to mention an important matter. A 32-bit rolling hash becomes useless when the number of blocks exceeds 2^32.

The rolling hash is just used to prevent running the unique hash on every window. So think of it as a 2^32 time speed up. ;)

The 2^32 time speed-up might be wrong.

...

When PAR3's number of blocks exceeds 2^32, the collision rate is 1 / 2^(32-32) = 1, always. While sliding byte by byte, it needs to re-calculate the fingerprint hash over a full block at every position. There is no speed-up anymore, and a sliding search becomes impractical. This must be a problem.

If you use CRC-32 as the rolling hash, the number of blocks should be 2^32 or less. That is why I posted that a max of 2^32 blocks would be enough for PAR3. If you want to allow more blocks, like 2^64, you must use CRC-64 for the speed-up. Otherwise, the sliding search will appear to freeze.

Yes, you're right. I was wrong. If the rolling checksum matches any of the precomputed ones, we have to run the fingerprint hash.

Perhaps we should switch to a 128-bit CRC for the rolling hash? That is what we would need for 2^64 blocks. It would eliminate the 12-byte version of the fingerprint hash. (We can still keep the cryptographic hash for the stream and files.)

mdnahas commented 3 years ago

Nice work in getting so far! Getting the spec to this level is certainly no small feat, so appreciate the persistence.

Thanks!

Do we keep Par2's 16kb hash that was used to identify renamed files? Or will the first block in a file be enough to do that?

My understanding is that the first block may not correspond to a specific file (the block could have multiple files, or it could be the end of a file, etc), so it doesn't sound like the initial file hash is replaceable that way.

If the first block is the same for multiple files, it is likely that their 16kB hashes are the same too.

Should the maximum file size be 2^64 long or 2^128 long?

2^64. I don't think any common operating system supports files larger than that, or even volumes that exceed that size. Even support for the idea of a 128-bit integer in programming languages is flaky at best. So much would need to change in the tech space to support >2^64 files (OSes, filesystems, APIs, many protocols and formats, etc.), and I see little movement to adopt anything larger, so I don't think PAR should try to be a pioneer here.

Most file systems limit individual files to at most 2^64 bytes, but some filesystems can handle up to 2^128 bytes of underlying storage. (Think 2^64 files, each 2^64 bytes long.) My early estimate was that disk drives would be hitting 2^64 bytes by 2040, and our users might want to protect those.

I'll consider limiting files to 2^64 bytes. Going larger than 2^64 bytes seems impossibly huge to me now, but so did a 10 MB hard drive in 1984 and a 650 MB CD-ROM in 1990. My computers now have 1 TB SSDs and I feel those are too small!

Packet header overhead is large at 64 bytes. We could:

A 4-byte magic seems too small, since you want it to be unique and easily identifiable. Then again, you could possibly ditch it completely and rely on the StreamSegmentID for magic detection. If you don't like your chances with that, you could combine the StreamSegmentID with a small magic field, I suppose.

That's what I was thinking. A 4-byte magic next to a 4-byte StreamSegmentID.

Using text for the magic is kinda worrying now. But the string "PAR3" or "PAR\0" is pretty rare in most text.

Type could probably be reduced to one byte if you wanted to squeeze it down, as that should be more than enough to cover all packet types. The downside might be with supporting custom types, but I can't see that being common, and you could come up with a special scheme just for custom packets.

Yes. If we went to 1-byte, I would probably drop custom packets.

I'm thinking of repairing on a per-byte basis and a per-block basis at the same time. But I'm not sure even how that would work, with checksums and aligning blocks

It's an interesting concept, but I can't see how anyone would realistically do this, and for what purpose it'd serve.

Ok.

Should we allow compression of data in Data packets? We could use one very simple form of compression, like ZLIB. It's worth asking, but compression will never be our strength

I don't see the point in half-assing it. As long as it's easy for users to insert their own compressor (which should be trivial), there's really no need.

If you want, you could reserve some field to represent 'data transformation' or the like so that a future amendment could add it if it becomes necessary. I personally feel it's being a little too ambitious though.

I'll skip it. We can add it in the future with a new packet type, if anyone wants it.

I am assuming that we can take the K12 hash of the previous segment and calculate what it will be with the new segment's data appended

I'm not sure, but since SHA3 prevents length extension attacks (that SHA2 and earlier are susceptible to), I doubt that's possible.

Oh, that's right. MD5 adds the length of the file to the end of the input stream, before outputting the hash. Crap.

I guess we're going to have to get inventive there, if we want to append a small bit of data without recomputing the hash for everything that came before it.

Rather than supporting every GF(2^8^?), do we want to hardcode some values and leave it open to extension in the future?

I see little point in requiring decoders support GF(2^1000000), so I definitely think limits on that should be put in place.

Ha!

I feel like writing the code to handle any GF is pretty easy. If we hardcode a few values, it locks us in.

Do we want to say when GF length is 0, the code is XOR?

Unless 0 could have some other meaning, it makes sense to me.

Hmmm... I didn't think ahead about what that means for the Matrix packets. Let me work on this.

The Recovery Data packet and the Data packet are different lengths. Do we want them to be the same length?

I don't see much reason to make them different, though I don't think allowing data packets to have more than one block will be particularly challenging to deal with.

I'm guessing the Recovery Data packet would need to be a multiple of 8 in addition to a multiple of the GF field width, which means GF(2^24) would require block sizes to be a multiple of 24 bytes (unless you allow the last 24-bit integer in the block to be zero padded). If the Data packet had to be the same size, it'd carry the same multiple requirement (responding to your earlier assertion: forcing Data packets to be a multiple of 8 might be limiting).

Yes, but I don't think people will choose weird GF sizes. They'll probably be 1, 2, 4, or 8 bytes.

Should we try to cap the overhead of Data packets to less than 40 bytes? Or should we abandon these small scales?

I don't think PAR is efficient for small blocks. If you were targeting such scales, I think the design would be a fair bit different (e.g. smaller checksums, length specifiers etc).

Ok.

I'm not sure if 2-byte lengths everywhere is a good compromise or the worst kind of compromise.

If you're forcing 8 byte alignment, having 2 byte lengths seems inconsistent.

No. I've seen plenty of 4-byte aligned protocols use smaller integers. A 2-byte integer just has to be 2-byte aligned (and the whole packet has to end on an 8-byte aligned boundary).

However, I think I've mentioned before that I think forced alignment/padding is 95% useless (and in fact, almost certainly slows things down despite intentions), but if you're going to do it anyway, it makes sense to at least be consistent everywhere.

But ignoring 8 byte alignment requirements, I think 2 byte length fields are generally fine for all cases you'd use variable length strings.

Ok.

I feel like there will be strange situations where it will be useful to have. E.g., if one of the files being repaired is the /etc/passwd file where the username to UID mapping is stored.

I think that's a very out-of-field problem to try to solve, and doubt anyone can realistically solve it well anyway.

My opinion is to store one or the other - it makes little sense to store both (and in fact, may be undesirable to those looking to maintain privacy with distributing Parchives). I believe TAR only stores one, and I'd imagine clients would have to pick one over the other when writing files, which can cause inconsistencies when clients choose different strategies.

GNU's version of TAR stores both. And there are files that map the UIDs and usernames from the system the archive was written on to the system the archive is being restored on. It's confusing. https://www.gnu.org/software/tar/manual/html_node/override.html

Should we keep the ".volXX-YY" and ".partXX-YY" that go before ".par3"?

Funnily enough, I don't think any PAR2 client actually implemented the naming scheme recommended in the specification. (actually, that's a lie - I initially implemented my client using the spec defined naming scheme, which just caused confusion because no-one else used it, so later switched the default to be consistent with other clients)

I don't think the naming scheme is bad, but the de-facto standard has shifted, so you may as well adopt that.

I will use the de-facto standard.

As far as I know, no client implemented PAR2's optional packets nor the application-specific packets. Should we drop them?

I think some used the comment and unicode packets, but the latter were only necessary because encoding of strings was otherwise unclear (an issue PAR3 doesn't have).

I don't think anyone found a use case for the others, so they remain unimplemented.

Ok. I may drop them.

E.g., using default parameters or special packet types?

The notion of forced-default parameters is interesting. PAR-in-PAR sounds complex to me, but if metadata can be protected with a fixed GF/matrix/block-size, then no further metadata is required to represent metadata protection.

This means that if enough metadata-recovery packets are found, the PAR's metadata can be recovered without needing to worry about critical packets.

Whilst the flexibility in selecting how recovery computation is done is an interesting feature of PAR3, I'm not sure metadata protection needs the same level of flexibility.

Yeah. I'll see what I can do.

In regards to the name, I feel that PAR3 could cause some confusion, given that it's listed on Wikipedia and there are pages like these. It might be better to pick a different name to avoid ambiguity.

This thread was started for the discussion of the name. I thought we were okay with PAR3. I guess I'll have to think on that again. If we go with "PAR4", the version after it will have to be "PAR8" then "PAR16" ...

Yutaka-Sawada commented 3 years ago

Do you mean variably-sized integers?

No. I think that a 2-byte or 4-byte integer is possible for less varied data like the Packet Type.

Though a variable-length quantity is good for small packets, it may be complex. So, I don't suggest using variably-sized integers. You seem to favor simple construction.

How can we avoid it? Or is this something that client authors need to be aware of?

In PAR2, a client author can solve this problem by disabling the sliding search for a while. For example, I implemented a collision counter for the sliding search in my PAR2 client. When collisions happen too many times in a region, it skips that region and goes to the next one. The threshold collision rate is up to the developer.
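
(For illustration, a rough sketch of that idea. The region size and threshold are arbitrary placeholders, and the two callbacks are hypothetical stand-ins for the client's rolling-hash table lookup and fingerprint check; this is not MultiPar's actual code.)

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Sliding search with a collision counter: when too many rolling-hash
    // matches in a region fail the fingerprint check, skip to the next region.
    std::vector<std::size_t> sliding_search(
            std::size_t data_len, std::size_t block_size,
            const std::function<bool(std::size_t)>& rolling_matches,
            const std::function<bool(std::size_t)>& fingerprint_matches) {
        const std::size_t region_size = 1 << 20;   // skip granularity (placeholder)
        const std::size_t max_collisions = 64;     // tolerated false matches per region

        std::vector<std::size_t> found;
        std::size_t region_start = 0;
        std::size_t collisions = 0;

        for (std::size_t pos = 0; pos + block_size <= data_len; ++pos) {
            if (pos - region_start >= region_size) {   // entered a new region
                region_start = pos;
                collisions = 0;
            }
            if (!rolling_matches(pos))
                continue;
            if (fingerprint_matches(pos)) {
                found.push_back(pos);
                pos += block_size - 1;                 // resume after the matched block
            } else if (++collisions >= max_collisions) {
                pos = region_start + region_size - 1;  // give up on this region
            }
        }
        return found;
    }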

Perhaps we should switch to a 128-bit CRC for the rolling hash?

No, I don't think so, because a 128-bit CRC would be slow to calculate. Using two 64-bit hashes may be faster than a 128-bit CRC: for example, a tuple of CRC-64 and a 64-bit multiplicative hash. Each hash value can be calculated independently with 64-bit integers. But I don't know whether the resulting hash value is random enough to distinguish many blocks. CRC-64 (as the rolling hash) plus a well-known 128-bit fingerprint hash would be safer.

Algorithm of simple multiplicative hash;
Hash_Value = (Hash_Value + Input_Byte) * PRIME_BASE

Example of 64-bit multiplicative hash;
PRIME_BASE = 1099511628211, (2^40 + 2^8 + 0xb3 from FNV)
Input bytes = [0x01, 0x02, 0x03, 0x04, 0x05]
All calculation in 64-bit integer (mod 2^64 automatically), 2 ^ 64 = 18446744073709551616
1099511628211 ^ 4 = 1099511628211 * 1099511628211 * 1099511628211 * 1099511628211 = 11527715348014283921

Hash value of the first 3 bytes = ((1 * 1099511628211 + 2) * 1099511628211 + 3) * 1099511628211 = 626081712147646998
Slide 1 byte from the first 3 bytes = (626081712147646998 + 4) * 1099511628211 - 1 * (1099511628211 ^ 4) = 1251204650155683229
Slide 1 byte again = (1251204650155683229 + 5) * 1099511628211 - 2 * (1099511628211 ^ 4) = 1876327588163719460
This is the same as the hash value of the last 3 bytes = ((3 * 1099511628211 + 4) * 1099511628211 + 5) * 1099511628211 = 1876327588163719460
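
(The same recurrence as a minimal C++ sketch; unsigned 64-bit arithmetic provides the mod 2^64 wrap-around for free. With the input bytes above, hashing the first 3 bytes and sliding twice should reproduce the worked values.)

    #include <cstddef>
    #include <cstdint>

    const uint64_t PRIME_BASE = 1099511628211ULL;   // 64-bit FNV prime

    // H = b[0]*P^n + b[1]*P^(n-1) + ... + b[n-1]*P  (mod 2^64)
    uint64_t hash_window(const uint8_t* data, std::size_t n) {
        uint64_t h = 0;
        for (std::size_t i = 0; i < n; ++i)
            h = (h + data[i]) * PRIME_BASE;
        return h;
    }

    // P^(n+1), needed to remove the outgoing byte when sliding an n-byte window.
    uint64_t pow_base_n_plus_1(std::size_t n) {
        uint64_t p = 1;
        for (std::size_t i = 0; i <= n; ++i)
            p *= PRIME_BASE;
        return p;
    }

    // Slide the window one byte: drop `out`, append `in`.
    uint64_t slide(uint64_t h, uint8_t out, uint8_t in, uint64_t p_n_plus_1) {
        return (h + in) * PRIME_BASE - out * p_n_plus_1;
    }
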
animetosho commented 3 years ago

Most protocols that I know have integers aligned.

Common archive formats I've seen don't do any alignment. For example, ZIP, RAR, TAR, 7Z, BZip2 - all of these have misaligned integers.
Some formats like CAB seem to be all aligned, but it seems to be more coincidence rather than deliberate, and they don't go to the effort of inserting padding to ensure alignment (so will become misaligned if the variable length fields don't match up).

The other bottlenecks happen on input blocks and recovery blocks and use Galois field operations. I think we should keep those aligned.

I think a difference in opinion here is that on-disk format does not (and rarely does) have to match an in-memory representation. No portable implementation can make any assumption about alignment (or even endianness, for that matter), and decoders have to handle inserted/deleted bytes, so can't make any assumptions there either.

Probably repeating myself here, but even if the on-disk format has to match in-memory, 8 byte alignment is almost never sufficient for a high performance SIMD implementation, as they often prefer >=16 byte alignment (ignoring something like ARM's SVE, which throws the whole notion of alignment out the window). My SSE2 implementation of the GF16 kernel for PAR2, for example, requires 16 byte alignment + 256 byte stride multiple, so 8 byte alignment provides no benefit (and becomes less relevant for newer SIMD, e.g. AVX512 prefers 64 byte alignment).

If the first block is the same for multiple files, it is likely that their 16kB hashes are the same too.

What I meant was that the first block's hash may contain multiple files concatenated.
I think you were considering making files start on block boundaries, but the current spec sounds like it doesn't require this (only required if a section exceeds 1 block in length).

Consider this example: the block size is 8KB, and there are two files, each exactly 12KB in size, containing no duplicate data. We could break the first file into 4+8KB sections, and the second into 6+6KB, and interleave them when mapping to the virtual file. Since no section exceeds the block size, there's no block alignment required, which also means that you can't get a block hash that contains data from only one file.

Yes, but I don't think people will choose weird GF sizes. They'll probably be 1, 2, 4, or 8 bytes.

I actually think GF(2^24) is quite attractive. It allows more blocks than GF16, without being as slow as GF32. Beyond GF32, there may not be much reason to use anything less than GF64.

GNU's version of TAR stores both. And there are files that map the UIDs and usernames from the system the archive was written on to the system the archive is being restored on.

Thanks for the info. My (possibly incorrect) understanding is that the TAR format doesn't require both, but a client can optionally do it.
Having both could be beneficial, but it might be worth giving direction on which should be used when recovering. TAR's ability to use custom mapping files seems like the start of a rabbit hole of supporting edge cases.

This thread was started for the discussion of the name. I thought we were okay with PAR3. I guess I'll have to think on that again

Sorry about mentioning it so late - I generally leave naming things til the end.

On the flip side, I don't think the "existing PAR3" has gained enough traction to matter that much (and most references to it are fairly old/outdated at this point).
In other words, I don't think calling it PAR3 is a major issue, but I thought I'd point it out nonetheless.

mdnahas commented 3 years ago

Do you mean variably-sized integers?

No. I think that a 2-byte or 4-byte integer is possible for less varied data like the Packet Type.

Oh, I thought you were talking about a different field in the header. Yeah, for Packet Type, we can do 1 byte (or less!).

Though a variable-length quantity is good for small packets, it may be complex. So, I don't suggest using variably-sized integers. You seem to favor simple construction.

I definitely prefer simple construction. :)

I was also against variable-sized integers because they throw off alignment. But since we're (mostly) giving up on alignment, they are now an option.

Having thought it over, I don't think variable sized integers are worth doing. We want large hashes for input files and for the entire set of input files. We'd love smaller hashes for blocks, but I don't think we can get them. Blocks need a hash with at least twice as many bits as the block count requires (to avoid the Birthday Problem) and, actually, larger than that so that we can assume the blocks are unique. So, even if we had a very small number of blocks --- say 2,000 --- then we'd need 2 bytes to hold block indexes and much more than 4 bytes for block hashes. So, block hashes would have to be at least 8 bytes. And I don't think we save that much by going from 16 bytes to 8 bytes. If we could have gotten away with a 4-byte hash for blocks, it would have been worth it.
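
(A quick sketch of that birthday-bound reasoning, using the standard approximation p ≈ n(n-1)/2^(b+1) for the chance that any two of n uniform b-bit hashes collide; the sample numbers are only illustrative.)

    #include <cmath>
    #include <cstdio>

    // Approximate probability that any two of n random b-bit hashes collide.
    double birthday_collision(double n, double bits) {
        return n * (n - 1.0) / std::pow(2.0, bits + 1.0);
    }

    int main() {
        std::printf("2000 blocks, 32-bit hash: p ~ %g\n", birthday_collision(2000, 32));
        std::printf("2000 blocks, 64-bit hash: p ~ %g\n", birthday_collision(2000, 64));
        return 0;
    }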

How can we avoid it? Or is this something that client authors need to be aware of?

In PAR2, a client author can solve this problem by disabling the sliding search for a while. For example, I implemented a collision counter for the sliding search in my PAR2 client. When collisions happen too many times in a region, it skips that region and goes to the next one. The threshold collision rate is up to the developer.

Okay. I'll just add it to specification as a thing to look out for.

Perhaps we should switch to a 128-bit CRC for the rolling hash?

No, I don't think so, because a 128-bit CRC would be slow to calculate. Using two 64-bit hashes may be faster than a 128-bit CRC: for example, a tuple of CRC-64 and a 64-bit multiplicative hash. Each hash value can be calculated independently with 64-bit integers. But I don't know whether the resulting hash value is random enough to distinguish many blocks. CRC-64 (as the rolling hash) plus a well-known 128-bit fingerprint hash would be safer.

Ok.

If we have to split it, I think I'd rather go with a CRC and a cryptographic hash. It makes it harder for someone to manufacture a block with the same hash value.

mdnahas commented 3 years ago

If we have to split it, I think I'd rather go with a CRC and a cryptographic hash. It makes it harder for someone to manufacture a block with the same hash value.

Oh, wait, we wanted a 128-bit rolling hash to avoid the collisions that lead to the slowing down. Could we reuse the 64-bit CRC by feeding in a transformed version of the data?

Or we can pick an algorithm from this rolling hash library.

BTW, I feel like misalignment is a rare issue. I wish we could just do block-aligned checksums. I know about a concept called content-based slicing that can handle less-common alignment issues. But I haven't figured out how to include it yet. Content-based slicing usually works best with near-random data, so data that's been compressed or encrypted. So it may never work for us.

mdnahas commented 3 years ago

Common archive formats I've seen don't do any alignment. For example, ZIP, RAR, TAR, 7Z, BZip2 - all of these have misaligned integers. Some formats like CAB seem to be all aligned, but it seems to be more coincidence rather than deliberate, and they don't go to the effort of inserting padding to ensure alignment (so will become misaligned if the variable length fields don't match up).

Compression formats are the exception. They care about saving every byte and don't care about execution speed, so they ignore alignment.

The other bottlenecks happen on input blocks and recovery blocks and use Galois field operations. I think we should keep those aligned.

I think a difference in opinion here is that on-disk format does not (and rarely does) have to match an in-memory representation. No portable implementation can make any assumption about alignment (or even endianness, for that matter), and decoders have to handle inserted/deleted bytes, so can't make any assumptions there either.

Probably repeating myself here, but even if the on-disk format has to match in-memory, 8 byte alignment is almost never sufficient for a high performance SIMD implementation, as they often prefer >=16 byte alignment (ignoring something like ARM's SVE, which throws the whole notion of alignment out the window). My SSE2 implementation of the GF16 kernel for PAR2, for example, requires 16 byte alignment + 256 byte stride multiple, so 8 byte alignment provides no benefit (and becomes less relevant for newer SIMD, e.g. AVX512 prefers 64 byte alignment).

A good point. I'll add that to the note in the specification on alignment.

If the first block is the same for multiple files, it is likely that their 16kB hashes are the same too.

What I meant was that the first block's hash may contain multiple files concatenated. I think you were considering making files start on block boundaries, but the current spec sounds like it doesn't require this (only required if a section exceeds 1 block in length).

Consider this example: the block size is 8KB, and there are two files, each exactly 12KB in size, containing no duplicate data. We could break the first file into 4+8KB sections, and the second into 6+6KB, and interleave them when mapping to the virtual file. Since no section exceeds the block size, there's no block alignment required, which also means that you can't get a block hash that contains data from only one file.

I don't know why you would interleave them when mapping the stream (a.k.a. single virtual file). Unless blocks overlap or we're doing "PAR inside", I expect files to be mapped in a single piece.

And if the files are smaller than a single block, you hash the complete file and check the File packets.

Yes, but I don't think people will choose weird GF sizes. They'll probably be 1, 2, 4, or 8 bytes.

I actually think GF(2^24) is quite attractive. It allows more blocks than GF16, without being as slow as GF32. Beyond GF32, there may not be much reason to use anything less than GF64.

Ok. The current version of the spec is fine with GF(2^24). I don't think restricting blocks to be a multiple of 24 bytes is very restrictive.

This thread was started for the discussion of the name. I thought we were okay with PAR3. I guess I'll have to think on that again

Sorry about mentioning it so late - I generally leave naming things til the end.

On the flip side, I don't think the "existing PAR3" has gained enough traction to matter that much (and most references to it are fairly old/outdated at this point). In other words, I don't think calling it PAR3 is a major issue, but I thought I'd point it out nonetheless.

Ok. I'm going with PAR3 for the moment.

Yutaka-Sawada commented 3 years ago

I still think that the max number of blocks should be 2^32, instead of 2^64. In the age of PAR1, the number of blocks was in the 2^8 range. In the age of PAR2, it was in the 2^16 range. So, the 2^24 range is plausible for PAR3 (256 times more than PAR2). Setting the max as 2^32 will be enough for current usage.

I don't know what will happen in the far future. I just think that PAR3 should be designed for now (or the near future). If future users want more than 2^32 blocks, that will be the time to make the next PAR4 with 2^64 blocks.

animetosho commented 3 years ago

Compression formats are the exception. They care about saving every byte and don't care about execution speed, so they ignore alignment.

The part about saving bytes is largely true, but I strongly disagree that they don't care about execution speed. Performance is a major concern with compression (otherwise everyone would just use something like PAQ), particularly with how widely and frequently it's used.

I can't think of many commonly used binary formats that aren't really "compression formats" (text formats, of which there are many, obviously don't care about alignment).
What formats/protocols are you thinking of that force alignment by explicitly inserting padding?

I don't know why you would interleave them when mapping the stream (a.k.a. single virtual file). Unless blocks overlap or we're doing "PAR inside", I expect files to be mapped in a single piece.

It doesn't seem useful, but if the spec allows it, a reader must consider and support it.

The possibility could result from appending data, though I don't see it as likely.


Setting max number as 2^32 will be enough for current usage.

I also think 2^32 is more than enough for typical PAR use cases.

I do wonder though, if having more blocks is less expensive (via sparse matrix encoding, for example), it could make sense to have relatively small blocks.
Due to overheads, it probably can't be too small, but something like a 4KB block size doesn't seem too silly.
So for the way PAR is currently used, 2^32 is plenty, but if the new design changes how people use it, I can see benefit in 2^64. (correct me if my assumptions are wrong)

mdnahas commented 3 years ago

I still think that the max number of blocks should be 2^32, instead of 2^64. In the age of PAR1, the number of blocks was in the 2^8 range. In the age of PAR2, it was in the 2^16 range. So, the 2^24 range is plausible for PAR3 (256 times more than PAR2). Setting the max as 2^32 will be enough for current usage.

I don't know what will happen in the far future. I just think that PAR3 should be designed for now (or the near future). If future users want more than 2^32 blocks, that will be the time to make the next PAR4 with 2^64 blocks.

I specifically designed Par2 so that it would go beyond Par1's usage at the time. Par1 was only for Usenet and only when paired with RAR or a similar program to chop up files. Par2 was used for multi-DVD backups, protection on a single drive, and more. I don't know all the uses for Par3, so I've tried to keep it as expansive as possible, while adding features for our current users.

I agree that 2^32 blocks is probably way more than any current user needs. But if you chop a 16 TB hard drive into 4kB blocks, I believe you get 2^32 blocks. (2^44 / 2^12 = 2^32) Given that Par2 lasted 20 years, I think going to 2^64 blocks is appropriate.

mdnahas commented 3 years ago

Compression formats are the exception. They care about saving every byte and don't care about execution speed, so they ignore alignment.

The part about saving bytes is largely true, but I strongly disagree that they don't care about execution speed. Performance is a major concern with compression (otherwise everyone would just use something like PAQ), particularly with how widely and frequently it's used.

I can't think of many commonly used binary formats that aren't really "compression formats" (text formats, of which there are many, obviously don't care about alignment). What formats/protocols are you thinking of that force alignment by explicitly inserting padding?

IP V4. The optional headers in an internet protocol packet are variable size. The specification requires padding to make sure the IP header is 4-byte aligned. Any protocol running on top of IP V4, like UDP and TCP, can be sure that their values (like 2-byte port numbers and 4-byte sequence numbers) are aligned. IP V4 Specification

I don't know why you would interleave them when mapping the stream (a.k.a. single virtual file). Unless blocks overlap or we're doing "PAR inside", I expect files to be mapped in a single piece.

It doesn't seem useful, but if the spec allows it, a reader must consider and support it.

Touché.

Yes, a file could map only part of a block and we would want to find the file.

I do wonder though, if having more blocks is less expensive (via sparse matrix encoding, for example), it could make sense to have relatively small blocks. Due to overheads, it probably can't be too small, but something like a 4KB block size doesn't seem too silly. So for the way PAR is currently used, 2^32 is plenty, but if the new design changes how people use it, I can see benefit in 2^64. (correct me if my assumptions are wrong)

I agree.

mdnahas commented 3 years ago

Here are my edits so far. Par3_spec.txt

A big open item is decisions on the rolling and fingerprint hashes. But I think I found a bigger issue.

One of the first steps in repair is to search for good input blocks. The current design has the checksums (both rolling and fingerprint) with the stream blocks. And if a stream block is fully contained in a file, that's fine. We can pass a rolling hash over the file and find the block. But if a stream block only contains part of a file (e.g., the trailing end of a file) or part of multiple files (if we pack the ends of multiple files into a single block), then we are not going to be able to find the good input blocks.

It definitely feels wrong to have the rolling checksums be on the stream blocks. The rolling checksums are meant to detect alignment issues in the files. Those are definitely in the wrong place.

I kinda like a checksum on the stream blocks, since that's how we detect which input stream blocks need to be recovered.

I've also been wondering if we could get by with fewer checksums. I tried discussing it with a theoretical computer scientist I know, but we didn't get much further than agreeing it was mathematically possible. Perhaps I should talk to a mathematics expert.

But the simple solution is to move the rolling checksum to the files, like in Par2. And it's probably simplest to follow Par2 and only have file-block checksums and not stream-block checksum.

Thoughts?

animetosho commented 3 years ago

IP V4.

Actually, IPv4 forcing alignment makes a lot of sense, as it needs to be designed to work on microcontrollers, NICs, routers etc and isn't an on-disk format. One might argue that back in 1981, when it was designed, these use cases weren't thought of (which is why IPv4 address exhaustion is a problem), but such an old design is a little out of touch with technology changes in the last 4 decades. TCP/UDP are probably largely in the same boat.

If compression formats are an exception, I'd say networking has its own unique challenges that don't meld so well with PAR (even if there's a streaming format).
Still, if you look at more modern networking protocols, you'll see they often don't bother with forcing alignment. Examples:

But if a stream block only contains part of a file (e.g., the trailing end of a file) or part of multiple files (if we pack the ends of multiple files into a single block), then we are not going to be able to find the good input blocks

My guess is that it should be possible under certain circumstances, as the reader will often have some idea of where the data should be. It could also be less of a rolling hash in some cases, if it can assume the file boundaries are correct.
If the reader can identify which files the block contains, it shouldn't even need to check the rolling hash, but if it can't, perhaps it could try to brute force the problem.

For example, say a block contains the last 50 bytes of file A, followed by the first 40 bytes of file B. A reader could assume that the file boundaries are fixed, which means that it doesn't actually need to roll. Instead, it could compute the rolling hash of the last 50 bytes and first 40 bytes of every file, and see which two combine to give the target rolling hash.
In such an instance, the rolling hash isn't really used for rolling, but used for its ability to concatenate.
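
(A sketch of that combining trick. zlib offers the same idea for CRC-32 via crc32_combine; the code below uses the multiplicative rolling hash described earlier in the thread, where H(A || B) = H(A) * P^len(B) + H(B) mod 2^64. Whether PAR3's final rolling hash composes this way is still an open question.)

    #include <cstddef>
    #include <cstdint>

    const uint64_t PRIME_BASE = 1099511628211ULL;   // 64-bit FNV prime

    // H(data) = data[0]*P^n + ... + data[n-1]*P  (mod 2^64)
    uint64_t hash_piece(const uint8_t* data, std::size_t n) {
        uint64_t h = 0;
        for (std::size_t i = 0; i < n; ++i)
            h = (h + data[i]) * PRIME_BASE;
        return h;
    }

    // Combine: H(A || B) = H(A) * P^len(B) + H(B). A reader can hash the tail
    // of file A and the head of file B separately, then test whether the
    // combination matches a block's stored rolling hash.
    uint64_t combine(uint64_t hash_a, uint64_t hash_b, std::size_t len_b) {
        uint64_t p = 1;
        for (std::size_t i = 0; i < len_b; ++i)
            p *= PRIME_BASE;    // P^len(B) mod 2^64
        return hash_a * p + hash_b;
    }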

If there are multiple files, this could get more problematic due to brute force complexity. If this is due to files being fully contained in the block, a reader could use the fingerprint hash to simplify combinations needed to be checked during a brute force search.

Regardless, this does add complication to decoding/repairing.

As for my opinion at this stage: this single virtual file mapping idea is challenging to deal with (decode side) and introduces a bunch of corner cases.
I think you mentioned a reason for the virtual file is to support deduplication; I personally don't think it's that important a feature (it's quite limited in what can be achieved, and if space is a concern, compression does a better job at dedupe). The other main benefit of packing files together would be to reduce overhead of zero padding blocks on end-of-file boundaries.

I do like the simplicity of block aligned files in PAR2, as it's much easier to deal with and think about. So if you don't consider deduplication important, the choice would be whether you want to support tight packing at the expense of a more complex design.

I've also been wondering if we could get by with fewer checksums

Checksums on input data or packets?
On input data, I think there's currently three checksums (block rolling, block fingerprint, file fingerprint) plus a checksum for the initial segment of a file.
If you want a rolling and crypto-hash, you probably can't reduce this below two checksums.

Yutaka-Sawada commented 3 years ago

8 StreamSegmentID All packets that belong together have the same StreamSegmentID.

Because StreamSegmentID has the same usage as the "Recovery Set ID" in PAR2, it would be good for it to be 16 bytes, too. Even if 8 bytes would be enough as a globally unique value, it's less random than PAR2's 16 bytes. This isn't a matter of being enough or not: users will expect PAR3 to be better than (or at least equal to) PAR2.

8 byte[8] Packet type. Can be any value.

It's good to put the Packet type in 8 bytes. Then I can keep it in a single 64-bit (8-byte) integer, so it doesn't need to be an ASCII null-terminated string. Comparing as a 64-bit integer is easier (simpler and faster) than comparing strings.

For example, "PAR DAT\0" becomes 0x0054414420524150 (as little endian). For this comparison, the last "\0" isn't required to be a null terminator. You may use the last character freely, such as "PAR DATA" (0x4154414420524150).

I've also been wondering if we could get by with fewer checksums.

When I thought about PAR3 a while ago, I came up with a simple way. It's possible to classify a file's slices into two classes: "indexed input file slices" and "following input file slices". An "indexed input file slice" has a rolling hash to locate its position in a damaged file and a fingerprint hash to check integrity. A "following input file slice" has a fingerprint hash only and follows after an indexed slice. When a PAR client finds an "indexed input file slice", it checks the following slices.

In PAR2, each "input file slice" has the same checksums (CRC-32 and MD5). Those slices are all indexed, like [index][index][index][index]. With two classes, slices are arranged like [index][follow][index][follow]. Because PAR3 will have more blocks than PAR2, there is no need to locate every slice independently: after the client finds an index by sliding search, the positions of the following slices are known.

There may be several "following input file slices" after an indexed slice. Because following slices have a fingerprint hash only, the total size of all slice checksums can be smaller. When there are many slices, the difference may be noticeable. For example, suppose there are 2^30 slices. The slices' 128-bit hashes consume 16 GB of space. If every slice also had a 64-bit rolling hash, that would add another 8 GB. If each indexed slice has 7 following slices, the number of rolling hashes becomes 1/8 of that, and the total checksum size becomes 17 GB instead of the full 24 GB.

In this method, the rolling hash doesn't need to be calculated over the full block size; for example, the first 16 kb of a slice may be enough. By setting a fixed, short window size, the sliding search becomes faster. The rolling hash acts as an index into the many slices. Because the fingerprint hash is what gets calculated for most data on the first pass, it should be as fast as the rolling hash.

it's probably simplest to follow Par2 and only have file-block checksums and not stream-block checksum.

You are right. I didn't consider the stream usage. (I'm the developer of an application for file usage.)

A packet with each file's information should keep the checksums of its input file slices. Also, it's possible to map several slices into one block, and you can map several very small files into one block, too. Each file's information may have special items for the last small slice. They are like the following:

Number of normal-size slices: file size / block size
Starting index of blocks: where the normal-size slices are placed in the block stream
Size of the last small slice: file size mod block size (if this is 0, no following items)
Index of block: which block the small slice is placed in
Offset in the block: where the small slice is placed within the above block
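
(Written out as a struct for illustration; the field names are mine, not from any draft.)

    #include <cstdint>

    // Per-file slice-mapping items, as proposed above.
    struct FileSliceMap {
        uint64_t full_slice_count;    // file size / block size
        uint64_t first_block_index;   // where the normal-size slices start in the block stream
        uint64_t tail_slice_size;     // file size mod block size; 0 means no tail items follow
        uint64_t tail_block_index;    // which block holds the last small slice
        uint64_t tail_offset;         // where the small slice sits within that block
    };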

animetosho commented 3 years ago

Went through the spec, and listed some things that poked my curiosity.

Every byte of a PAR file is specified. There are no places to throw junk bytes that can be any value.

Actually, with randomness being accepted in places, this isn't entirely true.
Overall, I think the statement isn't really necessary and can be confusing, because a recovery format is meant to be able to handle junk bytes thrown at it.

16 fingerprint hash K12 Hash of packet. Used as a checksum for the packet

I believe K12 prefers hashes to be 256-bit in size, but supports any length.
16 bytes feels a little short for a cryptographic hash. If 32 is too big, what about 24 bytes?

the contents of the creator packet MUST be shown to the user
It is REQUIRED that the client get user approval
Clients are also REQUIRED to ask for permission when linking to files outside a subdirectory

I'd replace MUST/REQUIRED with SHOULD. The above is only possible if there actually is a user (something an automated system may not have).

[Creator] UTF-8 text identifying the client, options, and contact information

As the options and contact info are only recommended, I presume there's no suggested way of delimiting them in the string?
I would've thought that the options would at least be a separate string, so that if a client displays the creator, it wouldn't have to spam the user with a long list of options as well.

[Segment start] The first field identifies the starting location of the segment in the stream. If this is the first segment in a stream, the starting index is set to 0. Otherwise, the starting index MUST equal the stream's length at the end of the preceding segment.

This seems to make sense in the streaming context, but for file context, where files can be mapped to arbitrary locations, what does the starting location refer to?

8 unsigned int The size of the Galois field in bytes.

I still think a limit should be placed on GF size. I get that it may seem limiting, but despite how long ECC has been around, I don't think anyone has come up with any legit use case for >GF128, and going beyond GF64 seems highly questionable. Being too flexible is also limiting, as it restricts assumptions developers can make.
Sticking to <=GF64 is kinda nice, because anything larger than int64 is problematic for many languages, and handling >GF64 becomes more difficult because you can no longer assume values fit in a register.

Setting a limit would also mean you won't need 8 bytes to hold the width. If you don't set a limit, I'd expect client authors to set their own limits anyway, meaning the spec gets subverted by a de-facto limit.

[Data Packet] 8 unsigned int The block index within the input stream.

How does this relate to the value in the stream start packet? Is it an offset of that number (i.e. always starts at 0), or is it relative to the beginning of the stream?
I'm guessing the External Data packet's value follows a similar principle.

Data packets contain an entire block's worth of data. If the data's length is less than the block size, the rest of the block will be filled with zero bytes. [...]
NOTE: This packet is only used to send a complete block of data. If a stream segment ends mid-block

If the Data packet can only send complete blocks, how can the packet size be less than the block size?

[External data] 16*? {rolling hash, 12-byte fingerprint hash} A rolling checksum and finger print for each input block

I think the K12 hash size should be consistent everywhere. 12 bytes seems weird if it's 16 bytes everywhere else.

[Segment End] If this is the first segment in the stream, it is the K12 hash of 16 bytes of zeros followed by the data in the segment.

Is this still the case if the "previous segment's" hash was randomly generated?

Also, does the hash include the trailing data, or only blocks included in Data packets? If operating in file mode, I presume there is never any trailing data as everything is handled by External Data packets, so will be a concatenation of all data?

This hash does put an upper limit of performance achievable though - if K12 maxes out at, say, 3GB/s, you'll never be able to create a PAR3 faster than 3GB/s, regardless of how fast the disk or however many CPU cores are available. With PAR2, a client could at least hash files in parallel to try to circumvent this limit.

More than one [matrix packet] type can be used at the same time

Can the same type be used multiple times?

Note: The 8-bit Galois field is supported by the x86 instruction GF2P8MULB.

I did some more research and found that the 0x11b polynomial isn't primitive so generally not preferred for RS coding.
Most implementations of GF(256) I've seen use 0x11d (which can still be accelerated by GFNI, but not via the multiply instruction).

Note: All 64-bit Galois fields are supported by the x86 instruction CLMUL.
Note: The ARM processor has an instruction extension called "NEON" with a VMULL instruction. VMULL.P8 will do eight 8-bit Galois field multiplications at once.

Thought I'd point out that neither of these instructions do reduction, which is why they're called "carryless multiplication" (x86) or "polynomial multiplication" (ARM) as opposed to GF multiplication.
They can be used for reduction, but often require a few rounds. The lack of efficient reduction can often be a performance limiter on approaches that rely on these instructions.

[Cauchy] The hint to the number of recovery blocks is used in single-pass situations to allocate buffers. If the number of rows is unknown, the hint is set to zero.

It might be worth making the hint a little stricter, for example, the number is the highest number of recovery blocks that the encoding application knows is available, assuming no corruption occurs (e.g. if the creator is making 100 recovery blocks, the number here should be 100).

How does this work with split PAR files? Should the value be different for every volume?

Otherwise, the matrix's element for input block I and recovery block R depends on I and R. (NOTE: This specification uses zero-index vectors, so I and R start at 0.) Specifically, it is the multiplicative inverse of x_I-y_R, where x_I is the Galois field element with the same bit pattern as binary integer I+1 and y_R is the Galois field element associated with the binary integer MAX-R, where MAX is the maximum binary integer value with the same size as the Galois field. (NOTE: In binary, MAX contains all ones.) To be clear, the multiplicative inverse and subtraction x_I-y_R are done using Galois field arithmetic. The I+1 addition and MAX-R subtraction is done using native integer arithmetic.

So to put it another way, the coefficient is inv( (I+1) ^ (~0-R) ) = inv( (I+1) ^ ~R ) = inv( (-2-I) ^ R ) where inv is the GF inverse and ^ denotes XOR?

[Sparse Random] 8 unsigned int maximum number of recovery blocks

I'm not familiar with sparse random matrices, but does this mean that additional recovery cannot be generated after initial creation?

[Explicit matrix] For each input block that is used to calculate the recovery block, there is a pair of the index of the input block and its Galois Field factor.

If data is being appended (i.e. more input blocks), does this block need to be updated?

For each input block that is used to calculate the recovery block, there is a pair of the index of the input block and its Galois Field factor. The pairs are in sorted order, with input block indices increasing from lowest to highest.

I presume the inputs not specified are presumed to be zero?

[Recovery Data] When calculating recovery data, only complete input blocks are used. (This is to prevent having two different values for a input block when a stream is appended to.)

In the streaming context, the trailing data doesn't have any associated recovery, but in the file context, it does, right?

?*8 data The recovery block data

I'm guessing the recovery block doesn't need to be a multiple of 8 here.

[File] Sections of the file are mapped to the input stream. Each section is specified by 4 values. The first value is an offset in the file. The second value is the length of the section to be mapped.

Are we limiting files to 2^64? If so, the second number should be 8 bytes long.

Root Packet

If this points to a directory, what should that directory's name be (if relative paths are being used)?

The checksum of the Checkpoint packet identifies the data in the stream

I don't see any Checkpoint packet type defined?

It is REQUIRED that the user's permission be granted for any action that might jeopardize their security.

A client won't know what might jeopardize security. Maybe writing a text file causes some external application to buffer overflow and do who knows what - you never know.
As such, I think the suggestion there is too strong. You can encourage good security practice, but a client can't predict everything, and this spec cannot dictate how an application chooses to deal with their users (assuming there are any).

[Unix File] 8 signed int atime, nanoseconds since the Epoch

I'm guessing this is UTC from what's specified in the directory version (it's omitted from the file specification).

Also, the UNIX directory packet doesn't have xattrs specified.

If any UNIX File packet is present in a stream segment, then there is a UNIX File packet for every File packet

This could feel limiting if files happen to be spread across multiple filesystems.

There is a checksum for every UNIX File packets for each file in the directory

An alternative may be to only list files that haven't been listed in the regular directory packet (avoids relisting the entire folder).

UNIX Root Packet

Is this necessary? From what I can gather, the root should be determinable via regular Root -> Directory <- Unix Directory linking (or Root -> File <- Unix file, if the root is a single file).

There's also a question about how attributes of the root should be handled if the fact that it's a directory is largely irrelevant.

users be warned when they create PAR files with names that are incompatible with Windows, Mac, or Linux systems. That is, file or directory names that are more than 255 characters long, start with a period (.) or a dash (-), or contain one of these characters: < > : " ' ` ? * & | [ ] \ ; or newline (\n).

Are there actually any systems that don't like the following?

* names that start with a `.` or `-` (other than those named '.' or '..')
* name containing ' ` & [ ] ;

The list of characters restricted by Windows is \ / : * ? " < > | (plus probably all ASCII characters < 32).

On Windows, an absolute path can start "C:\" or "//" for example

UNC paths start with "\\"

For UNIX, that means one starting with "/" or "//".

"//" starts with "/" so the latter doesn't need to be stated.

using a feature like ".." in a path

Is there any reason to even allow "." or ".." as a name? The only case where I can see that making sense and being present would be paths for symlinks or hardlinks. Directories/files should otherwise never be named as such, so a decoder could just mark those as invalid upon seeing them.

In order for a client to do single-pass recovery, the recommended order of packets for the stream use-case is:

The way the Explicit Matrix packet is defined, I'm not sure it can be placed before Segment End, as it requires knowing the number of input blocks?

To make PAR inside work, the PAR packets need to contain one File packet, which refers to the file itself. Thus, the name in the File packet must match the name of the file

I liked the idea of using a blank filename instead, as it allows the file to be renamed.


Some questions:

* How to handle duplicate files/directories (perhaps placed under different directories) if the packet's fingerprint hash is identical for each copy?
* Does the decoder need to know whether a protected file is complete or not? At the moment, the only indicating factors would be if there's no mapping for the first byte, or there are visible holes in the mapping. If the file is incomplete because the end isn't mapped, I don't think the client has any way of knowing.
* Would you happen to know anything about composite GF fields?

mdnahas commented 3 years ago

Still, if you look at more modern networking protocols, you'll see they often don't bother with forcing alignment. Examples:

* [QUIC](https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.1) - starts off with a 1 byte type, making all subsequent multi-byte values misaligned
* [uTP](http://bittorrent.org/beps/bep_0029.html) - the header is aligned, but extensions aren't required to be padded. Only thing that needs to be padded would be the selective ACK, which makes sense to ease programming (since bitfields need to be padded anyway)
 ...

True.

Right now, the spec has no alignment requirements. The spec says that alignment may be an issue for some clients. And clients can, if they choose, insert 0-byte padding between packets. It is completely optional.

But if a stream block only contains part of a file (e.g., the trailing end of a file) or part of multiple files (if we pack the ends of multiple files into a single block), then we are not going to be able to find the good input blocks

My guess is that it should be possible under certain circumstances, .... ... Regardless, this does add complication to decoding/repairing.

As for my opinion at this stage: this single virtual file mapping idea is challenging to deal with (decode side) and introduces a bunch of corner cases. I think you mentioned a reason for the virtual file is to support deduplication; I personally don't think it's that important a feature (it's quite limited in what can be achieved, and if space is a concern, compression does a better job at dedupe).

I have a lot of duplicate files on my system. They may not take up a lot of space, but they're there. So we have to address the issue of duplicate fingerprint hashes. Also, if we allow incremental backups, there will probably be a lot of duplicate blocks between two different versions of the same file. It seems foolish to not de-duplicate those.

But we can restrict deduplication to being exact copies of aligned blocks. Not any-length regions at any position.

The other main benefit of packing files together would be to reduce overhead of zero padding blocks on end-of-file boundaries.

Yes. If we have many small files or many ends-of-files, recovery would be improved by packing them tightly into blocks. E.g., 1000 small files might fit in 1000 blocks separately but 500 blocks if packed together. With 10 recovery blocks, you can recover at most 10 files if the files are each in their own block, but (in the best case) twice that if they are packed together. If each file has its own checksum, you can do sub-block recovery for lost files, which means you might actually get close to the best case.

Still, doing sub-block recovery is hard. And the research on file systems says that the lengths of files on disk tend to follow a log-normal or power-law distribution. That means there are lots of short files but the long files are really long. So, the usual case will be lots of full blocks coming from very long files.

But that doesn't preclude file sets with lots of small files. I believe ReiserFS did a special case for small files. EXT4 stored files less than 60 bytes directly in the inode. Maybe we separate those out as a special case?

I do like the simplicity of block aligned files in PAR2, as it's much easier to deal with and think about. So if you don't consider deduplication important, the choice would be whether you want to support tight packing at the expense of a more complex design.

I'll think on it as I redo the checksums.

I've also been wondering if we could get by with fewer checksums

Checksums on input data or packets? On input data, I think there's currently three checksums (block rolling, block fingerprint, file fingerprint) plus a checksum for the initial segment of a file. If you want a rolling and crypto-hash, you probably can't reduce this below two checksums.

I would like fewer checksums on blocks. There are a lot of blocks so any per-block overhead causes a lot of storage. It prevents us from doing really small blocks.

And the math tells me we don't need so many checksums:

The rolling checksum is designed to fix problems of misalignment. Each misalignment is associated with a damaged block. But if we are only creating 5 recovery blocks, we can only handle 5 damaged blocks. That means at most 5 misalignments. But, currently, we have a rolling checksum for every block. That is a lot.

The fingerprint checksum is designed to find bad blocks. If there are N blocks and 1 of them is bad, we need only log2(N) checksums to find out which one is bad. (See Hamming Codes for how this is done with parity.) If we send R recovery blocks, then we need to know the location of at most R bad blocks. To locate R blocks, we need only log2(N choose R) checksums where "N choose R" = "N! / (R! * (N-R)!)". There's a stack exchange post on it.

So, under the current design, we're sending 2 checksums for every block. And the math says we get by with a lot fewer. I just don't have a design that does it yet.
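
As a rough illustration of that bound (a back-of-the-envelope sketch of mine, not anything from the spec), locating R bad blocks out of N needs about log2(N choose R) bits of checksum information, which is tiny next to a 128-bit hash per block:

```cpp
#include <cmath>
#include <cstdio>

// Approximate log2(N choose R) via the log-gamma function: lgamma(n+1) = ln(n!).
double log2_choose(double n, double r) {
    return (std::lgamma(n + 1) - std::lgamma(r + 1) - std::lgamma(n - r + 1)) / std::log(2.0);
}

int main() {
    double n = 32768;  // PAR2-style maximum block count
    double r = 10;     // recovery blocks, i.e. the most bad blocks we can repair
    std::printf("bits needed to locate %g bad blocks out of %g: ~%.0f\n", r, n, log2_choose(n, r));
    std::printf("bits spent on per-block 128-bit hashes: %.0f\n", n * 128);
    return 0;
}
```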

mdnahas commented 3 years ago

8 StreamSegmentID All packets that belong together have the same StreamSegmentID.

Because StreamSegmentID is same usage as "Recovery Set ID" in PAR2, it's good to be 16-bytes, too. Even when 8-bytes would be enough as a global unique value, it's less random than PAR2's 16-bytes. This isn't a matter of enough or not. Users will hope that PAR3 is better than (or at least equal to) PAR2.

The RecoverySetID was a hash of the recovery set. It had meaning. It also meant that the RecoverySetID was the same even if different people made the par2 file.

The StreamSegmentID is just a random number. Its only job is to be unique if it's a different file. I doubt more than 2^32 StreamSegmentIDs will be created and, with 64 bits, each new ID has less than a 1 in 2^32 chance of colliding with an existing one. That's enough, I think. Besides, I doubt there will be many cases where we even need to resolve a collision.

8 byte[8] Packet type. Can be any value.

It's good to put the Packet type in 8 bytes. Now I can keep it in a single 64-bit (8-byte) integer. So it doesn't need to be an ASCII null-terminated string. Comparing as a 64-bit integer is easier (simpler and faster) than a string comparison.

For example, "PAR DAT\0" becomes 0x0054414420524150 (as little endian). For this comparison, the last "\0" isn't required to be null-string. You may use the last character freely, such like "PAR DATA" (0x4154414420524150).

I did that so that if I run "strings filename.par3 | grep 'PAR'", I can see what packets are in a file and the order of them.
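
As a small illustration (my own sketch, not code from any client), comparing an 8-byte packet type as a single 64-bit integer might look like this, assuming the little-endian encoding described above:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch: treat the 8-byte packet type field as one uint64_t.
constexpr uint64_t kParDataType = 0x0054414420524150ULL;  // "PAR DAT\0" read little-endian

bool is_data_packet(const unsigned char* packet_type_field) {
    uint64_t value;
    std::memcpy(&value, packet_type_field, sizeof(value));  // assumes a little-endian host
    return value == kParDataType;
}
```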

I've also been wondering if we could get by with fewer checksums.

When I thought about PAR3 a while ago, I came up with a simple way. It's possible to classify slices into two classes in a file. They are "indexed input file slices" and "following input file slices". ...

That approach would work, but I don't think it gets us much. At the moment, we have a 16-byte fingerprint hash for each block and we sneak the rolling hash in, as part of the fingerprint hash. So, I don't think we have that much of a penalty for each rolling hash.

it's probably simplest to follow Par2 and only have file-block checksums and not stream-block checksum.

You are right. I didn't consider the stream usage. (As I'm a developer of an application for file usage.)

A packet of each file's information should keep checksums of its input file slices. ...

I want to keep the file slice checksums in a separate packet. Or packets.

First, if we come up with a better way to do the checksums, I want to add a new packet type without replacing the file packet.

Second, I want packets to be small. I want it to be possible for a client to put most packets into a 1500 byte Ethernet frame. If a large file has lots of blocks, its file packet might be very long. Therefore, I want a different packet type to store the checksums.

Also, it's possible to map several slices into one block. You can map some very small files into a block, too. Each file's information may have a special item for the last small slice. ....

I like the idea of allowing a different block and offset for the last fractional block. I'll play with the idea.

mdnahas commented 3 years ago

Went through the spec, and listed some things that poked my curiosity.

Thanks for taking the time!

Every byte of a PAR file is specified. There are no places to throw junk bytes that can be any value.

Actually, with randomness being accepted in places, this isn't entirely true. Overall, I think the statement isn't really necessary and can be confusing, because a recovery format is meant to be able to handle junk bytes thrown at it.

You are right. With randomness, there are bytes that the client gets to choose.

The line is a hold over from Par2. In that specification, the packets were a deterministic function of the input set. That was a good property because two different clients on different sides of the world should generate the exact same packets. We could verify output by comparing packets byte-for-byte.

But that is no longer the case. I'll drop the line.

16 fingerprint hash K12 Hash of packet. Used as a checksum for the packet

I believe K12 prefers hashes to be 256-bit in size, but supports any length. 16 bytes feels a little short for a cryptographic hash. If 32 is too big, what about 24 bytes?

MD5 was 16 bytes. Even 16 bytes seems long!

We support 2^64 blocks and, to avoid the Birthday Problem, we need block hashes to be at least twice 64-bits. So 16-byte hashes for blocks is a minimum.
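
For a rough sanity check (my own arithmetic, using the standard birthday approximation P ~= 1 - exp(-n*(n-1)/2^(b+1))): with n = 2^64 blocks, a 128-bit hash still leaves a collision probability of roughly 0.39, which is why 16 bytes is a floor rather than a comfortable margin.

```cpp
#include <cmath>
#include <cstdio>

// Birthday-bound approximation: P(collision) ~= 1 - exp(-n*(n-1) / 2^(bits+1)).
long double collision_probability(long double n, int bits) {
    long double pairs = n * (n - 1.0L) / 2.0L;
    return 1.0L - std::exp(-pairs / std::pow(2.0L, (long double)bits));
}

int main() {
    long double n = std::pow(2.0L, 64.0L);  // worst case: 2^64 input blocks
    std::printf("128-bit hash: P(collision) ~= %.3Lf\n", collision_probability(n, 128));
    std::printf("192-bit hash: P(collision) ~= %.3Lg\n", collision_probability(n, 192));
    return 0;
}
```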

Every byte is overhead. Especially when added to every block and/or packet. So I'm hesitant to add more to blocks and packets.

If you want to talk about using longer hashes for files and directories, I would consider that.

the contents of the creator packet MUST be shown to the user It is REQUIRED that the client get user approval Clients are also REQUIRED to ask for permission when linking to files outside a subdirectory

I'd replace MUST/REQUIRED with SHOULD. The above is only possible if there actually is a user (something an automated system may not have).

I would say that an automated system does have a user. Somewhere. No system is fully automated --- when the system breaks, someone shows up.

[Creator] UTF-8 text identifying the client, options, and contact information

As the options and contact info is only recommended, I presume there's no suggested way of delimiting them in the string? I would've thought that options would at least be a separate string, so that if a client would display the creator, it wouldn't have to spam the user with a long list of options as well.

I thought about having separate strings for the client name, options, and contact info. If we did that, we'd need to specify a format for "the contact information". E.g., should it be an email address or a URL or something else. I decided that if everything is just going to be shown to the user, we might as well put everything in a single string.

[Segment start] The first field identifies the starting location of the segment in the stream. If this is the first segment in a stream, the starting index is set to 0. Otherwise, the starting index MUST equal the stream's length at the end of the preceding segment.

This seems to make sense in the streaming context, but for file context, where files can be mapped to arbitrary locations, what does the starting location refer to?

It just represents the starting location for new data. It is not that important in the file context.

8 unsigned int The size of the Galois field in bytes.

I still think a limit should be placed on GF size. I get that it may seem limiting, but despite how long ECC has been around, I don't think anyone has come up with any legit use case for >GF128, and going beyond GF64 seems highly questionable. Being too flexible is also limiting, as it restricts assumptions developers can make. Sticking to <=GF64 is kinda nice, because anything larger than int64 is problematic for many languages, and handling >GF64 becomes more difficult because you can't assume values fit in a register.

Setting a limit would also mean you won't need 8 bytes to hold the width. If you don't set a limit, I'd expect client authors to set their own limits anyway, meaning the spec gets subverted by a de-facto limit.

Good points.

I had assumed that every client would implement generic GF code that worked on bytes. And also implement a handful of special-case GFs that would be optimized.

[Data Packet] 8 unsigned int The block index within the input stream.

How does this relate to the value in the stream start packet? Is it an offset of that number (i.e. always starts at 0), or is it relative to the beginning of the stream? I'm guessing the External Data packet's value follows a similar principle.

Yes. I'll make it clear that the offset is the start of the stream, not the segment.

Data packets contain an entire block's worth of data. If the data's length is less than the block size, the rest of the block will be filled with zero bytes. [...] NOTE: This packet is only used to send a complete block of data. If a stream segment ends mid-block

If the Data packet can only send complete blocks, how can the packet size be less than the block size?

Consider sending 2 files and not packing the ends of files into a single block. The fraction of a block at the end of the first file will be in a block by itself, and not take up the entire block. So, the Data packet only needs to hold the fraction of a block that represents the end of the file. The Data packet does not need to contain the zero padding that fills out the rest of the block.

Even if you have just 1 file, it's usually good to have the stream end on a block boundary, so that all the data gets protected. In that case, the last block of the file is probably not full. You can just send a Data packet with the end of the file and the rest of the input block will be filled with zeros. Then that last input block will be protected by Redundancy packets.

[External data] 16*? {rolling hash, 12-byte fingerprint hash} A rolling checksum and finger print for each input block

I think the K12 hash size should be consistent everywhere. 12 bytes seems weird if it's 16 bytes everywhere else.

I agree it is odd.

We need (at least) a 16-byte hash here. But I didn't want to use more than 16-bytes. I achieved that by replacing 4 bytes of the K12 hash with the 4-byte rolling hash.

This decision is far from final. I'm not sure what we'll use for block hashes. But it will probably be 16-bytes in length and some, if not all, will be a rolling hash.

[Segment End] If this is the first segment in the stream, it is the K12 hash of 16 bytes of zeros followed by the data in the segment.

Is this still the case if the "previous segment's" hash was randomly generated?

Yes.

Also, does the hash include the trailing data, or only blocks included in Data packets? If operating in file mode, I presume there is never any trailing data as everything is handled by External Data packets, so will be a concatenation of all data?

Yes, the hash in the Segment End packet covers the trailing data.

In file mode, the client can arrange to not have trailing data. It can fill out the rest of any block with zeroes.

Yes, the hash in the Segment End packet is a hash of all data concatenated.

(NOTE: I am worried about that. If we use the K12 (or Blake3) hash, it means that all file data must be hashed in-order. That makes it hard to do things with multiple threads. I'm still not sure about using K12 (or Blake3) for that reason.)

This hash does put an upper limit of performance achievable though - if K12 maxes out at, say, 3GB/s, you'll never be able to create a PAR3 faster than 3GB/s, regardless of how fast the disk or however many CPU cores are available. With PAR2, a client could at least hash files in parallel to try to circumvent this limit.

Yes. It is a performance issue.

More than one [matrix packet] type can be used at the same time

Can the same type be used multiple times?

Yes.

There will be many Explicit Matrix packets, one for every Recovery packet. If you want R recovery packets, you need R Explicit Matrix packets.

A Cauchy Matrix could be used to protect the first stream segment. Then a different Cauchy Matrix packet could be used to protect the second stream segment.

Note: The 8-bit Galois field is supported by the x86 instruction GF2P8MULB.

I did some more research and found that the 0x11b polynomial isn't primitive so generally not preferred for RS coding. Most implementations of GF(256) I've seen use 0x11d (which can still be accelerated by GFNI, but not via the multiply instruction).

Thanks. I've changed it and added a note.

Note: All 64-bit Galois fields are supported by the x86 instruction CLMUL. Note: The ARM processor has an instruction extension called "NEON" with a VMULL instruction. VMULL.P8 will do eight 8-bit Galois field multiplications at once.

Thought I'd point out that neither of these instructions do reduction, which is why they're called "carryless multiplication" (x86) or "polynomial multiplication" (ARM) as opposed to GF multiplication. They can be used for reduction, but often require a few rounds. The lack of efficient reduction can often be a performance limiter on approaches that rely on these instructions.

[Cauchy] The hint to the number of recovery blocks is used in single-pass situations to allocate buffers. If the number of rows is unknown, the hint is set to zero.

It might be worth making the hint a little stricter, for example, the number is the highest number of recovery blocks that the encoding application knows is available, assuming no corruption occurs (e.g. if the creator is making 100 recovery blocks, the number here should be 100).

I like it being a hint. I'm thinking of Usenet, where if a receiver has too many bad data blocks, they may ask the sender to generate more recovery blocks. In that case, the hint will have the original number of recovery blocks, but the true value would be larger.

How does this work with split PAR files? Should the value be different for every volume?

Good question.

I suppose it could be either.

So far, I've been imagining that the reading client would be disk limited. It could calculate some recovery blocks "for free", because the CPU was idle and waiting for the disk drive to fetch the next block. Or, in the streaming context, calculate the recovery data while waiting for the next block to be sent. How many recovery blocks it got "for free" would depend on the relative speed of the CPU(s) and disk(s). If verification failed and the number of blocks computed "for free" was enough, repair could be done in a single pass.

If verification failed and there weren't enough "for free" recovery blocks, I assumed there would be a second pass that read all the PAR files first before processing the input files.

Do you have thoughts on where the hint will be used and, if so, what its value should be in the multiple PAR file situation?

Otherwise, the matrix's element for input block I and recovery block R depends on I and R. (NOTE: This specification uses zero-index vectors, so I and R start at 0.) Specifically, it is the multiplicative inverse of x_I-y_R, where x_I is the Galois field element with the same bit pattern as binary integer I+1 and y_R is the Galois field element associated with the binary integer MAX-R, where MAX is the maximum binary integer value with the same size as the Galois field. (NOTE: In binary, MAX contains all ones.) To be clear, the multiplicative inverse and subtraction x_I-y_R are done using Galois field arithmetic. The I+1 addition and MAX-R subtraction is done using native integer arithmetic.

So to put it another way, the coefficient is inv( (I+1) ^ (~0-R) ) = inv( (I+1) ^ ~R ) = inv( (-2-I) ^ R ) where inv is the GF inverse and ^ denotes XOR?

"inv( (I+1) ^ (~0-R) )" looks right.

I don't know about the rest. Let's see if I can work it out.

Two's complement negation is defined as: -x = (~x + 1)

= inv( (I+1) ^ (~0 + (~R + 1)) )
= inv( (I+1) ^ (~R + (~0 + 1)) )

And ~0 + 1 = 0 seems right, because every carry bit is 1 until it overflows out of the integer.

= inv( (I+1) ^ (~R) )

And x^~y = ~x^~~y = ~x^y seems right. (Looks like De Morgan's Law for XOR.)

= inv( ~(I+1) ^ R )

And I'm guessing that here, you rewrote two's complement negation as: -x - 1 = ~x

= inv( (-(I+1)-1) ^ R )
= inv( ((-I-1)-1) ^ R )
= inv( (-I-2) ^ R )
= inv( (-2-I) ^ R )

Yep, it looks right. Nice math! Still, just in case we both screwed it up, I'd put the original formula in my code and let the compiler's optimizer figure out the rest. ;)
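
For what it's worth, here is a tiny brute-force check of that identity (my own sketch; it uses unsigned 64-bit wraparound arithmetic for the "native integer" part and does not touch the GF inverse at all):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Check (I+1) ^ (~0 - R) == (I+1) ^ ~R == (0 - 2 - I) ^ R over a range of values,
    // using uint64_t wraparound in place of "native integer arithmetic".
    for (uint64_t i = 0; i < 1000; ++i) {
        for (uint64_t r = 0; r < 1000; ++r) {
            uint64_t a = (i + 1) ^ (~uint64_t(0) - r);
            uint64_t b = (i + 1) ^ ~r;
            uint64_t c = (uint64_t(0) - 2 - i) ^ r;
            if (a != b || b != c) {
                std::printf("mismatch at I=%llu R=%llu\n",
                            (unsigned long long)i, (unsigned long long)r);
                return 1;
            }
        }
    }
    std::printf("identity holds for all tested I, R\n");
    return 0;
}
```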

[Sparse Random] 8 unsigned int maximum number of recovery blocks

I'm not familiar with sparse random matrices, but does this mean that additional recovery cannot be generated after initial creation?

That's correct. You would not be able to make additional Recovery packets using that Matrix packet.

You could, of course, generate a new Matrix packet (either a Sparse Random one with a different seed or a different type altogether) and send Recovery packets for the new Matrix packet.

[Explicit matrix] For each input block that is used to calculate the recovery block, there is a pair of the index of the input block and its Galois Field factor.

If data is being appended (i.e. more input blocks), does this block need to be updated?

You can't update the Explicit Matrix packet. The usual way to handle sending more data is to send a new Explicit Matrix packet with indexes that cover the new stream segment. (And a new Recovery packet to go with it.)

For each input block that is used to calculate the recovery block, there is a pair of the index of the input block and its Galois Field factor. The pairs are in sorted order, with input block indices increasing from lowest to highest.

I presume the inputs not specified are presumed to be zero?

Yes.

Thanks. I added a note to this effect.

[Recovery Data] When calculating recovery data, only complete input blocks are used. (This is to prevent having two different values for a input block when a stream is appended to.)

In the streaming context, the trailing data doesn't have any associated recovery, but in the file context, it does, right?

No. The trailing data never has associated recovery data.

In the file context, clients will usually append zero bytes to completely fill the last block. Then there is no trailing data and all the input data is covered by the recovery data.

?*8 data The recovery block data

I'm guessing the recovery block doesn't need to be a multiple of 8 here.

Oops! Good catch.

[File] Sections of the file are mapped to the input stream. Each section is specified by 4 values. The first value is an offset in the file. The second value is the length of the section to be mapped.

Are we limiting files to 2^64? If so, the second number should be 8 bytes long.

Ok. Let's see what happens when I redo checksums.

Root Packet

If this points to a directory, what should that directory's name be (if relative paths are being used)?

Darn. I meant to explain this with some examples.

If the Root is absolute, the Directory it points to is the root directory, "/".

If the Root is relative, the Directory it points to is the current working directory.

Example: Root(absolute)->Directory->Directory("usr")->Directory("bin")->File("bash") is "/usr/bin/bash"

Another example: Root(relative)->Directory->Directory("src")->File("par3cmdline.cpp") is "src/par3cmdline.cpp" from the current working directory.

Hmm.... this worked better when filenames were stored in the parent Directory packets and not in the File packets and child Directory packets. The name in the top Directory packet is being ignored.

This definitely needs to be spelled out clearly in the spec. And I need to decide if the top-level directory just has an empty string as a filename.

Thanks for pointing this out.

The checksum of the Checkpoint packet identifies the data in the stream

I don't see any Checkpoint packet type defined?

Right. It got renamed "Segment End packet". Sorry.

Fixed it.

It is REQUIRED that the user's permission be granted for any action that might jeopardize their security.

A client won't know what might jeopardize security. Maybe writing a text file causes some external application to buffer overflow and do who knows what - you never know. As such, I think the suggestion there is too strong. You can encourage good security practice, but a client can't predict everything, and this spec cannot dictate how an application chooses to deal with their users (assuming there are any).

True. But that's written in a paragraph about metadata. I think it's a strongly worded sentence, but I don't think it is going too far.

Every program has users. Somewhere, above all the systems and above all the layers, sits a user. Someone who got the system running in the first place and has to examine the log files when it stops working. :)

[Unix File] 8 signed int atime, nanoseconds since the Epoch

I'm guessing this is UTC from what's specified in the directory version (it's omitted from the file specification).

I've added ",UTC" to the File packet description.

Also, the UNIX directory packet doesn't have xattrs specified.

Good catch. Added.

If any UNIX File packet is present in a stream segment, then there is a UNIX File packet for every File packet

This could feel limiting if files happen to be spread across multiple filesystems.

What's your use case? People are protecting multiple hard drives with a single PAR file and some drives are formatted differently than others? It seems odd, but I suppose it could happen. I've certainly had Windows and Linux on the same machine.

Do you have a suggested replacement? I'm not sure how to do a UNIX Root for that use case. The receiving client needs a single checksum for all the files, directories, and metadata. I'm not sure how that works if you're making a PAR file across multiple file systems.

There is a checksum for every UNIX File packets for each file in the directory

An alternative may be to only list files that haven't been listed in the regular directory packet (avoids relisting the entire folder).

True. But I'd like a checksum that reflects all the data. So it has to contain the checksum of the metadata, which is in all the UNIX File packets.

UNIX Root Packet

Is this necessary? From what I can gather, the root should be determinable via regular Root -> Directory <- Unix Directory linking (or Root -> File <- Unix file, if the root is a single file).

It is mostly to have a single checksum for all the UNIX meta data (and regular data via the regular Root).

There's also a question about how attributes of the root should be handled if the fact that it's a directory is largely irrelevant.

True.

users be warned when they create PAR files with names that are incompatible with Windows, Mac, or Linux systems. That is, file or directory names that are more than 255 characters long, start with a period (.) or a dash (-), or contain one of these characters: < > : " ' ` ? * & | [ ] \ ; or newline (\n).

Are there actually any systems that don't like the following?

* names that start with a `.` or `-` (other than those named '.' or '..')

* name containing ' ` & [ ] ;

Have you ever tried to delete a file on UNIX that starts with a dash? Try creating a file named "-foo.txt" and running "rm -foo.txt".

(When it first happened to me, it took me an hour to delete it. Even when I tried to use a GUI, it failed because the GUI just called "rm"! Of course, that was in 1990, before the web, Google, and Stack Exchange. )

Files starting with a "." are hidden files on UNIX. They aren't visible when you run "ls". I think when I wrote the line, I was worried about PAR writing to "." or "..".

Those warnings were copied from the Par2 spec, which I wrote 20 years ago. I think I included the other symbols because they are used by shell commands.

The list of characters restricted by Windows is \ / : * ? " < > | (plus probably all ASCII characters < 32).

On Windows, an absolute path can start "C:\" or "//" for example

UNC paths start with "\\"

Fixed.

I also found Windows absolute paths can start with "\".

For UNIX, that means one starting with "/" or "//".

"//" starts with "/" so the latter doesn't need to be stated.

It doesn't need more code, but I do think it is worth stating. If only because "//" is something some client authors haven't heard of.

Oh, I also added ones starting with "~". I'm not sure where that is handled, but I think it is worth saying.

using a feature like ".." in a path

Is there any reason to even allow "." or ".." as a name? The only case where I can see that making sense and being present would be paths for symlinks or hardlinks. Directories/files should otherwise never be named as such, so a decoder could just mark those as invalid upon seeing them.

I wasn't sure. I can see cases where I could want to use "..". I certainly don't think it should be allowed without a warning.

In order for a client to do single-pass recovery, the recommended order of packets for the stream use-case is:

The way the Explicit Matrix packet is defined, I'm not sure it can be placed before Segment End, as it requires knowing the number of input blocks?

It does require knowing the number of input blocks, but why can it not be used? Some sending clients will know every byte of what they're sending before they send it. The Explicit Matrix packet may not be useful in every situation, but there are situations where it can be used.

To make PAR inside work, the PAR packets need to contain one File packet, which refers to the file itself. Thus, the name in the File packet must match the name of the file

I liked the idea of using a blank filename instead, as it allows the file to be renamed.

I do too. I will add it.

I had mistakenly talked myself out of it, because of the multiple Par file case. That is, there is a file "foo.par3" that contains the primary stream's Root, File, and Directory packets as well as a second stream with self-repairing information. Then, if another file (e.g., "foo.vol00+10.par3") contains more self-repairing Redundancy packets, it needs to use the same Matrix packets and same File packets as the self-repairing information in "foo.par3". For that to work, the File packet in both files has to say "foo.par3".

But that's the multiple Par file case. For the single file case, like Par-inside-ZIP, we should also support an empty file name.

Some questions:

* How to handle duplicate files/directories (perhaps placed under different directories) if the packet's fingerprint hash is identical for each copy?

Good question.

I suppose it is just a copy (not a hard link or a soft link). Then, I guess, it can appear in multiple Directory packets without any conflicts. We should warn client authors that that might happen.

The hashing still prevents us from creating a cycle in the directory tree, so we don't need to worry about that happening.

* Does the decoder need to know whether a protected file is complete or not? At the moment, the only indicating factors would be if there's no mapping for the first byte, or there are visible holes in the mapping. If the file is incomplete because the end isn't mapped, I don't think the client has any way of knowing.

I guess you're asking if Par3 should preserve the file's size.

Certainly for a ZIP file, which is read starting at the back, it is important that there not be any additional data beyond the last protected byte. Actually, for almost every file, we don't want data beyond the last protected byte!

Okay, I think we should be explicit about the parts we are not protecting.

* Would you happen to know anything about composite GF fields?

No.

"No" is probably a little too strong. I think I covered them a little in an Abstract Algebra class a long long time ago. I got the sense that they worked kinda-like multi-digit decimal numbers ... e.g., after you make a full cycle in the 1s place, there's a carry into the 10s place. But the carry can't be a normal carry, because of the field properties. There's an extra shift or something. Something like that. That's my impression from more than a decade ago, so it is probably incorrect.

Thanks again for taking the time to read it and all your comments/questions!

mdnahas commented 3 years ago

And, if I haven't said it, thank you, Yutaka-Sawada, for your comments too.

Yutaka-Sawada commented 3 years ago

The StreamSegmentID is used to identify packets that should be processed together, even if those packets were written to separate PAR 3.0 Recovery files or arrive via different transmission methods. The StreamSegmentID is a globally unique random number.

I think that 64-bit (or even 32-bit) is random enough to distinguish PAR files on local storage. As users mostly select PAR files by their filename, a smaller ID won't cause a problem. But you define that the StreamSegmentID is a globally unique random number. If it should be a unique value over the whole world, it should be at least 16 bytes. This is because a UUID (Universally Unique Identifier) or Microsoft GUID (globally unique identifier) is 128-bit. If you use a 64-bit ID, you had better remove the words "globally unique" from the PAR3 spec.

This decision is far from final. I'm not sure what we'll use for block hashes. But it will probably be 16-bytes in length and some, if not all, will be a rolling hash.

I suggested to use a set of two 64-bit rolling hash as one 128-bit hash in Input File Slice Checksum packet. They are CRC-64 and Multiplicative hash, and both can act as rolling hash. This will eliminate a collision problem at sliding search, because it doesn't need to calculate fingerprint hash at each offset byte.

There was a problem in using these simple hashes for slice's checksum. Such simple fast hash isn't strong at all. It's possible to forge them; CRC or Multiplicative hash. Then, I came up with an idea; file's hash. At creating PAR3 file, it calculates three hash values (CRC-64, 64-bit Multiplicative hash, and 128-bit fingerprint hash) of each Input File Slice. The former two hashes are saved on Input File Slice Checksum packet as rolling hash. The last hash of all slices in the file are used to calculate a file hash, which is saved on File packet. The file hash is a hash value of all slices in the file by tree-hash structure. Thus, it's possible to calculate slice's checksum in any order or in parallel. Even when a malicious hacker could modify a slice data and forge checksum to hide the difference, 128-bit file hash (enough strong fingerprint hash) will reveal that the file is changed. Though a user cannot detect which slice is damaged, he can know the file has something damage. In that case, PAR3 client will treat all slices in the file are damaged.

If you think that the above idea isn't enough, I can add more checksum length. The design of my suggested Multiplicative hash in the previous post was simple. It is based on a static prime value. So, by using a different prime number, the resulting hash value becomes different. It means that you can calculate several different 64-bit Multiplicative hashes at once with the same function. A PAR file may contain some repeated packets. I think they don't need to be exact duplicates.

For example, a checksum packet is repeated 4 times:
* Input File Slice checksum packet contains the set of (CRC-64 and Multiplicative hash A).
* Input File Slice checksum packet 2 contains the set of (CRC-64 and Multiplicative hash B).
* Input File Slice checksum packet 3 contains the set of (CRC-64 and Multiplicative hash C).
* Input File Slice checksum packet 4 contains the set of (CRC-64 and Multiplicative hash D).

When the checksum packet isn't repeated, the rolling hash is two 64-bit hashes. When the checksum packet is repeated 4 times, the rolling hash is five 64-bit hashes. They must be random enough to distinguish slices, and will be hard to forge.
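
Purely to illustrate how a parameterized multiplicative rolling hash can be computed and then slid one byte at a time, here is a generic Rabin-Karp-style sketch of mine working modulo 2^64; it is not the exact hash construction proposed above, just the general technique with the multiplier ("prime") as a parameter so that several independent hashes can share one function:

```cpp
#include <cstddef>
#include <cstdint>

// Generic Rabin-Karp-style multiplicative rolling hash over a fixed window,
// computed modulo 2^64 via unsigned overflow.
class RollingHash {
public:
    RollingHash(uint64_t prime, size_t window)
        : prime_(prime), pow_top_(1) {
        for (size_t i = 1; i < window; ++i) pow_top_ *= prime_;  // prime^(window-1) mod 2^64
    }
    // Hash of the initial window.
    uint64_t init(const uint8_t* data, size_t window) const {
        uint64_t h = 0;
        for (size_t i = 0; i < window; ++i) h = h * prime_ + data[i];
        return h;
    }
    // Slide the window by one byte: drop byte_out, append byte_in.
    uint64_t roll(uint64_t h, uint8_t byte_out, uint8_t byte_in) const {
        return (h - byte_out * pow_top_) * prime_ + byte_in;
    }
private:
    uint64_t prime_;
    uint64_t pow_top_;
};
```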

How to handle duplicate files/directories (perhaps placed under different directories) if the packet's fingerprint hash is identical for each copy?

Furthermore, there is a good usage of this tuple (CRC-64 and Multiplicative hash). While they are slice checksums, they can also indicate duplicate slices. There is a good property; both CRC-64-ISO and the 64-bit Multiplicative hash become 0 for null bytes. When CRC-64-ISO is 0x0000000000000000, the slice data bytes are all 0x00. So, the hash value 0x0000000000000000 can be a special item, and the other 64-bit value can be optional. I can use that 64-bit value to indicate the duplicate slice index.

For example, the value is "slice index + 1":
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000000, slice data = 0x00.
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000001, the slice data is same as #0 slice.
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000002, the slice data is same as #1 slice.

Because the hash value space is 64-bit, it can store any slice index of another slice.

When a PAR3 client reads an Input File Slice Checksum packet, it can know which slice is same as another slice with ease. At creation time, a client needs to compare slices to detect duplicate slices with tuples (CRC-64, 64-bit Multiplicative hash, and 128-bit fingerprint hash). Though the 128-bit fingerprint hash of each slice is un-available at verification time, the compared result is stored in the Input File Slice Checksum packet.

After a client knows the duplicate slices, it can omit them when mapping slices onto blocks. When there are 100 slices and 10 slices are the same, they are mapped onto 90 blocks. Duplicate slices are skipped at mapping automatically. So, the File packet may require two values:
* Starting index of blocks : Where slices are put on blocks' stream
* Number of slices to put : This number doesn't include the number of duplicate slices. This is a hint to see how many blocks the file consumes.

For example, a file consists of 6 slices, which are mapped to 4 blocks:

A file : [SliceA][SliceB][SliceA][SliceC][SliceC][SliceD]
blocks : [SliceA][SliceB][SliceC][SliceD]
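
A minimal sketch (mine, with hypothetical type and function names) of how a creating client could detect duplicate slices with an ordered map keyed on the slice checksum tuple:

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Hypothetical slice checksum tuple: two rolling hashes plus a fingerprint hash.
struct SliceChecksum {
    uint64_t crc64;
    uint64_t mul_hash;
    uint8_t  fingerprint[16];
    bool operator<(const SliceChecksum& o) const {
        if (crc64 != o.crc64) return crc64 < o.crc64;
        if (mul_hash != o.mul_hash) return mul_hash < o.mul_hash;
        return std::memcmp(fingerprint, o.fingerprint, 16) < 0;
    }
};

// Map each slice to a block index, reusing the block of an identical earlier slice.
std::vector<uint64_t> assign_blocks(const std::vector<SliceChecksum>& slices) {
    std::map<SliceChecksum, uint64_t> seen;   // checksum tuple -> block index
    std::vector<uint64_t> block_of_slice;
    uint64_t next_block = 0;
    for (const auto& cs : slices) {
        auto it = seen.find(cs);
        if (it == seen.end()) {
            it = seen.emplace(cs, next_block++).first;  // new unique slice gets a new block
        }
        block_of_slice.push_back(it->second);           // duplicates reuse the earlier block
    }
    return block_of_slice;
}
```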

And, if I haven't said it, thank you, Yutaka-Sawada, for your comments too.

I'm glad if I could help you. Because I don't understand the theoretical aspects of recovery codes so well, I wrote from a programmer's practical view (such as how to implement things). Though I come up with many ideas, I'm not sure whether they are good or useless. Please just pick the good ones.

mdnahas commented 3 years ago

The StreamSegmentID is used to identify packets that should be processed together, even if those packets were written to separate PAR 3.0 Recovery files or arrive via different transmission methods. The StreamSegmentID is a globally unique random number.

I think that 64-bit (or even 32-bit) is random enough to distinguish PAR files on local storage. As users mostly select PAR files by their filename, a smaller ID won't cause a problem. But you define that the StreamSegmentID is a globally unique random number. If it should be a unique value over the whole world, it should be at least 16 bytes. This is because a UUID (Universally Unique Identifier) or Microsoft GUID (globally unique identifier) is 128-bit. If you use a 64-bit ID, you had better remove the words "globally unique" from the PAR3 spec.

"globally unique" is an adjective which means unique when considering all numbers on every machine, vs. "locally unique" which only means unique for this machine. (If I remove "globally", then client authors can use the same random number generator on different machines and that might cause collisions.)

How many bits we need depends on the usage. If I'm only generating 1000 numbers, 32-bits gives me enough uniqueness that there is a less than 1 in 8,000 chance of a duplicate. So, 32-bits provides a pretty good (but not exceptional) probability of global uniqueness.

So it matters what our usage is. How many StreamSegmentIDs do you think we will generate? How often do you think we will compare two different StreamSegmentIDs to determine if they are the same?

I believe there will be fewer than 2^32 StreamSegmentIDs created in the next 20 years.

I believe we will very rarely compare StreamSegmentIDs to see if they are equal. It will definitely happen if people use "PAR inside PAR". The other cases are when someone sticks a lot of Par3 files in the same directory.

Because of that usage, I think 64-bits is enough for the StreamSegmentID. You mentioned the UUID and GUID, but their usage is different. They are used to identify all objects stored everywhere in the world and, when those objects are stored in a hash table, there will be lots of compares for equality. With more objects and more compares, they need a longer ID. They need more than 64-bits.

Notice that for blocks, we're creating a lot more of them (2^64) and they can be all compared with each other. We need at least 128-bits for local uniqueness there. (We should probably add 32 bits more, but I'm willing to cut a corner there.) But I don't think we need 128-bits for StreamSegmentIDs. 64-bits should be fine.

This decision is far from final. I'm not sure what we'll use for block hashes. But it will probably be 16-bytes in length and some, if not all, will be a rolling hash.

I suggested to use a set of two 64-bit rolling hash as one 128-bit hash in Input File Slice Checksum packet. They are CRC-64 and Multiplicative hash, and both can act as rolling hash. This will eliminate a collision problem at sliding search, because it doesn't need to calculate fingerprint hash at each offset byte.

Yes. I remember you suggesting those. I haven't looked at the Multiplicative hash yet. I also want to look at other rolling hashes, including reusing the CRC-64 with the bits permuted.

There was a problem in using these simple hashes for slice's checksum. Such simple fast hash isn't strong at all. It's possible to forge them; CRC or Multiplicative hash. Then, I came up with an idea; file's hash. At creating PAR3 file, it calculates three hash values (CRC-64, 64-bit Multiplicative hash, and 128-bit fingerprint hash) of each Input File Slice. The former two hashes are saved on Input File Slice Checksum packet as rolling hash. The last hash of all slices in the file are used to calculate a file hash, which is saved on File packet. The file hash is a hash value of all slices in the file by tree-hash structure. Thus, it's possible to calculate slice's checksum in any order or in parallel. Even when a malicious hacker could modify a slice data and forge checksum to hide the difference, 128-bit file hash (enough strong fingerprint hash) will reveal that the file is changed. Though a user cannot detect which slice is damaged, he can know the file has something damage. In that case, PAR3 client will treat all slices in the file are damaged.

I'd like to separate this into 2 problems. One problem is making sure the file arrives correctly. The other is forging of blocks.

The way we make sure the file arrives correctly is the hash in the File packet. It doesn't matter how we calculate the file's fingerprint. Right now the specification says K12, but we know that might prevent parallel computation. So, I'm considering other ways to calculate it, including block-by-block (like you mentioned) or Merkle tree approaches.
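
For example, a block-by-block or Merkle-tree-style file hash lets the per-block digests be computed independently (and so in parallel) and then combined. The sketch below is only an illustration of the combining step; the `hash16` function is a placeholder for whatever 16-byte hash gets chosen, not anything defined in the spec:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Digest = std::array<uint8_t, 16>;
// Placeholder for the real 16-byte hash (K12, BLAKE3, ...); not specified here.
using HashFn = std::function<Digest(const uint8_t* data, size_t len)>;

// Combine per-block digests into a single file digest with a binary Merkle tree.
// Assumes at least one leaf digest is present.
Digest merkle_root(std::vector<Digest> level, const HashFn& hash16) {
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (size_t i = 0; i < level.size(); i += 2) {
            if (i + 1 < level.size()) {
                uint8_t pair[32];
                std::copy(level[i].begin(), level[i].end(), pair);
                std::copy(level[i + 1].begin(), level[i + 1].end(), pair + 16);
                next.push_back(hash16(pair, sizeof(pair)));  // hash the concatenated pair
            } else {
                next.push_back(level[i]);  // odd node is promoted unchanged
            }
        }
        level = std::move(next);
    }
    return level.at(0);
}
```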

The other problem is forging of blocks. That is, an evil person can interfere with recovery by creating a second block that matches an original block's hash code. When the forged block is present, the client thinks that all blocks are correct, but, when it calculates the file's fingerprint hash, the checksums don't match. The client can neither repair nor verify the result.

The current draft of the specification prevents blocks from being forged by using a 12-byte K12 hash for each block. Since the K12 is hard to reverse, it is difficult to forge a block.

If we switch to a CRC or another rolling hash for blocks, it will be easier to forge them. Do we worry about forging? With a 128-bit rolling hash, I'm not too worried about a random bit flip causing the hashes to match. The only way this can happen is someone doing it on purpose. It would definitely look bad if someone can create a file that Par3 says is not right but that Par3 also says it cannot repair.

Are there rolling hashes that are hard to forge? E.g., if we used the K12 hash of a CRC hash, would that work? It would be hard to reverse, since the attacker would have to reverse the K12 hash. It would be fast to update the CRC for any window size. The K12 hash might be fast because it is only hashing the 128-bits. But I don't know how fast the K12 is with small size inputs ... I believe it is slower than 3GBps.

Do we need unforgeable blocks? Any other ideas about how to make unforgeable blocks fast?

Furthermore, there is a good usage of this tuple (CRC-64 and Multiplicative hash). While they are slice checksums, they can also indicate duplicate slices. There is a good property; both CRC-64-ISO and the 64-bit Multiplicative hash become 0 for null bytes. When CRC-64-ISO is 0x0000000000000000, the slice data bytes are all 0x00. So, the hash value 0x0000000000000000 can be a special item, and the other 64-bit value can be optional. I can use that 64-bit value to indicate the duplicate slice index.

That's an interesting property.

For example, the value is "slice index + 1":
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000000, slice data = 0x00.
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000001, the slice data is same as #0 slice.
* When CRC-64 = 0x0000000000000000 and Multiplicative hash = 0x0000000000000002, the slice data is same as #1 slice.

Because the hash value space is 64-bit, it can store any slice index of another slice.

If the hash of each block is unique, I'd rather store the block's hash there. The decoding client can detect that they are identical.

Still, this idea has value. We need to assign each input block to a place in the stream (a.k.a. single virtual file). Since I have to change the File packet's mapping and change the External Data packets, the way I'm doing it has to change. This technique might be useful in identifying duplicate input blocks. Or, better, ranges of duplicate input blocks.

I'm glad if I could help you. Because I don't understand the theoretical aspects of recovery codes so well, I wrote from a programmer's practical view (such as how to implement things). Though I come up with many ideas, I'm not sure whether they are good or useless. Please just pick the good ones.

That's what I always try to do: get lots of ideas and pick the good ones. :)

Yutaka-Sawada commented 3 years ago

When CRC-64-ISO is 0x0000000000000000, the slice data bytes are all 0x00.

About the case "CRC is 0", I was wrong. When data is a multiple of generator polynomial, resulting CRC becomes 0, too. Because the prerequisites was wrong, my thought in the subject was bad also. When the checksum is 0, the slice bytes may be null-bytes mostly, but it's not always. Then, the usage is impossible. I'm sorry for the confusion.

Yutaka-Sawada commented 3 years ago

If we switch to a CRC or another rolling hash for blocks, it will be easier to forge them. Do we worry about forging? Do we need unforgeable blocks?

I want it to be unforgeable somehow, at least at MD5 level. Some paranoid users are afraid of forging, and they want a cryptographic hash in PAR3. But the Slice Checksum's 128 bits is too short for cryptographic strength anyway. On the other hand, the file's hash could be a longer size, like a 256-bit or 512-bit hash. As the number of files is less than the number of slices, the file's hash size won't become a problem.

For example, a PAR client calculates a 512-bit fingerprint hash for each Input File Slice. It may store a partial 128 bits (16 bytes) of the 512-bit (64-byte) hash as the Input File Slice Checksum. After calculating the slice checksums, it calculates a 512-bit (64-byte) hash over the whole set of slice checksums and stores it as the Input File's hash. In this way, paranoid users will be satisfied with the hash size.

I would like fewer checksums on blocks. There are a lot of blocks so any per-block overhead causes a lot of storage. It prevents us from doing really small blocks.

I considered how much size would be a problem. I assume a tuple of a 64-bit rolling hash (CRC-64) and a 128-bit fingerprint hash for the Input File Slice Checksum. This checksum size (24 bytes per slice) may become a problem when there are many slices (i.e., a very small block size).

For example, there is 1 TB (= 1024 GB = 2^40 bytes) of source data, and I create recovery data with about 6% redundancy. 1024 GB of source data will make 1024 / 16 = 64 GB of recovery data. I put them in 4 PAR3 files (and 1 index file). For simplicity, I ignore the packet header size, other packet types, and repeated packets.

When the block size is 4 KB (2^12 bytes), there are 268,435,456 (2^28) source blocks and 16,777,216 (2^24) recovery blocks. The total size of the Input File Slice Checksums becomes 2^28 * 24 = 6144 MB = 6 GB.
4KB_block_size.par3 = 6 GB checksum
4KB_block_size.vol00000000+4194304.par3 = 6 GB checksum + 16 GB recovery data
4KB_block_size.vol04194304+4194304.par3 = 6 GB checksum + 16 GB recovery data
4KB_block_size.vol08388608+4194304.par3 = 6 GB checksum + 16 GB recovery data
4KB_block_size.vol12582912+4194304.par3 = 6 GB checksum + 16 GB recovery data
Total file size = 30 GB checksum + 64 GB recovery blocks = 94 GB of PAR3 files
Efficiency is 64/94 * 100 = 68%.

When the block size is 64 KB (2^16 bytes), there are 16,777,216 (2^24) source blocks and 1,048,576 (2^20) recovery blocks. The total size of the Input File Slice Checksums becomes 2^24 * 24 = 384 MB.
64KB_block_size.par3 = 384 MB checksum
64KB_block_size.vol000000+262144.par3 = 384 MB checksum + 16 GB recovery data
64KB_block_size.vol262144+262144.par3 = 384 MB checksum + 16 GB recovery data
64KB_block_size.vol524288+262144.par3 = 384 MB checksum + 16 GB recovery data
64KB_block_size.vol786432+262144.par3 = 384 MB checksum + 16 GB recovery data
Total file size = 1.9 GB checksum + 64 GB recovery blocks = 65.9 GB of PAR3 files
Efficiency is 64/65.9 * 100 = 97%.

When the block size is 1 MB (2^20 bytes), there are 1,048,576 (2^20) source blocks and 65,536 (2^16) recovery blocks. The total size of the Input File Slice Checksums becomes 2^20 * 24 = 24 MB.
1MB_block_size.par3 = 24 MB checksum
1MB_block_size.vol00000+16384.par3 = 24 MB checksum + 16 GB recovery data
1MB_block_size.vol16384+16384.par3 = 24 MB checksum + 16 GB recovery data
1MB_block_size.vol32768+16384.par3 = 24 MB checksum + 16 GB recovery data
1MB_block_size.vol49152+16384.par3 = 24 MB checksum + 16 GB recovery data
Total file size = 120 MB checksum + 64 GB recovery blocks = 64.1 GB of PAR3 files
Efficiency is 64/64.1 * 100 = 99%.

When the block size is 32 MB (2^25 bytes), there are 32,768 (2^15) source blocks and 2,048 (2^11) recovery blocks. The total size of the Input File Slice Checksums becomes 2^15 * 24 = 768 KB.
32MB_block_size.par3 = 768 KB checksum
32MB_block_size.vol0000+512.par3 = 768 KB checksum + 16 GB recovery data
32MB_block_size.vol0512+512.par3 = 768 KB checksum + 16 GB recovery data
32MB_block_size.vol1024+512.par3 = 768 KB checksum + 16 GB recovery data
32MB_block_size.vol1536+512.par3 = 768 KB checksum + 16 GB recovery data
Total file size = 3.75 MB checksum + 64 GB recovery blocks = 64 GB of PAR3 files
Efficiency is 99%.
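To replay the arithmetic above, a small sketch (it hard-codes this example's 24 bytes per slice and 5 PAR files; these are not spec values):

```cpp
#include <cstdio>

int main() {
    const double source_bytes   = 1024.0 * 1024 * 1024 * 1024;  // 1 TB of source data
    const double redundancy     = 1.0 / 16;                     // ~6% -> 64 GB recovery data
    const double checksum_bytes = 24;                           // CRC-64 + 128-bit hash per slice
    const int    par_files      = 5;                            // 1 index file + 4 volume files
    const double GB             = 1024.0 * 1024 * 1024;

    const double block_sizes[] = {4096, 65536, 1048576, 33554432};
    for (double block : block_sizes) {
        double blocks    = source_bytes / block;                 // number of source slices
        double recovery  = source_bytes * redundancy;            // recovery data in bytes
        double checksums = blocks * checksum_bytes * par_files;  // checksums repeated in every PAR file
        std::printf("block %8.0f KB: checksum overhead %7.2f GB, efficiency %.2f%%\n",
                    block / 1024, checksums / GB,
                    100.0 * recovery / (recovery + checksums));
    }
    return 0;
}
```

Running it reproduces the efficiency figures above.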

Setting a 4 KB block size causes bad efficiency. A 64 KB or larger block size is fine. I think 64 KB is small enough, and there is no need to set a smaller block size than that. If you don't require a very small block size, the current checksum size per slice is acceptable.

But packet repetition makes efficiency worse. Because each PAR file includes the Input File Slice Checksum packets, the number of PAR files multiplies the packet duplication. When there are 5 PAR files, the packet size becomes 5 times larger. If a PAR client repeats packets within a PAR file, the number of packets is multiplied again. When it repeats packets 10 times in 5 PAR files, the total magnification becomes 50 times. To mitigate this effect, the block size needs to be large enough.

We need an efficient method to protect the Input File Slice Checksum packets instead of simply repeating them. How about the idea of a "Parity packet"? Because you don't like big packets, you split the Input File Slice Checksum data into many small packets. For example, split 6 GB of Slice Checksum data into 1024 Slice Checksum packets (each packet contains 6 MB of checksums). As each PAR3 file above has 1.6% redundancy, the same level of redundancy gives 16 Parity packets. 16-bit Reed-Solomon codes create 16 Parity packets from 1024 Slice Checksum packets. The additional size is only 16 * 6 = 96 MB. At the same time, "Parity packets" could protect other vital packets, too. For users, this built-in packet protection may be easier than the "PAR inside PAR" method.

mdnahas commented 3 years ago

I spent the last few days trying to find a way to enable lots of tiny blocks without the overhead of a rolling hash and checksum for each block. I came really close, but then realized it was impossible. I spent a lot of time thinking about the ideas, so I want to tell you about them.

Alignment does not need to be handled by a rolling hash. There is a concept called "content-defined chunking". Instead of fixed-size blocks, it uses variable-sized "chunks". A chunk ends when any of a set of patterns occurs in the file. (One set of patterns is when the sum of the last 128 bytes modulo a prime number is equal to a fixed value.) Content-defined chunking handles alignment without rolling hashes, because you can find the boundaries for chunks just by scanning the file.
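A minimal sketch of that boundary rule, with illustrative constants (not from any PAR3 draft): a chunk ends wherever the sum of the last 128 bytes, modulo a prime, hits a fixed value.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Content-defined chunking: return the end offsets of each chunk in `data`.
std::vector<size_t> chunk_boundaries(const std::vector<uint8_t>& data) {
    const size_t   WINDOW = 128;     // look at the last 128 bytes
    const uint32_t PRIME  = 16381;   // arbitrary prime; roughly sets the average chunk size
    const uint32_t TARGET = 0;       // the fixed value that marks a boundary

    std::vector<size_t> ends;
    uint32_t sum = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
        if (i >= WINDOW) sum -= data[i - WINDOW];      // rolling: drop the byte leaving the window
        if (i + 1 >= WINDOW && sum % PRIME == TARGET)
            ends.push_back(i + 1);                     // a chunk ends after byte i
    }
    if (ends.empty() || ends.back() != data.size())
        ends.push_back(data.size());                   // the last chunk ends at end-of-file
    return ends;
}
```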

Variable-sized chunks would cause all sorts of other issues with PAR, but I found solutions for most of them by splitting chunks into equally-sized blocks. The fraction-of-a-chunk at the end isn't much trouble, because we have to currently handle the fraction-of-a-file at the end of each file.

The next savings was realizing that the Par3 file does not need to contain a checksum for each "chunk". We can only recover R of the N chunks, so we need at most log(N choose R) bits in the file, where "N choose R" is the binomial coefficient and equals N!/(R!(N-R)!). log(N choose R) is approximately N*H(R/N), where H is the entropy function H(x) = x*log(1/x) + (1-x)*log(1/(1-x)) (natural log here). It is still linear, but it saves a lot of overhead: H(1%) = 0.06, H(5%) = 0.20, H(10%) = 0.32. So, even if we make the number of recovery blocks equal to 10% of all blocks, this technique saves 68% of the overhead.

That would be pretty amazing right --- get rid of the rolling hash and get rid of 68% or more of the fingerprint checksums. But, alas, it doesn't work. We still need to know which chunk goes where in the file!

Even if we just store the ordering of the chunks, with N chunks there can be N! orderings. Just transmitting which ordering the blocks are in takes log(N!) bits and, using Stirling's approximation, that is roughly N*log(N) bits. So, we're stuck with a super-linear factor in the number of chunks. We could get by with many fewer bits than the 128 bits per block that we're using now (e.g., get by with H(R/N) + log(N) bits per block), but it doesn't quite seem worth all the effort.
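For anyone who wants to check those numbers (natural log, as above), a tiny sanity-check program:

```cpp
#include <cmath>
#include <cstdio>

// Entropy in nats, as used above: H(x) = x*ln(1/x) + (1-x)*ln(1/(1-x))
static double H(double x) { return x * std::log(1 / x) + (1 - x) * std::log(1 / (1 - x)); }

int main() {
    std::printf("H(1%%) = %.2f   H(5%%) = %.2f   H(10%%) = %.2f\n", H(0.01), H(0.05), H(0.10));

    // The ordering term: ln(N!) is roughly N*ln(N) by Stirling; lgamma(N+1) = ln(N!).
    const double N = 1e6;
    std::printf("ln(N!) = %.3g   N*ln(N) = %.3g\n", std::lgamma(N + 1), N * std::log(N));
    return 0;
}
```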

I did like the idea of content-defined chunking. It is very useful for deduplication, which is important if we want to support incremental backups. Basically, if someone inserts anything in the middle of a file, the chunks stay the same at the beginning and end. That doesn't work with blocks, where inserting a single byte can throw off the alignment of the block boundaries. Even if we keep rolling hashes for each block, we might want to include CDC to make incremental backups better. If we kept most blocks the same and ended 1 in 100 blocks on a CDC boundary, it might make incremental PAR files a lot smaller.

animetosho commented 3 years ago

Thanks for all the clarifications to all the questions @mdnahas!

I have a lot of duplicate files on my system. They may not take up a lot of space, but they're there. So we have to address the issue of duplicate fingerprint hashes. Also, if we allow incremental backups, there will probably be a lot of duplicate blocks between two different versions of the same file. It seems foolish to not de-duplicate those.

Ah I see. In such a case, you probably wouldn't want to restrict deduplication to block boundaries (unless you want to align files to block boundaries as well).

So, under the current design, we're sending 2 checksums for every block. And the math says we get by with a lot fewer

I see where you're coming from now.

MD5 was 8-bytes. 16-bytes seems long!

MD5 is 16 bytes. It's generally considered too short for a cryptographic hash these days - most modern crypto hashes are at least 256 bits long (or sometimes slightly shorter like SHA2-224).

Every byte is overhead. Especially when added to every block and/or packet. So I'm hesitant to add more to blocks and packets.

Here's a different thought: for packet checksums, I don't actually think a crypto hash is really necessary because they only need to verify that the packet hasn't been corrupted. We're already implicitly trusting everything in the PAR file, so no need to worry about deliberate modification.
Currently, the hash has a secondary use in being an identifier, which is probably where the crypto hash requirement comes from. If you separate that out (i.e. two fields - a verification hash + identifier), you might be able to just use a shorter hash for verification and only put identifiers on packets that need them.

We need (at least) a 16-byte hash here. But I didn't want to use more than 16-bytes. I achieved that by replacing 4 bytes of the K12 hash with the 4-byte rolling hash.

Why the 16 byte limit? PAR2's 20 bytes doesn't seem to be a problem. I get that saving bytes is important, but I wouldn't skimp on those 4 bytes.

I would say that an automated system does have a user. Somewhere. No system is fully automated --- when the system breaks, someone shows up.

I think you're stretching the definition of 'user'. Automated systems will try to avoid breakage, and a well designed system will often achieve that aim fairly well. Generally maintenance technicians aren't considered to be 'users' of the system. I suppose you could write something like 'user/maintainer' to help avoid such confusion.

But consider a client implemented as a library - that doesn't have a user either. The application implementing the library may have a user, but the library often can't directly interact with them.

I don't think it's a good idea for a spec to try to push these things as far as a 'requirement', because the application developer will have much better knowledge of their target audience/usage and be able to cater to such more appropriately than a spec could ever predict. A spec's purpose is to provide a standard for inter-operability, not a mandate on how developers write applications.

And as previously mentioned, a spec has no way to mandate how an application chooses to interact with its users, so putting 'required' there is kinda pointless anyway. If I designed an application which fully works but doesn't adhere to that 'requirement', who's to say that it's not a valid PAR3 implementation?
Client authors aren't trying to be malicious to their users, so I don't see the need for a spec to try to force some baseline here.

If we did that, we'd need to specify a format for "the contact information". E.g., should it be an email address or a URL or something else. I decided that if everything is just going to be shown to the user, it might as well put everything in a single string.

I'm not sure most users would be terribly interested in knowing the options the PAR was created with. For the purpose of diagnosis, options would be most useful to the developer of the creating application.
Otherwise, I have no objection to a freeform text field.

It just represents the starting location for new data. It is not that important in the file context.

So always put 0 in file context?

I had assumed that every client would implement generic GF code that worked on bytes. And also implement a handful of special-case GFs that would be optimized.

Well, there's also varying degrees of generic code. Generic code which only needs to handle up to GF(2^64) can look quite different to one which handles GF(2^32768), to one which tries to handle GF(2^(2^40)) on systems without enough RAM to hold a single number of that size.

I'm not too fond of the idea of needing to fall back to generic code, as optimisations can be pretty significant (like 20x speed difference). There's no guarantees that clients will all choose to optimise for the same cases, so you'll never really know how much any optimisation will bring.
And users could end up being confused over why some PAR files are just so much slower to deal with than others.

It's also not unreasonable to imagine an enterprising implementation could try to optimise for all cases, if there aren't too many of them. I see no reason to purposefully stifle such an approach.

The fraction of a block at the end of the first file will be in a block by itself, and not take up the entire block. So, the Data packet only needs to hold the fraction of a block that represents the end of the file.

I thought the Data packet was only allowed during streaming?
If you're allowing it for files as well, do you think it'd make sense to be consistent with how the final block is handled between file and streaming modes? If files require padding, I think it'd make sense to do the same with streaming.

A Cauchy Matrix could be used to protect the first stream segment. Then a different Cauchy Matrix packet could be used to protect the second stream segment.

The way it's written sounds like you could have two Cauchy matrices protecting the first segment as well?

I like it being a hint. I'm thinking of Usenet, where if a receiver has too many bad data blocks, they may ask the sender to generate more recovery blocks. In that case, the hint will have the original number of recovery blocks, but the true value would be larger.

The last part is what I meant. "Hint" is too arbitrary in meaning for any application to know how to make use of it, but if you say that the true value should be >= this number, it's clearer.

Do you have thoughts on where the hint will be used and, if so, what its value should be in the multiple PAR file situation?

For a multi-volume setup, one can figure out the available recovery blocks from the filenames.

Otherwise, for a single volume setup, the client either has the full file available (can see all recovery, so no hint necessary), or it's doing something clever with part of the file. I can only see the hint benefitting the latter case; in which case, knowing which blocks are available is more important than the number, though most of the time, it'll be a sequential range. I guess the client can assume the recovery block range is from 0 to the hint number.

Yep, it looks right. Nice math!

Thanks for confirming!

That's correct. You would not be able to make additional Recovery packets using that Matrix packet.
You can't update the Explicit Matrix packet

So only the Cauchy type supports updates or appending using the same matrix?
For non-cauchy, it sounds like each appended segment is effectively a completely separate PAR file (with some linkage to indicate that the two PAR sets are related)?

Example: Root(absolute)->Directory->Directory("usr")->Directory("bin")->File("bash") is "/usr/bin/bash"

I would've thought you'd go something like

Root(absolute)->Directory("/usr/bin")->File("bash")

...assuming all the files you wanted to protect exist under /usr/bin.
Otherwise you'd have a bunch of unnecessary directory info that's irrelevant to the data you're protecting.

Another example: Root(absolute)->Directory->Directory("src")->File("par3cmdline.cpp") is "src/par3cmdline.cpp" from the current working directory.

I'm guessing you meant 'relative' instead of 'absolute' there?

I need to decide if the top-level directory just has an empty string as a filename.

The thing is, I'm not sure you want full directory info on the root. If you use your example for absolute, if it really is "/", then what use does storing ownership info, permissions etc of "/" serve?

The way I'd see it is that the Root packet itself is a directory, but with most of the properties stripped out.

True. But that's written in a paragraph about metadata.

That one in particular is, but the same thing is stated later in a different section:

Clients are REQUIRED to get approval for any action that might compromise security.

Assuming we do want to comply with the requirement, I feel the only way to achieve it is to literally ask the user about any action that occurs (which will encourage the user to seek ways to disable such warnings, which likely defeats the purpose of it in the first place).

You're already relying on the good judgement of the client author to determine what could compromise security. Why not also defer judgement on how the client chooses to deal with the varying degrees of severity or risk?

What's your use case? People are protecting multiple hard drives with a single PAR file and some drives are formatted differently than others? It seems odd, but I suppose it could happen.

Can't see it being a common use case, but it's one I'd be surprised at not working.
Perhaps I have a USB drive, formatted NTFS (for portability), and mounted onto a folder, on an Ext root partition. I build a PAR3 on the parent of this folder, which will contain both Unix and NTFS metadata.

Do you have a suggested replacement? I'm not sure how to do a UNIX Root for that use case. The receiving client needs a single checksum for all the files, directories, and metadata.

I don't feel particularly strongly about needing a single checksum for these cases.
I generally see the metadata packets as offshoots of the main file/directory packets (I mean, they are optional), which means a root wouldn't be necessary.

Have you ever tried to delete a file on UNIX that starts with a dash? Try creating a file named "-foo.txt" and running "rm -foo.txt".

Yeah, you stick './' in front of the name. I think it even detects this and gives you such a suggestion.

$ rm -foo.txt
rm: invalid option -- 'o'
Try 'rm ./-foo.txt' to remove the file '-foo.txt'.
Try 'rm --help' for more information.

But that's just a command-line issue. I doubt any half decent GUI would choke on it.

Those warnings were copied from the Par2 spec, which I wrote 20 years ago. I think I included the other symbols because they are used by shell commands.

It's kinda odd nowadays because it doesn't include a bunch of stuff like ! or # (or % on Windows) and designing for a command-line shell is also an unusual case when the vast majority of users don't use one.

I vote restricting the list to things the OSes actually don't support. But if you want to keep everything there, it might be worth pointing out why they're listed, because it doesn't make a whole lot of sense as is.

I also found Windows absolute paths can start with "".
Oh, I also added ones starting with "~".

So... have you found them all now? Sure there aren't any more? XD

I can see cases where I could want to use "..".

Example?

It does require knowing the number of input blocks, but why can it not be used? Some sending clients will know every byte of what they're sending before they send it.

If you know the exact size, then yes, but most streaming setups don't. For example, if I have the command gzip some_file | par ... | nc remote_server, there's no way for par to know the size up front.

What about transposing the Explicit matrix so that it's attached to input blocks as opposed to recovery? I suppose this restricts the ability to create further recovery though.
Maybe it's just not worth trying to support it for unknown lengths.

"No" is probably a little too strong. I think I covered them a little in an Abstract Algebra class a long long time ago. I got the sense that they worked kinda-like multi-digit decimal numbers ... e.g., after you make a full cycle in the 1s place, there's a carry into the 10s place. But the carry can't be a normal carry, because of the field properties. There's an extra shift or something. Something like that. That's my impression from more than a decade ago, so it is probably incorrect.

I mostly mention it because I've seen people look at using composite fields for effectively larger GF fields whilst using primitives for a smaller field. At least, that's what I'm guessing is the case, as I have no knowledge on the math, so am kinda talking beyond what I know.

If I remove "globally", then client authors can use the same random number generator on different machines and that might cause collisions

I think it's best to describe that a bit more clearly. "Globally unique" does imply GUID to many programmers, who will find it odd that there's only 64 bits.

E.g., if we used the K12 hash of a CRC hash, would that work?

No - it's just as manipulatable as a straight CRC, since being able to have different data generate the same CRC (trivial) will result in the same K12.

It would be hard to reverse, since the attacker would have to reverse the K12 hash

In fact, even that is theoretically possible for CRC64 or smaller, via something like a rainbow table.

Not that they'd need to - they've likely got the source data from which they can compute the unhashed CRC.

Do we need unforgeable blocks?

I like the idea of it, particularly with how prevalent crypto hashes are being used for verification these days - secure hashes are pretty much considered the norm. It can also make application design a little easier in terms of what can be trusted (particularly for cases like dedupe).

Sadly, I have no idea on how to make it fast in a rolling sense.

Thinking about the common usages of PAR2 though (personal backups and Usenet), I don't think forged blocks are particularly concerning, so it might not actually be that important if it can't be fully catered for.


Oh, on the topic of file naming, is there going to be some standard to help indicate an appended PAR (or allow an application to find which file is the base PAR from an appended one)?

animetosho commented 3 years ago

Pinging @akalin in case he has any interest in this as a PAR client author.
(this thread is awfully long, but the spec listed here might be a good summary for now)

Yutaka-Sawada commented 3 years ago

I feel that the current "PAR Inside Another File" feature may be too complex.

When PAR is used inside another file format to protect it, we call it "PAR inside". So, if PAR protects a ZIP file, we call it "PAR inside ZIP". To make PAR inside work, the PAR packets need to contain one File packet, which refers to the file itself. The File packet would map only the protected portions of the file to the input stream. In the case of the ZIP file, only the start and end of the file would be mapped to the input stream. The middle of the file, where the PAR packets are stored, would not be mapped.

Long ago, I wrote an example of PAR2 usage explaining how to add a recovery record to a ZIP or 7-Zip archive. I put the text in MultiPar's Help documents. Because most people won't read the documents, I uploaded the instructions to my homepage, too. So PAR2 can already perform "PAR inside ZIP".

The only reason this exists is for the "PAR inside" feature. If we do "PAR inside ZIP", the protected regions are the compressed files at the front of the file and the directory structure at the end. It does not cover the space in the middle where we'll put the PAR packets.

I think PAR3 won't require special items in its packets. I mentioned those items (offset and length for mapping) in the File packet before. Since PAR2 can handle such usage, PAR3 will do the same without setting a mapped range. MultiPar recognizes such an archive file as being in an "Appended" state when the ZIP file is complete. Maybe a "PAR inside" or "PAR attach" state would be good for a damaged ZIP file, but it's difficult to detect that usage now. When the ZIP file is repaired, the attached PAR2 packets are removed. This problem may be solved in PAR3, because the File packet's filename will indicate the usage.

It is possible to protect the non-redundant parts of a PAR file using "PAR inside". This is called "PAR inside PAR".

The original idea (making PAR files that protect other PAR files) may be good. I agree with that usage when the PAR packets are stored in independent PAR files. But putting the "inside" PAR packets in the same file as the "outside" PAR packets will cause problems.

For example, I assume a system like below:
source_data : This is an input file.
outside.vol#+#.par3 : This PAR file protects source_data.
inside.vol#+#.par3 : The second PAR file protects the first PAR file.
PAR_inside_PAR.vol#+#.par3 : This is a combination of outside.vol#+#.par3 and inside.vol#+#.par3.

The PAR inside packets would be distinguishable from the "outside" PAR packets because they would have a different StreamSegmentID.

Though a PAR client can distinguish the packets and recognize their usage, it will be difficult to determine what the user wants to do. When the PAR files are independent, it's easy to process: when a user selects outside.vol#+#.par3 for verification, the PAR client verifies source_data; when a user selects inside.vol#+#.par3 for verification, the PAR client verifies outside.vol#+#.par3. But when a user selects PAR_inside_PAR.vol#+#.par3 for verification, which file does the PAR client verify? Maybe the PAR client shows a query message asking for the user's choice, such as: "Which file do you want to verify? This file itself (PAR_inside_PAR.vol#+#.par3) or the source file (source_data)?" If the PAR client is a GUI, there may be buttons for "Verify", "Repair", and "Self-test". While a PAR client author may develop a good method, it will be bad for automated tasks.

"PAR inside PAR" data can be generate and scattered through out the file.

I'm not sure how to do this. inside.vol#+#.par3 was made for outside.vol#+#.par3. You can split these PAR3 files at packet boundaries. You can put "inside" PAR packets between "outside" PAR packets. But the block size is different from the packet size. You need to split outside.vol#+#.par3 and insert a lot of temporary space, resulting in PAR_inside_PAR.vol#+#.par3. Then you create inside.vol#+#.par3 by setting a proper mapping for the temporary PAR_inside_PAR.vol#+#.par3. After that, you split inside.vol#+#.par3 and put its packets into the many temporary spaces of PAR_inside_PAR.vol#+#.par3. Though this task is possible, pre-calculating the space (the size of the partial "inside" PAR packets) between "outside" PAR packets will be complex and bothersome.

These two problems are created by the "PAR inside PAR" style (putting 2 PAR sets in the same PAR file). The "PAR outside PAR" style (using independent PAR files) is easier, and PAR2 can already do it. I don't know why you want such a complex system for PAR3.

mdnahas commented 3 years ago

Thanks for all the clarifications to all the questions @mdnahas!

I have a lot of duplicate files on my system. They may not take up a lot of space, but they're there. So we have to address the issue of duplicate fingerprint hashes. Also, if we allow incremental backups, there will probably be a lot of duplicate blocks between two different versions of the same file. It seems foolish to not de-duplicate those.

Ah I see. In such a case, you probably wouldn't want to restrict deduplication to block boundaries (unless you want to align files to block boundaries as well).

Right. If we rely on equal-sized blocks, inserting or deleting a single byte can throw off the deduplication.

This is why I was interested in the "content-defined chunking". Chunks are variable sized and the boundaries align with the contents of the file. That way, if a single byte is inserted or deleted, only a handful of chunks get messed up.

Unfortunately, variable-sized chunks won't work as well with Reed-Solomon or any algorithm that works with fixed-sized blocks. I need to think more on it, but I'm guessing we can find a good compromise, where 90% of blocks are full-sized and 10% end on a content-defined boundary. That would mean a single-byte change would cause us to store at most 10 blocks in the incremental backup.

MD5 was 8-bytes. 16-bytes seems long!

MD5 is 16 bytes. It's generally considered too short for a cryptographic hash these days - most modern crypto hashes are at least 256 bits long (or sometimes slightly shorter like SHA2-224).

Ooops! Yes, you're right, 16 bytes.

Yes, K12's and Blake3's default output is at least 32 bytes. A longer hash (almost) always means it is harder to crack.

Every byte is overhead. Especially when added to every block and/or packet. So I'm hesitant to add more to blocks and packets.

Here's a different thought: for packet checksums, I don't actually think a crypto hash is really necessary because they only need to verify that the packet hasn't been corrupted. We're already implicitly trusting everything in the PAR file, so no need to worry about deliberate modification.

Okay. I can get behind the "if the villain can get us to trust any PAR packet, we're already gone" approach.

I had been kinda worried that a villain could sneak in a single packet (e.g., in an additional PAR file) and prevent the whole system from working.

Currently, the hash has a secondary use in being an identifier, which is probably where the crypto hash requirement comes from. If you separate that out (i.e. two fields - a verification hash + identifier), you might be able to just use a shorter hash for verification and only put identifiers on packets that need them.

Yes. I've been thinking of this for File/Directory/Root packets. If we want a 32-byte fingerprint for every file and the entire file system, we probably don't want to reuse the packet hash, but have them use a separate fingerprint hash of their contents.

We need (at least) a 16-byte hash here. But I didn't want to use more than 16-bytes. I achieved that by replacing 4 bytes of the K12 hash with the 4-byte rolling hash.

Why the 16 byte limit? PAR2's 20 bytes doesn't seem to be a problem. I get that saving bytes is important, but I wouldn't skimp on those 4 bytes.

I thought 16 bytes was enough and I didn't want to add more overhead per block. I figured 12-byte fingerprint and a 4-byte rolling hash was probably as secure as a 16-byte fingerprint. (Yes, it is easier to hack, but it's also so weird that fewer people would want to hack it.)
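Just to picture the layout (the byte order here is my own choice, not the draft's):

```cpp
#include <cstdint>
#include <cstring>

// Pack the 16-byte per-block checksum described above: 12 bytes of the
// fingerprint hash followed by the 4-byte rolling hash.
void pack_block_checksum(const uint8_t fingerprint[16],   // e.g. a K12 digest, truncated
                         uint32_t rolling_hash,
                         uint8_t out[16]) {
    std::memcpy(out, fingerprint, 12);        // first 12 bytes: fingerprint hash
    std::memcpy(out + 12, &rolling_hash, 4);  // last 4 bytes: rolling hash (endianness left unspecified here)
}
```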

I would say that an automated system does have a user. Somewhere. No system is fully automated --- when the system breaks, someone shows up.

I think you're stretching the definition of 'user'. ....

My concern is fixing broken clients. I think a client breaking will be a rare event. But when it happens, I really want someone to see "Something went wrong. Please email xxx@yyy.com and copy-and-paste all the information on your screen". I want it to be very very simple to report a problem.

An incompatibility is a rare event. But with many clients written by many authors, it is a real possibility. And any incompatibility is a real pain to everyone and really hard to track down. So, I want every client to make reporting it easy. And I want the information reported to be useful.

How do you want the specification to read? I'll write something different so that you're happy with it. I just care that the PAR file contains the information to help fix any problem and that every client makes it easy to report this problem.

It just represents the starting location for new data. It is not that important in the file context.

So always put 0 in file context?

No! You always put the stream's length from the previous Segment End packet.

In the file context, if it's a new Par file, you put 0. If you're doing an incremental backup, you put the length of the stream in the previous backup.

I had assumed that every client would implement generic GF code that worked on bytes. And also implement a handful of special-case GFs that would be optimized.

Well, there's also varying degrees of generic code. Generic code which only needs to handle up to GF(2^64) can look quite different to one which handles GF(2^32768), to one which tries to handle GF(2^(2^40)) on systems without enough RAM to hold a single number of that size.

True. But if the spec says "any", you write one for any.

I'm not too fond of the idea of needing to fall back to generic code, as optimisations can be pretty significant (like 20x speed difference). There's no guarantees that clients will all choose to optimise for the same cases, so you'll never really know how much any optimisation will bring. And users could end up being confused over why some PAR files are just so much slower to deal with than others.

Yes, but the speed difference will probably happen anyway. Different processors have different hardware and accelerate different GFs. It will not be equal speeds on every machine.

And we don't know what the hardware of the future will be. Perhaps 512-bit GFs will be the fastest in the future. I'd rather not have to redo the specification for every piece of hardware. And listen to how the new GF doesn't work on older Par clients.

The fraction of a block at the end of the first file will be in a block by itself, and not take up the entire block. So, the Data packet only needs to hold the fraction of a block that represents the end of the file.

I thought the Data packet was only allowed during streaming?

No.

The Data packet can store input blocks during the streaming or file context. Our usual file usage is to send the data inside the original files, but it can also be sent inside the Par file. Par2 has an optional "Input File Slice Packet", where the input data is stored inside the PAR file.

If you're allowing it for files as well, do you think it'd make sense to be consistent with how the final block is handled between file and streaming modes? If files require padding, I think it'd make sense to do the same with streaming.

I would say it is already consistent. In the streaming context, the input stream ends when it ends. It could be on a block boundary or it could be in between. In the file context, it is the same: the stream can end on a block boundary or it could be in between.

In the file context, we can control where the stream ends, so it makes more sense to pad the data so that it ends on a block boundary. That way, the final block is included in the recovery data. But the file context does not have to do that.

Similarly, in the streaming context, if you can control where the stream ends, it makes sense to have it end on a block boundary. Not all streams can pad their data, but many can.

A Cauchy Matrix could be used to protect the first stream segment. Then a different Cauchy Matrix packet could be used to protect the second stream segment.

The way it's written sounds like you could have two Cauchy matrices protecting the first segment as well?

Yes.

It is kinda strange with the Cauchy Matrix, because if the values in the packet are the same, then the redundancy blocks are the same. Perhaps it is better to think about Random Sparse Matrices that use different keys for the random number generator.

I like it being a hint. I'm thinking of Usenet, where if a receiver has too many bad data blocks, they may ask the sender to generate more recovery blocks. In that case, the hint will have the original number of recovery blocks, but the true value would be larger.

The last part is what I meant. "Hint" is too arbitrary in meaning for any application to know how to make use of it, but if you say that the true value should be >= this number, it's clearer.

A "hint" means you cannot rely on the value being true. I like that client writers will be careful not to rely on the exactness of the value. I prefer the word "hint".

Do you have thoughts on where the hint will be used and, if so, what its value should be in the multiple PAR file situation?

For a multi-volume setup, one can figure out the available recovery blocks from the filenames.

I don't think we should be relying on filenames.

Otherwise, for a single volume setup, the client either has the full file available (can see all recovery, so no hint necessary), or it's doing something clever with part of the file. I can only see the hint benefitting the latter case; in which case, knowing which blocks are available is more important than the number, though most of the time, it'll be a sequential range. I guess the client can assume the recovery block range is from 0 to the hint number.

Will a client read all the Par3 files before processing the input files? If so, the hint is unnecessary in that case.

It does make a difference for the streaming case, where it basically tells the receiver how many buffers to create.

That's correct. You would not be able to make additional Recovery packets using that Matrix packet. You can't update the Explicit Matrix packet

So only the Cauchy type supports updates or appending using the same matrix?

Kinda. The Cauchy matrix is different from the other matrices because it is basically infinite. It can have 2^64 rows and columns, which is far larger than the number of input and recovery blocks. I suppose you could make the other matrices really large too, but there's no need to.

If the sender doesn't know how much data is being sent at the very beginning, it can use the Cauchy Matrix packet and adjust the size of the matrix to fit the data. The Recovery packets contain a pointer to the matrix packet and to the Segment End packet, which basically tells where to cut off the Cauchy matrix. So, if you make a recovery packet with the same Cauchy Matrix packet and a different Segment End packet, the size of the Cauchy matrix is different.
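For readers wondering why the Cauchy matrix is "basically infinite": each element is just 1/(x_i + y_j) for distinct field elements, so any entry can be computed on demand, and the Segment End only decides how many columns actually get used. A sketch, assuming PAR2's GF(2^16) with polynomial 0x1100B purely for illustration (PAR3's field and element assignment may differ):

```cpp
#include <cstdint>

// GF(2^16) multiply, using PAR2's reduction polynomial x^16+x^12+x^3+x+1 (0x1100B).
// Only here to make the sketch concrete; PAR3 may use a different field.
static uint16_t gf_mul(uint16_t a, uint16_t b) {
    uint32_t r = 0, aa = a;
    while (b) {
        if (b & 1) r ^= aa;
        aa <<= 1;
        if (aa & 0x10000) aa ^= 0x1100B;   // reduce back into 16 bits
        b >>= 1;
    }
    return (uint16_t)r;
}

// Inverse by exponentiation: a^(2^16 - 2) = a^-1 for a != 0.
static uint16_t gf_inv(uint16_t a) {
    uint16_t result = 1, base = a;
    for (unsigned e = 0xFFFE; e; e >>= 1) {
        if (e & 1) result = gf_mul(result, base);
        base = gf_mul(base, base);
    }
    return result;
}

// One Cauchy matrix element: element(i, j) = 1 / (x_i + y_j), where the x_i
// (recovery rows) and y_j (input columns) are distinct field elements.
// The assignment below is hypothetical; it keeps the two element ranges
// disjoint for indexes below 2^15.
static uint16_t cauchy_element(uint16_t recovery_index, uint16_t input_index) {
    uint16_t x = recovery_index;
    uint16_t y = 0x8000 | input_index;
    return gf_inv(x ^ y);                  // addition in GF(2^n) is XOR
}
```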

For non-cauchy, it sounds like each appended segment is effectively a completely separate PAR file (with some linkage to indicate that the two PAR sets are related)?

I wouldn't put it that way. The appended segment can reuse File and Directory packets from the original PAR file. The appended segment can use recovery data from the original PAR file. Just because it has new Matrix packets doesn't mean very much.

Example: Root(absolute)->Directory->Directory("usr")->Directory("bin")->File("bash") is "/usr/bin/bash"

I would've thought you'd go something like

Root(absolute)->Directory("/usr/bin")->File("bash")

...assuming all the files you wanted to protect exist under /usr/bin. Otherwise you'd have a bunch of unnecessary directory info that's irrelevant to the data you're protecting.

But they are not irrelevant if we do an incremental backup with a file inside "/usr/".

Another example: Root(absolute)->Directory->Directory("src")->File("par3cmdline.cpp") is "src/par3cmdline.cpp" from the current working directory.

I'm guessing you meant 'relative' instead of 'absolute' there?

Yes, relative.

I need to decide if the top-level directory just has an empty string as a filename.

The thing is, I'm not sure you want full directory info on the root. If you use your example for absolute, if it really is "/", then what use does storing ownership info, permissions etc of "/" serve?

The way I'd see it is that the Root packet itself is a directory, but with most of the properties stripped out.

After I wrote my email, I started thinking along those lines. I think you're right.

True. But that's written in a paragraph about metadata.

That one in particular is, but the same thing is stated later in a different section:

Clients are REQUIRED to get approval for any action that might compromise security.

Assuming we do want to comply with the requirement, I feel the only way to achieve it is to literally ask the user about any action that occurs (which will encourage the user to seek ways to disable such warnings, which likely defeats the purpose of it in the first place).

You're already relying on the good judgement of the client author to determine what could compromise security. Why not also defer judgement on how the client chooses to deal with the varying degrees of severity or risk?

I want to make sure the client author thinks about these things. They are probably more knowledgeable about how a hack of the system would happen than the user. How would you suggest wording it?

What's your use case? People are protecting multiple hard drives with a single PAR file and some drives are formatted differently than others? It seems odd, but I suppose it could happen.

Can't see it being a common use case, but it's one I'd be surprised at not working. Perhaps I have a USB drive, formatted NTFS (for portability), and mounted onto a folder, on an Ext root partition. I build a PAR3 on the parent of this folder, which will contain both Unix and NTFS metadata.

Okay. It's a little odd, but within reason. I'm certainly mounting an EXT4 in an NFS partition, so that my files appear inside a Windows VM.

Do you have a suggested replacement? I'm not sure how to do a UNIX Root for that use case. The receiving client needs a single checksum for the all the files, directories, and metadata.

I don't feel particularly strongly about needing a single checksum for these cases. I generally see the metadata packets as offshoots of the main file/directory packets (I mean, they are optional), which means a root wouldn't be necessary.

I'll have to think about how this would work. It's mind-bending.

Have you ever tried to delete a file on UNIX that starts with a dash? Try creating a file named "-foo.txt" and running "rm -foo.txt".

Yeah, you stick './' in front of the name. I think it even detects this and gives you such a suggestion.

$ rm -foo.txt
rm: invalid option -- 'o'
Try 'rm ./-foo.txt' to remove the file '-foo.txt'.
Try 'rm --help' for more information.

Nice solution!

And on Sun SPARC 4s, there was definitely no suggestion!!

But that's just a command-line issue. I doubt any half decent GUI would choke on it.

Those warnings were copied from the Par2 spec, which I wrote 20 years ago. I think I included the other symbols because they are used by shell commands.

It's kinda odd nowadays because it doesn't include a bunch of stuff like ! or # (or % on Windows) and designing for a command-line shell is also an unusual case when the vast majority of users don't use one.

I vote restricting the list to things the OSes actually don't support. But if you want to keep everything there, it might be worth pointing out why they're listed, because it doesn't make a whole lot of sense as is.

Ok.

I also found Windows absolute paths can start with "". Oh, I also added ones starting with "~".

So... have you found them all now? Sure there aren't any more? XD

Nope. Not sure at all.

I can see cases where I could want to use "..".

Example?

If I have a package that installs a binary in "/usr/bin/" or "~/bin/" and want to install a config file in "/etc/" or "~/".

[I have to go some place now. I will reply to the rest later.]

animetosho commented 3 years ago

I had been kinda worried that a villain could sneak in a single packet (e.g., in an additional PAR file) and prevent the whole system from working.

I suppose a client could assume the base PAR file to be the most trustworthy, and treat further volumes with less trust (i.e. if they conflict, prefer info from the base PAR file).

How do you want the specification to read?

Just say the practice is strongly recommended and put the reasoning you wrote there as justification. The client author can figure out whether it makes sense for their use case.

In the file context, if it's a new Par file, you put 0. If you're doing an incremental backup, you put the length of the stream in the previous backup.

Hmm, I think I get it now. So updating files requires the new data to be appended to the end of the virtual file, even if the updates are to stuff in the middle.
If an update doesn't append anything, it's possible that multiple streams may have the same "previous stream's total length" value.

And we don't know what the hardware of the future will be. Perhaps 512-bit GFs will be the fastest in the future

I think there's basically no likelihood of that ever happening. Larger multiplication circuits are more complex and slower than smaller ones. Even if CPU designers add dedicated hardware for larger GF, it's pretty much guaranteed to be slower than the smaller ones.

Also keep in mind that the largest units CPUs (and GPUs, for that matter) deal with is 64 bits. This hasn't really changed in the past ~20 years, and probably won't change much in the next 20 or so. Perhaps 128 bit operations will gain popularity, but anything beyond that is highly unlikely.

I think GF64 is a very safe bet for anything in the next 20 years, but if you want to be extra safe, up to GF128 will cater for basically any development that occurs.
Personally, the likelihood that GF128 will ever be measurably better than GF64 is so low that I can't see it being worth the added complexity (and GF128 is already so far along the line of pointlessness that anything larger certainly has zero value), but you can be the judge of that.

Similarly, in the streaming context, if you can control where the stream ends, it makes sense to have it end on a block boundary. Not all streams can pad their data, but many can.

I'm thinking of how it'd likely be implemented as a library. Yeah, ending full block is ideal, but that might not be possible, or the application using the stream might not care.
I can't see any case where padding is impossible, since you know the block size and the number of bytes streamed at that point?

A "hint" means you cannot rely on the value being true. I like that client writers will be careful not to rely on the exactness of the value. I prefer the word "hint".

I think the word "hint" is fine, but its intention should be described more. "Hint" by itself, means very little, and without further clarification, I don't see how a client author would know how to make use of it.

I don't think we should be relying on filenames.

I get where you're coming from, though in many circumstances, they're used (for example, a Usenet downloader to know how much recovery it should download). Perhaps not relied upon fully, then again, what it's substituting is a hint anyway.

Will a client read all the Par3 files before processing the input files?

If you can't rely on the filenames, then you'd probably have to parse all PAR files, in which case, the hint would be useful.

It does make a difference for the streaming case, where it basically tells the receiver how many buffers to create.

Streaming is a bit of an interesting case in which additional recovery cannot be added. Of course, the decoder can't rely on all recovery blocks being intact.

I wouldn't put it that way. The appended segment can reuse File and Directory packets from the original PAR file. The appended segment can use recovery data from the original PAR file. Just because it has new Matrix packets doesn't mean very much.

Sorry, I meant from an input mapping / recovery packet perspective. Yeah, there's reuse in metadata, but from a non-metadata perspective, it seems like the appended streams are almost completely separate.

But they are not irrelevant if we do an incremental backup with a file inside "/usr/".

Doesn't the root get overshadowed in such a case? Original root is at /usr/bin, the new root is at /usr. Existing directories can still be referenced.

I want to make sure the client author thinks about these things. They are probably more knowledgeable about how a hack of the system would happen than the user. How would you suggest wording it?

Replace 'required' with 'should' or 'recommended'.

If I have a package that installs a binary in "/usr/bin/" or "~/bin/" and want to install a config file in "/etc/" or "~/".

The way I'd see it is that the root should point to the deepest directory that is common amongst all files/folders, meaning that backtracking with '..' should never be necessary.

If you have /usr/bin/binary and /etc/config, then the deepest common directory is '/'. On the other hand, if you only had /usr/bin/binary and /usr/lib/library, the deepest common directory is '/usr'.
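A rough sketch of that rule, just plain path-string handling (nothing PAR-specific):

```cpp
#include <string>
#include <vector>
#include <sstream>
#include <iostream>

// Split an absolute path into its components, skipping empty parts.
static std::vector<std::string> split(const std::string& path) {
    std::vector<std::string> parts;
    std::stringstream ss(path);
    std::string part;
    while (std::getline(ss, part, '/'))
        if (!part.empty()) parts.push_back(part);
    return parts;
}

// Deepest directory common to all protected paths, so the Root can point
// there and '..' is never needed.
static std::string deepest_common_dir(const std::vector<std::string>& paths) {
    if (paths.empty()) return "/";
    std::vector<std::string> common = split(paths[0]);
    common.pop_back();                       // drop the filename itself
    for (size_t p = 1; p < paths.size(); ++p) {
        std::vector<std::string> parts = split(paths[p]);
        size_t n = 0;
        while (n < common.size() && n + 1 < parts.size() && common[n] == parts[n]) ++n;
        common.resize(n);                    // keep only the shared prefix of directories
    }
    std::string result;
    for (const std::string& c : common) result += "/" + c;
    return result.empty() ? "/" : result;
}

int main() {
    std::cout << deepest_common_dir({"/usr/bin/binary", "/etc/config"})      << "\n";  // "/"
    std::cout << deepest_common_dir({"/usr/bin/binary", "/usr/lib/library"}) << "\n";  // "/usr"
}
```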

mkruer commented 3 years ago

@mdnahas, I am still trying to digest (or is that ingest?) all the PAR logic. One thing that piqued my interest was your "enable lots of tiny blocks without the overhead". This is not a well-thought-out idea on my part, but would it be possible to create the smaller blocks from a larger block after the fact? Or was this part of your "then realized it was impossible"?

Yutaka-Sawada commented 3 years ago

As animetosho mentioned before, keeping everything in one thread makes the subjects hard to trace;

this thread is awfully long, but the spec listed here might be a good summary for now

It would be good to make a new repository (such as par3cmdline) under Parchive. Then, he could use the top-page readme.md to show the progress of the PAR3 project, and put par3_spec.txt in as source code. It's possible to separate subjects into Issues, which would be easier to read. Furthermore, a user would be able to request a feature or ask a question by creating a new issue.

boggeroff commented 2 years ago

Unfortunately most livebusinesschat.com links are dead. I did find this, though: Some problems for next PAR design

mdnahas commented 2 years ago

I'm back working on par3. I wasn't feeling well.

I've been reading about "content-defined chunking", which breaks up files into variable-sized chunks (rather than fixed-size blocks) based on the contents of the files. E.g., chunk boundaries occur whenever the hash of an 8-byte window ends with 14 zero bits. The good part of content-defined chunking is "deduplication", where we identify the same data in multiple files.

However, I don't think it will work. First of all, content-defined chunking works best on random data, like the output of a compression algorithm. Since we don't require compression, we could run into problems with the distribution of chunk sizes. But most important is the problem of making variable-sized chunks work with an error-correcting code that expects fixed-size blocks. I've thought about it for a while and I haven't come up with a solution.

I'll keep thinking about it. We already allow some blocks to be smaller than the blocksize, because the end-of-file doesn't always align with the end of a block. So, I may allow a few blocks to be smaller than the blocksize in the middle of a file, so that we can do some deduplication and reuse blocks even if they aren't aligned on blocksize boundaries in every file. Maybe. The rare probability of a small win is probably not worth the cost of complexity.

I'm also rethinking the streaming features. I really like the idea of a streaming protocol and a pipeable usage like:
metadata2data inputdir | mergefiles | deduplication | compression | encryption | par3 > output.par3
But so much of this design has been forced by the requirement to work with files. I might abandon the streaming features and put them into another design.

I agree with @Yutaka-Sawada that this thread is super long and hard to follow. I don't want to create a repository "par3cmdline" because we don't have any code and the specifications don't live in the code repos. The current version of specifications lives in the "doc" directory of the website repo: https://github.com/Parchive/parchive.github.io/tree/master/doc

What do you think about moving the PAR3 discussion to an issue there? I could even check drafts of PAR3 into the repo.

animetosho commented 2 years ago

There's plenty of 'specification repositories' around on Github. It actually makes a lot of sense, as you can make use of the versioning/tagging capabilities of Git, have others update it via pull requests, inline comments etc.
I'd keep it separate from PAR2, and not make some placeholder 'par3cmdline' repository, in my opinion.

On a different note, since I don't really understand it myself: do you know anything about the techniques used in https://github.com/catid/leopard ?

mdnahas commented 2 years ago

I know only a little. Leopard and FastECC use the FFT algorithm to decode Reed-Solomon quickly. FastECC used a Galois Field with a prime number of elements, but Leopard works with our GF(2^p), as long as it has a particular value for the generating polynomial. I skimmed the paper. I could not tell if it needed a special matrix to make it work, but I don't think so.

Leopard Paper

mdnahas commented 2 years ago

So, my tentative redesign is:

We drop streams. So, we don't need segments, byte indexes, or the Segment End packet. The end-of-input-data packet will be the Root packet. I may move some fields that were in the Segment End packet to the Root packet.

We also need big changes to the File packet, which mapped bytes of the file to bytes in the input stream. It will now map blocks of the file to input block indexes. I think it works this way:

The entire file is broken into chunks. Usually, a file will just contain a single chunk, but the concept of chunks is necessary for files that share data and when we stick Par data inside another file. So a file is made of non-overlapping chunks. Each chunk has an entry in the mapping, in order. If a part of the file is not protected, we indicate that in the mapping. (Previously, that part of the file was not present in the mapping.) For each chunk, the data in the mapping is:

  1. the length of the chunk in bytes
  2. a fingerprint hash for the entire chunk. (e.g., 32-byte K12 hash)
  3. if the chunk's length > blocksize, then we have the index used to store the first block. The subsequent blocks in the chunk take up subsequent indexed blocks. So, if the first block of the chunk has index 5, the second block will have index 6, the third block has index 7, etc. If we don't want to protect the data in this chunk, we store the value of 2^64-1 in this field. NOTE: It is possible to have different files (or even the same file) overlap input block indexes. This is how we show data is shared by files.
  4. If the chunk's length % blocksize != 0, then we have data about the end of the chunk. This is only used if the end of the chunk doesn't fill a complete block. For it, we store that last fraction-of-a-block's rolling checksum and fingerprint checksum. We also store the index of the input block that contains it and its offset within that block. (This means we can pack lots of ends-of-files into the same input block.) If we don't want this chunk protected, we store the value 2^64-1 in the index-of-input-block field.

The whole file is covered by the chunks. So, we can determine where each chunk starts in the file by summing up the lengths of the chunks before it. We also know the file's length by summing up the length of all the chunks.
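To make the mapping easier to picture, here is a rough struct for one chunk entry as described above; the field names and widths are mine, not the spec's:

```cpp
#include <cstdint>

const uint64_t UNPROTECTED = UINT64_MAX;   // 2^64 - 1 marks an unprotected chunk

// One entry in a File packet's chunk mapping (illustrative layout only).
struct ChunkEntry {
    uint64_t length;                // 1. chunk length in bytes
    uint8_t  fingerprint[32];       // 2. hash of the whole chunk (e.g., a 32-byte K12 hash)

    // 3. used if length > blocksize: index of the input block holding the first
    //    block of the chunk; following blocks occupy consecutive indexes.
    uint64_t first_block_index;     //    UNPROTECTED if this chunk isn't protected

    // 4. used if length % blocksize != 0: the tail fraction of a block.
    uint64_t tail_rolling_hash;     //    rolling checksum of the tail (width illustrative)
    uint8_t  tail_fingerprint[16];  //    fingerprint checksum of the tail
    uint64_t tail_block_index;      //    input block that contains the tail (UNPROTECTED if none)
    uint64_t tail_offset;           //    byte offset of the tail within that block
};
```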

This design does allow us to share data between files. If the sending client wants to do it, it can find the data that overlaps between files and encode it in the mappings. The sending client doesn't have to do that, but it can. The decoding client has the much easier job of recovering the data.

This design also allows the sending client to store lots of little files, or lots of the ends of files, in a single input block. The sending client doesn't have to, but it can. If it does, the file contains checksums both for the separate little pieces and for the entire block. (The whole block's checksum will be either in the Data packet or the External Data packet.) The design allows for smaller-than-blocksize pieces of files, but is optimized for blocksize-d pieces. So, the decoding client can scan files with a fixed-size window to find most of the blocks. The left-over pieces can be checked against the end-of-chunk pieces. Decoding clients could probably do more to search for the tiny pieces, but I think just checking the left-over pieces will be enough for most clients.
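A sketch of that scan, under two stated assumptions: the rolling hash is a simple running byte sum (only to keep the sketch short; a real client would use whatever rolling hash the spec settles on), and `fingerprint16` is a hypothetical, declared-only fingerprint binding:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical fingerprint binding (e.g., a truncated K12); declared only.
void fingerprint16(const uint8_t* data, size_t len, uint8_t out[16]);

struct BlockInfo { uint8_t fingerprint[16]; uint64_t block_index; };

// Scan a damaged file with a block-size window, one byte at a time.
// `known` maps a rolling checksum to the candidate block(s) with that checksum.
std::vector<std::pair<size_t, uint64_t>>                    // (file offset, block index) matches
scan_for_blocks(const std::vector<uint8_t>& file, size_t block_size,
                const std::unordered_multimap<uint32_t, BlockInfo>& known) {
    std::vector<std::pair<size_t, uint64_t>> found;
    if (file.size() < block_size) return found;

    uint32_t rolling = 0;
    for (size_t i = 0; i < block_size; ++i) rolling += file[i];   // sum of the first window

    for (size_t pos = 0; pos + block_size <= file.size(); ++pos) {
        auto range = known.equal_range(rolling);
        for (auto it = range.first; it != range.second; ++it) {
            uint8_t digest[16];
            fingerprint16(&file[pos], block_size, digest);        // confirm only on rolling-hash hits
            if (std::memcmp(digest, it->second.fingerprint, 16) == 0)
                found.push_back({pos, it->second.block_index});
        }
        if (pos + block_size < file.size()) {                     // slide the window by one byte
            rolling += file[pos + block_size];
            rolling -= file[pos];
        }
    }
    return found;
}
```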

I'm not sure if I want to include a fingerprint hash for the entire file. Most files will just contain a single chunk, so adding another hash seems redundant. But we do want per-chunk fingerprint hashes, because of the Par-inside-another-file case.

How does that sound?

I still need to reread everyone's notes about the File/Directory/Root packets and see if we want to change those. I am also thinking of changing the UnixFile/UnixDirectory/UnixRoot packets, since there would be a lot of packets, most of them just storing file permissions, and file permissions are reused heavily across files.

animetosho commented 2 years ago

Thanks for the response - I was interested in Leopard as it seems to be an MDS code but faster than classic dense matrices.


I like simpler specs, so it sounds good!

The chunking concept sounds like it could be simpler to implement than the previous way of arbitrarily mapping to the single virtual file.

Presumably this means that files can share common chunks, but a single file cannot reuse the same chunk?
Will be interesting to see how it deals with handling updates/modifications.

I'm not sure if I want to include a fingerprint hash for the entire file. Most files will just contain a single chunk, so adding another hash seems redundant

I like the idea of excluding file hashes. It allows a client to break up a file into multiple chunks, if it's worried about hashing speed on a single thread.
Theoretically, it should also allow chunk size = block size, so hashes could be reused between the two, to some extent.

mdnahas commented 2 years ago

A single file could reuse the same chunk. A file is a list of chunks, so there's no reason the chunks cannot be repeated. E.g., a chunk might have a single block that is full of zero bytes and a file could very easily include multiple copies of that chunk.

There will be file hashes. That's for certain. We're focused on recovery and the only way to make sure you've transmitted a file correctly is to have a hash for it. I can also guarantee a tree of hashes-of-file-hashes, so the whole snapshot of files has a single hash.

I know you care that our speed is limited by file hashes, but speed is not our top priority. Making sure that when we say "the set of files arrived correctly", they actually did, is. That means I want as little code as possible on our end between the hashing of data and the single hash for each file. Also, I want as little of our code as possible between the hash of the data and the single hash for the whole set of files. That's how we have the best assurance that the output data is the same as the input data. That's how we achieve our top priority.

The only question is whether hashes-of-chunks are good enough to replace hashes-of-files, because most files will be a single chunk, and because we don't need to hash the parts of a file that are not protected. The fact that you said you're willing to introduce unnecessary chunks to get speed makes me think that hashes-of-chunks are not good enough. I do not want client authors making that trade-off.

This is not to say I don't care about speed. I do. Just below correctness. When I look at hashcodes, I'm looking at the fastest cryptographic hashes, like Blake3 and KangarooTwelve. I even looked at Meow Hash, which is fast but not cryptographic. (It hasn't been studied enough, unfortunately.)

mdnahas commented 2 years ago

I've thought some more on the Root/Directory/File packets.

It's clear that the Root needs to represent either the root "/" or current directory "~/". So, it needs, like the Directory packet, to have a list of hashes for Directory and File packets.

There are two approaches to naming. We can either put the name of the file in the File packet or put the name with the File packet's hash in the Directory (or Root) packets. If file names are in the File Packet, then we are storing a tree. If the file names are in the Directory (or Root) packets, we have a directed acyclic graph (a.k.a., DAG).

The DAG gives us hard links for free. That is, the same file can exist in multiple directories. But hard links are not supported by some file systems, like FAT and exFAT (see the Microsoft docs). So, I think we'll go with the tree. Hard links will be a file-system-specific feature.

The other benefit of putting the filename in the File packets is that Directory and Root packets will be smaller. If a directory contains a lot of files with long filenames, a Directory packet could get really large, and I'd prefer more equally sized packets, since a very large packet is more vulnerable to damage.
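
To make the chosen layout concrete, here is an illustrative sketch of the tree approach, with the filename stored in the File packet and Directory/Root packets holding only child hashes. The field names and the 16-byte packet-hash size are assumptions for the example, not the spec.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

using PacketHash = std::array<uint8_t, 16>;  // assumed packet-hash size

// Tree approach (chosen): the name travels with the File packet.
struct FilePacketBody {
    std::string name;                  // filename stored in the File packet
    // ... chunk mapping, per-chunk fingerprints, etc.
};

struct DirectoryPacketBody {
    std::string name;                  // the directory's own name
    std::vector<PacketHash> children;  // hashes of child File/Directory packets
};

// DAG alternative (not chosen): children would instead be (name, hash) pairs,
// so one File packet could appear under several names, i.e. a hard link.
```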

mdnahas commented 2 years ago

For Windows systems, there isn't a single root directory. They have drives. And there are two different kinds of "absolute paths".

For absolute paths that start with a drive letter, like "C:\foo\bar.txt", I think we should have: Root(absolute) -> Directory("C:") -> Directory("foo") -> File("bar.txt")

For absolute paths that start without a drive letter, like "\baz\buzz.jpg", we can have: Root(absolute) -> Directory("baz") -> File("buzz.jpg")

So, Windows systems will have to recognize that "C:" as the directory name directly below an absolute Root means a drive-based absolute path. And, if the directory name does not match "[A-Z]:", then it is an absolute path without a drive letter and should apply to the current drive.
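
A minimal sketch of that check, assuming directory names arrive as plain strings; is_drive_directory is an illustrative helper, not something from the spec.

```cpp
#include <string>

// True if a top-level directory name under an absolute Root names a drive,
// i.e. it matches "[A-Z]:"; anything else applies to the current drive.
bool is_drive_directory(const std::string& name) {
    return name.size() == 2 &&
           name[0] >= 'A' && name[0] <= 'Z' &&
           name[1] == ':';
}
```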

Does that make sense? The other option is to always force there to be a drive and, optionally, create a drive name that means "current drive".

mdnahas commented 2 years ago

I'm now working on UNIX file permissions. Since many files and directories have the same permissions, it seems strange to have a separate permissions packet for each File and Directory packet.

Hard and symbolic links are different from permissions. So, we can add a Link packet, which holds: the hash of the Directory/Root packet containing the link, the hash of the File/Directory packet being linked to, the name of the link, and a bit indicating whether the link is hard or symbolic. And since NTFS and other file systems treat hard and symbolic links similarly, we might be able to reuse this packet alongside the packets for those filesystems.

A UNIXPermissions packet will hold the permissions (times, owner, group, i_mode, xattrs) as well as a list of hashes of File, Directory, and Link packets. The same permissions will be applied to all the files, directories and symbolic links in the list.

Most filesystems do not store permissions for hard links. Most UNIX filesystems store permissions of symbolic links, but ignore them. But BSDs and MacOS do use them. I'm proposing that we allow symbolic links to have permissions. The encoding client doesn't need to include them. The decoding client and OS can decide to ignore them.

Lastly, the UNIXGrouping packets will form a tree that contains the hashes of all the Link and UNIXPermissions packets. The root of the UNIXGrouping tree will also contain the hash of the Root packet. That way, if you get the root of the UNIXGrouping tree, you have all the files and all the permissions.

Technically, the UNIXGrouping packet will contain a list of Link, UNIXPermissions, and UNIXGrouping packet hashes and an optional hash of the Root packet.
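
Here is an illustrative in-memory sketch of those three packet bodies. The field names, types, and the 16-byte packet-hash size are my assumptions for the example, not the spec's wire format.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using PacketHash = std::array<uint8_t, 16>;  // assumed packet-hash size

// Link packet: one hard or symbolic link.
struct LinkPacketBody {
    PacketHash parent_dir;   // Directory/Root packet containing the link
    PacketHash target;       // File/Directory packet being linked to
    std::string link_name;   // name of the link
    bool is_symbolic;        // false = hard link, true = symbolic link
};

// UNIXPermissions packet: one set of permissions shared by many targets.
struct UnixPermissionsPacketBody {
    uint64_t atime, ctime, mtime;     // timestamps
    uint32_t owner, group;            // uid / gid
    uint32_t i_mode;                  // permission bits
    std::vector<std::pair<std::string, std::string>> xattrs;
    std::vector<PacketHash> targets;  // File, Directory, and Link packets
};

// UNIXGrouping packet: a tree node over Link/UNIXPermissions/UNIXGrouping
// packets; the tree's root may also carry the hash of the Root packet.
struct UnixGroupingPacketBody {
    std::vector<PacketHash> children;
    bool has_root_hash = false;
    PacketHash root_hash{};           // optional hash of the Root packet
};
```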

mdnahas commented 2 years ago

One of the current open issues is whether we should support multiple file systems at the same time. Someone's example was that they had an NTFS partition mounted on their Linux system. Similarly, I currently have an EXT4 partition mounted in a Windows VM.

I'm not sure yet, but we might be able to support that by having, instead of a UNIX-specific packet like "UNIXGrouping", a "PermissionsGrouping" packet that supports UNIXPermissions, NTFSPermissions, FATPermissions, and whatever else we might create.

mdnahas commented 2 years ago

We have a new draft. It is still incomplete, but it feels very close.

I made the following changes:

  * added chunks
  * added "tail packing"
  * added that the Root packet acts like a directory
  * added a Link packet for symbolic and hard links
  * changed UNIX Permissions to combine file and directory
  * added a PermissionsGrouping packet to store permissions, rather than a UNIX Root
  * dropped explicit support for 16-byte lengths for files

I'm still not sure of the best language for talking about parent/child input sets for incremental backup, nor for talking about outer/inner input sets for Par-inside-Par. At the moment, those sections sound distant and imprecise.

Do I need to change the order of the packet descriptions? It felt strange to talk about the "input block index" in the Data packet description before I talked about chunks in the File packet description.

UNIX files often have a lot of permissions in common. But they rarely share atime, ctime, and mtime. Is it worth allowing UNIX Permissions packets to set permissions for multiple files/directories? Should I have unique atime/ctime/mtime for each?

Should the Root packet contain the hash of the top-level Permission Grouping packet, rather than vice versa?

Of course, any other comments are very welcome.

Par3_spec.txt

mdnahas commented 2 years ago

I spent an hour or two looking into FAT/FAT32/exFAT permissions. They're pretty easy.

I spent 4 to 8 hours looking into NTFS permissions. It took me a long time to even find a good document. The people who wrote the Linux driver wrote their own documentation!

The disk format is insane. There are timestamps in the $STANDARD_INFORMATION attribute and in the $FILE_NAME attribute. And there can be 4 file names!! After all this time, I haven't even figured out how the file's owner is recorded, let alone how permissions are determined for the owner.

Par3's design goal is not to be the perfect archiving format. It's only supposed to have minimal support for storing metadata. I'm quite willing at this point to say that our support on Windows is FAT's file permissions plus hard and symbolic links. That is, we would not support owners, groups, access control lists, GUIDs, multiple file names, or "named data streams".

Is FAT + links okay for NTFS? If not, I'm going to ask you to design the NTFS Permissions packet.

Yutaka-Sawada commented 2 years ago

Is FAT + links okay for NTFS?

Though I don't know the details of NTFS nor FAT, supporting a minimum of common metadata will be OK. I'm not sure which case requires such information. For example, my external HDD for backup was FAT32, while my main drive is NTFS. USB memory, CD-R, and DVD-R aren't NTFS either. NTFS-specific metadata may not be stored anyway.

mdnahas commented 2 years ago

I filled in most of the last TODOs in the specification.

I added detail for FAT file permissions and how to generate a sparse random matrix. I dropped support for NTFS permissions.

We need to make a few decisions about hashes and other parts, but I think we're close to being able to start a reference implementation in C, C++, or maybe Rust.

Par3_spec.txt