Parchive / par3cmdline

Official repo for par3cmdline and par3lib
GNU Lesser General Public License v2.1

Discussion of Par3 Specification #1

Open mdnahas opened 2 years ago

mdnahas commented 2 years ago

Par3's specification is in near-final form. I created this issue to discuss any changes that need to be made, as we work on a reference implementation of Par3.

The earlier discussion about the Par3 specification is in this issue on the par2cmdline repo on GitHub:

https://github.com/Parchive/par2cmdline/issues/130

The current draft of the specification is available at:

https://parchive.github.io/doc/Parity_Volume_Set_Specification_v3.0.html

Yutaka-Sawada commented 2 years ago

I found a typo in the spec, line 578;

Table: File Packet Body Contents

It should probably read "Table: Contents of a chunk description".

I have a question about how to map input files into input blocks. Is there any rule about the order in which input files are mapped? The PAR2 spec defines the order of files by their FileIDs in the Main packet. In the PAR3 spec, the Root packet has a list of File packets, so their order is defined. But it seems that this has no relation to how input files are mapped into input blocks.

mdnahas commented 2 years ago

No, there is no required relation between file order and the mapping of input files to input blocks. I wanted that freedom, in case multiple threads were processing files separately.

It is best to use low numbers for the input blocks. E.g., if you use 50 input blocks, they should be numbered 0 to 49. (Or something like that.) Using low numbers leaves room for incremental backups.

Yutaka-Sawada commented 2 years ago

I saw an interesting point in the spec, line 347 under Start Packet Body Contents;

| 2 | unsigned int | The size of the Galois field in bytes.

The Galois field size is 2 for 16-bit Reed-Solomon codes like PAR2, and 1 for 8-bit Reed-Solomon codes like PAR1. Currently, most programming languages support 64-bit integers (8 bytes). Recent CPU SIMD extensions support 128, 256, or 512 bits (16, 32, or 64 bytes). But wouldn't a 2040-bit Galois field (255 bytes) be more than enough for PAR3? I feel that 1 byte for this item would be enough.

Also, when only XOR is used, as in simple parity or LDPC, may the Galois field size be stored as 0? In that case, is the generator item simply absent from the Start packet?

mdnahas commented 2 years ago

Hmmm.... for some reason, I thought SIMD 512 was 512 bytes, not bits. Yes, I think 1 byte is enough.

Change made. The attached MarkDown file has the fix.

For the moment, I left out support for XOR. First, I feel like there should be a generator polynomial that acts exactly like XOR, but I haven't taken time to look into it. Second, I wondered how much it would be used. Par3 operates on bytes, and even using random 1-byte Galois field factors will significantly improve an LDPC's recovery rate. Lastly, it isn't hard to add support for it later, if we want.

Parity_Volume_Set_Specification_v3.0.md

mdnahas commented 2 years ago

I remembered why we don't need XOR: all Galois Field addition is XOR. We can create a recovery block using only XOR by making all the factors in the Code Matrix equal to 1. And that works for any Galois Field of any size.

If a client wants fast LDPC, they can just do the optimization that multiplying any input block by a factor of 1 is just the input block itself.
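
To illustrate the point, here is a minimal sketch (an assumption about how a client might code it, not taken from the spec) of building an XOR-only recovery block by setting every Code Matrix factor to 1, so that Galois field multiplication disappears and only XOR (field addition) remains:

```c
#include <stddef.h>
#include <stdint.h>

/* In any GF(2^n), addition is XOR, so a recovery block whose Code Matrix
 * factors are all 1 is simply the XOR of the input blocks: multiplying by
 * 1 is the identity, so no field multiplication is needed at all. */
void xor_recovery_block(uint8_t *recovery, size_t block_size,
                        const uint8_t *const *input_blocks, size_t count)
{
    for (size_t b = 0; b < block_size; b++)
        recovery[b] = 0;

    for (size_t i = 0; i < count; i++)
        for (size_t b = 0; b < block_size; b++)
            recovery[b] ^= input_blocks[i][b];   /* factor 1: plain XOR */
}
```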

Yutaka-Sawada commented 2 years ago

While I was thinking about how to map input files onto input blocks, I found some interesting points. Basically, PAR3 is more flexible than PAR2, so the construction may vary between PAR3 client developers. It is complex enough to be a headache.

Compared to PAR2, it is difficult (or almost impossible) in PAR3 to predict the number of input blocks. In PAR2, there is a strong relation between block size and number of input blocks. When a user sets a block size in a PAR2 client, it can calculate the number of resulting input blocks for the given input files. Also, when a user sets a preferred block count, it can suggest a reasonable block size for the given input files. In PAR3, the new features (packing of tail chunks and deduplication) may reduce the actual number of input blocks below the pre-calculated number. So the difference between a user's specified block count and the resulting block count will be larger.

There may be a valid range of block size. Because it is stored as a 64-bit integer (8 bytes), the format's maximum block size is 2^64 - 1. The maximum is also limited by the total size of the input files: when all input files fit in one input block, the block size equals the total file size, and setting a block size larger than the total file size is pointless. If the packing feature isn't used, the maximum useful block size is the largest file size. As for the minimum, setting a block size smaller than 8 is pointless: when the block size is 1 to 8 bytes, the CRC-64 checksum alone can restore each input block without any recovery blocks. Thus, it may be good to state that 8 is the minimum block size.

I came up with another case where recovery blocks are not required. When all input files are 39 bytes or smaller, they can be put in tail chunks by setting the block size to 40 bytes. In this case, there are no input blocks at all, because each input file is stored as raw data in its File packet. When some files are 40 bytes or larger and others are 39 bytes or smaller, setting the block size to 40 bytes or larger means the small files can be restored without recovery blocks. Developers should be careful in handling those small files (39 bytes or less), because PAR2 had no such feature.

I have a question about the External Data packet. This packet contains checksums of input blocks. Does it require checksums of all input blocks? When an input block consists of one or more tail chunks, their checksums are already stored in the File packets. If it's possible to omit the checksums of such tail-only input blocks, the External Data packets may need only as many checksums as there are chunks in the File packets. Though it's possible to store checksums of all input blocks, duplicated checksums may not be useful.

Furthermore, when the block size is less than 24 bytes, the checksums in External Data packets become larger than the original input file data. In this case, it may be better to store the raw data of the input blocks, as is done for small tail chunks. Or it may be good to recommend a block size larger than 24 bytes for efficiency. In practice, users will rarely set such a small block size anyway.

mdnahas commented 2 years ago

Yes, it is harder to determine the number of input blocks. But I'm not sure that users care about that. I think they care more about how small a piece of damage can be repaired and how much recovery data there is.

Par3 is not good for very small blocks. There is too much overhead. I would be surprised if anyone wants to use block sizes below 1024 bytes, because of the overhead. Personally, I would want blocks of at least 2400 bytes, because the overhead is at least 24 bytes per block and I'd want the overhead to be less than 1%.

I don't know that every 8-byte value has a unique CRC-64-ISO. It's possible, but I don't know that it is guaranteed.

Yes, large collections of tiny files would result in everything being stored in the tail data of the File packets. But I don't see a way around that ... other than zipping or tarring the files before using Par3.

The External Data packet is for a sequence of input blocks. The specification does NOT require that every input block has to be covered by a checksum in an External Data packet. It is up to the client to decide if they have enough information to do the recovery. I suppose a client author could try to deduplicate checksums, but I don't think there is much data to be saved by doing that.

As I said before, the minimum overhead is 24-bytes per block and around 131-bytes per file. So, it is not suited to small blocks or many very tiny files. If users want the overhead to be less than 1%, we're talking block sizes at least 4kb.

Yutaka-Sawada commented 2 years ago

While trying to get file information through the C runtime library, I may have found a possible problem in the File System Specific packets (UNIX Permissions packet and FAT Permissions packet). There is some text under the "File System Specific Packets" section in the PAR3 spec. A PAR3 client may write only its favorite or safe items into a packet. Another PAR3 client may read those favorite or safe items from the packet and apply the attributes or permissions to a file. For example, storing and restoring the timestamp is safe and may be useful.

Now, each item has two states: a written value or zero. When a PAR3 client didn't write an item to avoid a risk (or couldn't get the information), the item becomes zero. (The PAR3 spec requires unused bytes to be filled with zero.) When another PAR3 client reads the item as zero bytes, it cannot distinguish whether the value was originally zero or the item was simply not written. This may be a flaw, because zero has a meaning as an attribute value.

For example, there is "i_mode" item in UNIX Permissions Packet. The item has bit, which is non-zero for read or write permission. When the value is zero, it means that all users have no read/write permission. Another PAR3 client may not determine that the file should have permission or not.

For example, there is "FileAttributes" item in FAT Permissions Packet. The item has a bit for "Archive" attribute. This bit is non-zero normally after creation or modification. When the value is zero, it means that it needs to erase the attribute, which was set by default. (Because this attribute isn't used nor refered mostly, there may be no problem.)

Thus, there should be a bit flag that indicates whether the item was written. For example, always setting either the Directory or the File bit flag would work. Both the "i_mode" and "FileAttributes" items have these two bits. Because one of these flags is always non-zero, a written item can never be all zero. When a PAR3 client reads an item that is zero, it then knows the writing client did not fill in the item.

By the way, I felt that 1 byte might be enough for the "FileAttributes" item in the FAT Permissions packet. There are many unused bits in the field, and it's strange to allocate a larger field than required. In any case, it's impossible to use the raw bit data from the Win32 API GetFileAttributes directly, so I think you don't need to keep the original bit alignment. A programmer can rearrange the bit flags easily.

mdnahas commented 2 years ago

I'm not sure what you mean by "favorite or safe".

I think you mean that a client might write some piece of data in the UNIX or FAT Permissions packet, but not all of them. For example, it might write timestamps, but not the UID or GID.

The GNU C library supports all of the UNIX Permission packet fields, except for "xattr". https://www.gnu.org/software/libc/manual/html_node/File-Attributes.html I assumed that xattr could be left empty.

I assumed that any Windows client would support all the FAT Permissions packet fields.

The permission packets were not required, so any Java or Javascript client can just ignore them.

If a client wants to fill in some but not all parts of a Permissions packet, I figured that they could find reasonable default values:

Yes, if the encoding client writes those default values into the Permissions packet, the decoding client needs to respect them, if possible. But, then again, if they were not written into the file, the decoding client would need to make something up anyway.

Perhaps I should add that to the specification. Particularly about the UID/GID set to MAX_INT and usernames set to empty string

As for the size of the FAT Attributes field, I think it's okay to use 1 more byte to make it hard to screw up. (The per-file overhead is about 130 bytes, so adding 1 more isn't too expensive.) Those 2 bytes are returned by GetFileAttributesA and accepted by SetFileAttributesA. Actually, those functions return/expect 4 bytes, but I thought only 2 bytes were necessary, so I did save 2 bytes.

mdnahas commented 2 years ago

I added the "unset values" to the specification. Parity_Volume_Set_Specification_v3.0.md

Yutaka-Sawada commented 2 years ago

I think you mean that a client might write some piece of data in the UNIX or FAT Permissions packet, but not all of them.

Yes, I do. Some users requested a feature to keep the modification time of source files a while ago. The C runtime library can restore the timestamp. Because the modification time is the only item common to the UNIX and FAT Permissions packets, storing it may be useful for both kinds of users.

I figured that they could find reasonable default values:

  • timestamps set to current time

I think the default value for a timestamp should be 0, because it is not a real-world value. The definition of "current time" is ambiguous, as time passes while the PAR3 files are being created.

The same UNIX Permissions can be applied to multiple files, directories, or links. The same FAT Permissions can be applied to multiple files, directories, or links.

According to the PAR3 spec, the same packet may be applied to multiple files, directories, or links. But this becomes impossible when timestamps differ, even if the permissions are the same. I think timestamps normally differ, unless a user sets a specific time manually. Or do most files share the same timestamp on Linux/UNIX? If timestamps differ between files, each file needs its own Permissions packet. I feel it may be good to split the timestamps from the permissions, for example into two independent packet types: a Permissions packet and a Timestamp packet.

The i_mode value contain the lower 12 bits of EXT4's values for the i_mode. The top 4 bits are all zeroes.

According to the PAR3 spec, only 12 bits are used in the 16-bit field. Then, is it OK to set 0x8180 (regular file, owner read+write) as the default i_mode value? Because I don't know much about Linux/UNIX, I may not be the right person to touch the UNIX Permissions packet.

Start packets are used to generate the InputSetID, which is included in the header of every packet. The InputSetID is the first 8 bytes of the Blake3 hash of the body of the Start packet.

When I read how the InputSetID is calculated, I understood that the 8-byte value's randomness depends on another random 16-byte hash value. So I felt that CRC-64 might be enough to represent the packet body, instead of using the first 8 bytes of the Blake3 hash. Calculating a Blake3 hash of another Blake3 hash may be overkill. (Hashing twice does not add randomness.)

There are many ways to generate a globally unique random number. One way is to use the fingerprint hash of the triple consisting of: a machine identifier, a process identifier, and a high-resolution timestamp.

I'm not sure whether you intended this big difference between PAR2 and PAR3. In PAR2, the SetID is specific to the input files and block size, even though the value looks random. If a user creates recovery data for the same files and block size, the resulting recovery data has the same SetID and is compatible. In PAR3, the InputSetID will differ every time a user creates recovery data for the same files. Even though their recovery data could be compatible, a PAR3 client won't use data from another set with a different InputSetID.

For example, par2cmdline has an option -f<n> : First Recovery-Block-Number. With this option, a user can append more recovery blocks to existing PAR2 files, if the user specifies the same source files and block size. If the source files were changed, it will create incompatible recovery blocks with a different SetID. QuickPar and MultiPar have such an "extra blocks" feature, too.

Now, if par3cmdline has a similar feature, it would require a parent PAR3 file instead of re-specifying the input files. It's almost the same as using a parent PAR3 file for incremental backup, but it would inherit the parent's InputSetID and Start packet when the input files are unchanged. So it needs to verify the input files before creating additional recovery blocks. If the input files were changed after the first creation, it will fail to create extra blocks under the same InputSetID. At that point, the user may choose between repairing back to the original files or making an incremental backup for the changed files.

Anyway, PAR3 developers and users should be aware of this difference in behavior from PAR2. Even when a user creates PAR3 files for the same source files with the same settings, simply creating them multiple times will result in incompatible recovery files. (Though the recovery blocks themselves are compatible, they have different InputSetIDs.) A special feature is required to create extra recovery blocks with the same InputSetID.

mdnahas commented 2 years ago

I think the default value for a timestamp should be 0, because it is not a real-world value.

0 is a perfectly valid timestamp. In both UNIX and Windows formats!

2^64-1 is a better value.

I should add a note to the specification about times from the future.

According to the PAR3 spec, only 12 bits are used in the 16-bit field. Then, is it OK to set 0x8180 (regular file, owner read+write)...

Good catch! I've updated the spec to use 0x0180.

BTW, setting the UNIX i_mode to 0x0180 and FAT Attributes to all 0s are going to act like default values. That is, the decoding client cannot tell the default value from an "unset" value. I am okay with that right now. In most cases where that is a problem, the permission packets shouldn't be in the file at all. If you can present an expected usage where that is a problem, we can talk about changing it.

Some users requested a feature to keep the modification time of source files a while ago.

First, I agree with the users. If a file exists and has the correct data, I don't know why a Par2 client should modify it. The client should only have to read the file, not write it. But that's a decoding client behavior and has nothing to do with the file specification. I'd expect a Par3 client to behave the same if no Permissions Packets are present.

If timestamps differ between files, each file needs its own Permissions packet.

Yes, that is correct.

I feel it may be good to split the timestamps from the permissions, for example into two independent packet types: a Permissions packet and a Timestamp packet.

The overhead for packet header is 48 bytes. The timestamps are 24 bytes. The other permissions are usually going to be about 28 bytes.

So, in the current specification, there is 1 packet for each file and its length is 48+24+28 bytes. You propose to replace that with one packet that holds the permissions at length 48+28 bytes plus one packet per file holding the timestamps of length 48+24 bytes. Since the File packet overhead is already around 140 bytes, with a very large number of files you'd cut the per-file-overhead from 240 bytes to 212 bytes. I don't think it's a big enough win to be worth a different packet.

When I read how the InputSetID is calculated, I understood that the 8-byte value's randomness depends on another random 16-byte hash value. So I felt that CRC-64 might be enough to represent the packet body, instead of using the first 8 bytes of the Blake3 hash. Calculating a Blake3 hash of another Blake3 hash may be overkill. (Hashing twice does not add randomness.)

The Blake3 hash has uniqueness guarantees that the CRC-64 does not have. We need the InputSetID to be unique.

In PAR3, the InputSetID will differ every time a user creates recovery data for the same files.

First, it doesn't have to. Client authors can use a technique similar to Par2's calculation of the RecoverySetID. Line 368 of the Markdown version of the spec says: "Another method for generating a globally unique number is, if the input set's data is known ahead of time, is to use a fingerprint hash of the parent's InputSetID, block size, Galois field parameters, and all the files' contents."

But Par3 does allow a client to use a globally-unique random number. The reason for the change from Par2 to Par3 was that some client authors reported that Par2's value was too restrictive. In Par2, an encoding client had to hash all the files and file names first, to generate the RecoverySetID. Only then, could it write out the packets to the Par2 file. Moreover, if anything changed, even just a file name, the RecoverySetID changed.

For Par3, I wanted to add a little more freedom. The InputSetID could be the same value as with Par2, or it could be a globally-unique random number.

And, yes, that design impacts the usage. In Par2, for the same set of input files, every client would create the same packets. But that's not the case with Par3. But, given the other features (like deduplication), that wasn't going to be guaranteed anyway. (In fact, I think I need to rewrite the specification to make sure that there aren't clashes with deduplication.)

If a Par3 client has an existing Par3 file, it can create an incremental backup or just create new Recovery Packets (and possibly matrix packets) that reuse the existing InputSetID.

Yes, you are correct when you say "it needs to verify input files before creating additional recovery blocks." But, to create Cauchy-matrix recovery blocks, the client needs to read all the files anyway. So that isn't a big deal.

But, if we assume two clients create two different Par3 files using the same block size and Galois field, it is possible for a client to identify overlapping input blocks of common files and reuse the recovery information. The client would definitely be more complicated than the Par2 client, but it is possible. And, if a client supports incremental backup, it may not be too different.
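
As a small illustration of the mechanism discussed above, here is a sketch of deriving the InputSetID from the Start packet body, assuming the official BLAKE3 C library is available; per the spec text quoted earlier, the InputSetID is the first 8 bytes of the Blake3 hash of the Start packet body, regardless of whether that body holds a Par2-style fingerprint or a globally-unique random number:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "blake3.h"   /* official BLAKE3 C library, assumed to be available */

/* InputSetID = first 8 bytes of the BLAKE3 hash of the Start packet body. */
uint64_t compute_input_set_id(const uint8_t *start_packet_body, size_t body_len)
{
    uint8_t hash[BLAKE3_OUT_LEN];          /* 32-byte BLAKE3 output */
    blake3_hasher hasher;

    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, start_packet_body, body_len);
    blake3_hasher_finalize(&hasher, hash, BLAKE3_OUT_LEN);

    uint64_t id;
    memcpy(&id, hash, sizeof id);          /* take the first 8 bytes */
    return id;
}
```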

Yutaka-Sawada commented 2 years ago

While reading the GNU C Library manual to see the differences from the Microsoft C runtime library, I found an interesting point. The stat64() function gets a file's information. There are 3 timestamp items:

st_atime = access time
st_mtime = modification time
st_ctime = attribute modification time

The utime() function sets only 2 items (access time and modification time), and it automatically sets the attribute modification time to the current time. So there may be no reason to store ctime in the UNIX Permissions packet. Even if a PAR3 client reads a valid ctime from the packet, it cannot set that value on a real file anyway. Or is there another function that can edit ctime? Though I don't know much about Linux/UNIX, I feel that storing ctime will be useless.
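
A small sketch of that limitation using the standard POSIX calls (an assumption about how a client might restore times, not part of the spec): stat() reports atime, mtime, and ctime, but utime() can only restore the first two, and the call itself bumps ctime to the current time.

```c
#include <time.h>
#include <utime.h>

/* Restore access and modification times on a file.  There is no portable
 * call to set st_ctime: the kernel bumps it to "now" whenever metadata
 * changes (including this utime() call), which is why a stored ctime
 * cannot be written back to a real file. */
int restore_times(const char *path, time_t atime, time_t mtime)
{
    struct utimbuf times;
    times.actime  = atime;    /* access time */
    times.modtime = mtime;    /* modification time */
    return utime(path, &times);   /* 0 on success, -1 on error */
}
```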

I don't think it's a big enough win to be worth a different packet.

Oh, I see. You are right; that was a bad idea. Then I came up with what looks like a perfect solution for the FAT Permissions packet and a possibly good one for the UNIX Permissions packet. The FAT Permissions packet has 4 items, and the packet body size is 26 bytes. If the priority order of the items is fixed, it's possible to determine which items are present from the body size alone. The list of body sizes and contents is below;

26-bytes = CreationTimestamp, LastAccessTimestamp, LastWriteTimestamp, FileAttributes
24-bytes = CreationTimestamp, LastAccessTimestamp, LastWriteTimestamp
18-bytes = CreationTimestamp, LastWriteTimestamp, FileAttributes
16-bytes = CreationTimestamp, LastWriteTimestamp
10-bytes = LastWriteTimestamp, FileAttributes
8-bytes  = LastWriteTimestamp
2-bytes  = FileAttributes

Because the UNIX Permissions packet has more items, it's impossible to select each one individually. So it only splits timestamps from permissions. The list of body sizes and contents is below;

8-bytes   = mtime
16-bytes  = atime, mtime
24-bytes  = atime, ctime, mtime (But, I think ctime may be useless.)
more size = both timestamp and permissions

What do you think of this idea? A single packet type can store various combinations of items, and the packet's body size indicates which items are present. (A parsing sketch follows below.)
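
A rough sketch of the parsing side of this proposal (hypothetical structure and function names; field widths follow the current draft: 8-byte timestamps, 2-byte FileAttributes), selecting the present fields purely from the body size:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of the proposed variable-size FAT Permissions
 * packet body (8-byte timestamps, 2-byte FileAttributes, little-endian). */
typedef struct {
    int has_creation, has_access, has_write, has_attributes;
    uint64_t creation, access, last_write;
    uint16_t attributes;
} fat_perm;

/* Decide which fields are present purely from the body size, following the
 * priority order listed above.  Returns 0 on success, -1 for a size the
 * proposal does not define.  (Sketch: assumes a little-endian host.) */
int parse_fat_perm(const uint8_t *p, size_t size, fat_perm *out)
{
    memset(out, 0, sizeof *out);
    switch (size) {
    case 26: out->has_creation = out->has_access = out->has_write = out->has_attributes = 1; break;
    case 24: out->has_creation = out->has_access = out->has_write = 1; break;
    case 18: out->has_creation = out->has_write = out->has_attributes = 1; break;
    case 16: out->has_creation = out->has_write = 1; break;
    case 10: out->has_write = out->has_attributes = 1; break;
    case 8:  out->has_write = 1; break;
    case 2:  out->has_attributes = 1; break;
    default: return -1;
    }
    /* Present fields always appear in the same fixed order. */
    if (out->has_creation)   { memcpy(&out->creation,   p, 8); p += 8; }
    if (out->has_access)     { memcpy(&out->access,     p, 8); p += 8; }
    if (out->has_write)      { memcpy(&out->last_write, p, 8); p += 8; }
    if (out->has_attributes) { memcpy(&out->attributes, p, 2); }
    return 0;
}
```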

malaire commented 2 years ago

ctime is not meant to be set to any other time than current time. See e.g. https://unix.stackexchange.com/a/36105/182083

You cannot change the ctime by ordinary means. This is by design ...

malaire commented 2 years ago

btw, ctime is meant for backup tools to detect if file metadata or contents have changed since it can't be (easily) faked.

Could storing it help with incremental backups here?

mdnahas commented 2 years ago

Yes, mtime is the time the file's contents changed and ctime is when the file's metadata changed.

The ctime can only be set by (1) mounting the drive without mounting the file system or (2) changing the system time. (I think #2 is called a "time storm".) And we don't want clients doing either of those.

The "tar" program stores ctime. If either the mtime or the ctime on a file changes, tar will backup the file and its metadata.

I believe @malaire is right --- the ctime might be useful for incremental backups. If the ctime changes, we update the UNIX Permissions packet. (And, obviously, if the mtime changes, we update the file's contents.)

The other way to check if permissions changed would be to call stat() and getxattrs() for every file. That will probably be expensive with xattrs. Also, "tar" stores ctime and, when in doubt, I'll imitate "tar".

I will add a note to the specification about ctime.

@Yutaka-Sawada Your idea of various-sized permissions packets is interesting. But it requires picking an ordering of the fields by importance. That's easy for the FAT Permissions packet. I don't see an obvious ordering for the UNIX Permission fields.

mdnahas commented 2 years ago

Latest Markdown version of the specification.

Parity_Volume_Set_Specification_v3.0.md

Yutaka-Sawada commented 2 years ago

A chunk description starts with the length of the chunk and a fingerprint hash of the chunk's entire contents. If the chunk's contents are unknown and not protected by the recovery data, the fingerprint hash value is set to all zeroes. (This is used with the "Par inside" feature described below.)

When I tried to construct chunk information for files, I realized that the 16-byte hash of a chunk might be troublesome. When a chunk is equal to or larger than the block size, the External Data packets contain the hash of each block in the chunk. When a chunk is smaller than the block size or there is a tail, the hash of the last piece is saved in the File packet. So the hash of the chunk's entire contents isn't required to verify the file data.

When a whole file's data can be kept in memory, there is no problem in calculating chunk hashes. But it may be awkward to read and hash a chunk's data sequentially from first byte to last when the file is very large. File access order is the problem, because it wants to read the same offset of every input block at once, instead of reading each block incrementally. Technically it's possible to read the whole file once for the chunk hash and then read every block again, but reading the same file data twice is not fast. Also, I'm not sure it's worth calculating a hash value that has no later use.

Thus, I suggest using a 1-byte flag instead of the 16-byte hash. It would indicate the chunk usage: if the flag is 0, the chunk isn't protected (as with PAR inside); if the flag is 1, the chunk's hashes may be stored in External Data packets; if the flag is 2, the chunk's data may be stored in Data packets; if the flag is 3, there may be both External Data packets and Data packets. This would reduce packet size by 15 bytes, and PAR3 clients wouldn't need to calculate an additional redundant hash value.

mdnahas commented 2 years ago

This sounds like the same argument again. That there is a hash of each block, so we don't need a per-file hash. I believe we need per-file hashes. I've relaxed my position a little to have per-chunk hashes instead of per-file hashes, but am still having second thoughts about it.

The very first thing an archiving program should do when encoding is compute a checksum of the whole dataset. And the very last thing it should do when decoding data is verify that checksum.

Any extra calculation or extra complication to calculating that checksum is an opportunity for a bug to sneak in and for the client to report "the data is just fine!" when it isn't. That checksum is the most important guarantee to our users. If we ever say "the data is just like it was at the beginning" and it isn't, we've lost the trust of all our users.

I know that per-file hashes (or per-chunk hashes) are extra work. And they may require a second pass over the data, or a more complicated client to avoid two passes. But I think it is important and worth it.

As for the details, I'm not sure what you mean by "File access order is the problem, because it wants to read the same offset of every input block at once, instead of reading each block incrementally." I'm not sure what "it" is. Is it the per-chunk hash algorithm? Is it the per-input-block hash algorithm? Is it the calculation of the recovery data? Because I'm pretty sure all of those can be coded to pass over the file incrementally from start to finish. I don't know of any algorithm that has to process the first byte (or Galois field) of every input block at once, before going to the second byte (or Galois field) of every input block.

I don't know how every Blake3 library is coded, but I can believe that some require the data to arrive sequentially. That is, file has to be loaded in order. That may be a problem for clients that want to go very fast and have multiple CPUs available to calculate the hash. In that case, the encoding client might want to break the input file into N equal-sized chunks, so that each of the N CPUs can each calculate the hash for one chunk of the file.

Yutaka-Sawada commented 2 years ago

The very first thing an archiving program should do when encoding is compute a checksum of the whole dataset. And the very last thing it should do when decoding data is verify that checksum. That checksum is the most important guarantee to our users.

I understand what you are saying. Verifying a whole file with a checksum is important. But how to calculate that checksum at creation time is the problem. Calculating BLAKE3 of chunks felt wasteful to me while I was already calculating BLAKE3 of each input block in every chunk.

How about using plain CRC-32 for each file's checksum? Most archivers and checksum tools support CRC-32. (On the other hand, most tools don't support BLAKE3 yet.) Because CRC-32 is a different algorithm from the input block checksums, a file's CRC-32 would give additional confidence. (My real aim, though, is that CRCs can be joined during calculation.) A client would check CRC-64 and BLAKE3 of the input blocks, and CRC-32 of the input file as the final confirmation. That is worth checking, instead of computing BLAKE3 hashes of the same file data twice.

Also, users would be able to see raw CRC-32 values in the recovery set, and other tools could confirm that the file was read or verified correctly. One weak point of PAR3 is that its hash algorithms are not widely used. By using the popular CRC-32 as the file checksum, PAR3's results become comparable with other tools. Though MD5 was the common checksum in PAR2, we use BLAKE3 instead of MD5 in PAR3. While we know BLAKE3 is more reliable than PAR2's MD5 or CRC-32, it's difficult to check with third-party tools. So storing CRC-32 in PAR3 may be good for compatibility.

I've relaxed my position a little to have per-chunk hashes instead of per-file hashes, but am still having second thoughts about it.

If PAR3 used a 1-byte flag in chunk descriptions and CRC-32 as the per-file hash, we wouldn't need per-chunk hashes. In that case, the file's checksum would simply skip the bytes of non-protected chunks. Storing 16 bytes of zeros as a flag for a non-protected chunk is strange. By the nature of CRC, CRC-32 is well suited to per-file hashes. (It's easy to calculate on very large files at creation time.)
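
To show what "CRCs can be joined during calculation" means in practice, here is a sketch using zlib's crc32() and crc32_combine() (assuming zlib is available; PAR3 itself specifies CRC-64, so this only illustrates the combining property): the CRC of a whole file can be derived from the per-chunk CRCs and chunk lengths without re-reading any data.

```c
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>   /* crc32(), crc32_combine() */

/* crc32_combine(crcA, crcB, lenB) returns the CRC-32 of A || B, so folding
 * the per-chunk CRCs in file order yields the CRC-32 of the whole file. */
uint32_t crc32_of_whole_file(const uint32_t *chunk_crcs,
                             const uint64_t *chunk_lens, size_t chunk_count)
{
    uLong crc = crc32(0L, Z_NULL, 0);   /* CRC-32 of the empty string */
    for (size_t i = 0; i < chunk_count; i++)
        crc = crc32_combine(crc, chunk_crcs[i], (z_off_t)chunk_lens[i]);
    return (uint32_t)crc;
}
```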

malaire commented 2 years ago

Any CRC is unsuitable as a file checksum, as they are too short and have too many collisions (and are also easy to forge, but I'm not sure if that is an issue here).

Personally I think that every file must have a single cryptographic hash (at least 128 bits, but I'd prefer 256) which covers the whole file, so that it is trivial to check that the file was created correctly.

Yutaka-Sawada commented 2 years ago

I suggested CRC-32 for catching bugs or failures in the implementation. It doesn't need strength against forgery or collisions. BLAKE3 is already there for detecting errors or damage in the file data.

Personally I think that every file must have a single cryptographic hash

If it's a single hash to check the integrity of files, I think so, too. Because PAR3 has a BLAKE3 hash for every input block in a file, this case doesn't require cryptographic strength to confirm the correctness of the read file data or of the repaired result. But a client might fail to arrange the blocks properly or miss something, so a checksum over all the input blocks is calculated as the chunk's hash. For this limited purpose, I thought CRC-32 was suitable, because it's easy to concatenate.

All clients MUST be able to decode File packets with multiple chunk descriptions and perform recovery.

There is a reason why I wanted to shorten chunk descriptions. According to the PAR3 spec, a File packet may contain multiple chunk descriptions, and all PAR3 clients must handle them. While thinking about how to implement deduplication, I found that in a rare case it's difficult (or impossible) to load so many chunk descriptions into memory.

For example, take a 1 MB file of random data. Setting the block size to 1 byte produces 256 distinct input blocks and 1,048,320 duplicate blocks, so there may be up to 1,048,576 chunk descriptions in the File packet. Because one chunk description is 32 bytes now, 1,048,576 chunk descriptions consume around 32 MB. Even if you think 32 MB is small, a 1 GB file would produce a 32 GB File packet. I predict that most PAR3 clients won't be able to decode such a large packet. A small block size may result in many duplicated blocks. That is why I earlier suggested setting the minimum block size to 8 or 40 bytes.

mdnahas commented 2 years ago

We use the Blake3 hash for uniqueness. We use CRCs for rolling hashes.

We want the data, after decoding, to match what it was at the beginning. Uniquely. That is the entire purpose of having 1 hash for all of the data.

Thus, we use a tree of Blake3 hashes for it.

@malaire I agree that I would like every file to have a cryptographic hash. But Par3 doesn't protect all the data, in order to allow Par-in-Par. So, hashing the entire file was problematic. I settled on cryptographic hashes for each chunk. I feel that in most cases, each file will be one chunk.

Yutaka-Sawada commented 2 years ago

I came up with what I think is the best solution for one problem: the lack of a per-file checksum. Let me summarize the problem and our opinions again.

Design of PAR3

PAR3 treats an input file as an array of input blocks. Each input block is protected by a BLAKE3 hash and a CRC-64. For deduplication, incremental backup, or PAR inside, an input file may have one or more chunks, which consist of multiple input blocks or unknown data. Though each chunk is protected by a BLAKE3 hash, an input file has no checksum covering its entire data.

The construction of a file is like below;

When a file has single chunk,
file{ chunk(block, block, block, block, block, block, block, block, block) }

When a file has multiple chunks,
file{ chunk(block, block, block), chunk(block, block, block), chunk(block, block, block) }

When a file has multiple chunks and unknown data,
file{ chunk(block, block, block), chunk(block, block, block), chunk(block, block), chunk(unknown) }

My (Yutaka Sawada's) opinion

I like simple construction and speed. Because I believe the tree of BLAKE3 hashes protects the entire file data indirectly, I don't mind the lack of a file checksum. Even without a checksum of the file itself, the internal blocks' BLAKE3 hashes are enough to establish completeness. So I oppose the per-chunk checksum: I want to remove the chunk BLAKE3 hash as a troublesome factor. While the blocks' BLAKE3 hashes can prove the integrity of the file, an additional BLAKE3 hash per chunk should not be required.

Markus Laire's opinion

He wants every file to have a single cryptographic hash that covers the whole file. Because it's impossible to check PAR3's array of BLAKE3 hashes manually, a single hash is preferable for compatibility with other checksum tools. Some users requested a while ago that PAR3 include multiple kinds of hashes to check the integrity of source files. I understand their wish. There are some meticulous or paranoid users who keep hashes of their files; Parchive is their last hope when their files are broken.

Michael Nahas's opinion

While PAR3 protects file data with a tree structure of BLAKE3 hashes, a PAR3 client may still get the calculation wrong. So the BLAKE3 hash of each chunk acts as a check. He declined to remove the chunk checksum, against my suggestion.

Though he agreed that a per-file cryptographic hash is desirable, he could not put a file checksum in the File packet. For the "PAR inside" feature, a chunk may have no checksum because its data is unknown. When a chunk isn't protected, the whole file cannot be protected either. Because of that rare case, PAR3 contains no file checksum.

Idea of Checksum Packet

Now I suggest a new packet type: the Checksum packet. A Checksum packet contains one or more hashes of an input file. It is linked to an input file as an optional packet, in the same way as the UNIX Permissions packet. When a file has non-protected chunks, it cannot have a Checksum packet, because the file data is unknown. The Checksum packet acts as an extra packet for additional assurance. If a user wants to protect his files strongly, he can add his favorite cryptographic hashes in this packet, and PAR3 clients will be able to support many hash functions. Markus Laire would be satisfied with this solution.

The Checksum Packet has a type value of "PAR CHK\0" (ASCII). The packet's body contains the following:

*Table: Checksum Packet Body Contents*

| Length (bytes) | Type | Description |
|---------------:|:-----|:------------|
| ?*? | checksum descriptions | see below. |

*Table: Checksum Description Layout*

| Length (bytes) | Type | Description |
|---------------:|:-----|:------------|
| 1 | unsigned int | length of checksum description |
| ? | UTF-8 string | name of hash function (It's NUL-terminated as boundary.) |
| ? | checksum bytes | hash of the entire file |

For example:

MD5: Length = 20, Name = "MD5\0", Hash = 16 bytes.
SHA-1: Length = 26, Name = "SHA-1\0", Hash = 20 bytes.
SHA-256: Length = 40, Name = "SHA-256\0", Hash = 32 bytes.
SHA-512: Length = 72, Name = "SHA-512\0", Hash = 64 bytes.
SHA-3 (256-bit): Length = 38, Name = "SHA-3\0", Hash = 32 bytes.
SHA-3 (512-bit): Length = 70, Name = "SHA-3\0", Hash = 64 bytes.
Whirlpool: Length = 74, Name = "Whirlpool\0", Hash = 64 bytes.
BLAKE3 (128-bit): Length = 23, Name = "BLAKE3\0", Hash = 16 bytes.
BLAKE3 (256-bit): Length = 39, Name = "BLAKE3\0", Hash = 32 bytes.

It's possible to store arbitrary output lengths under the same name, since the stored length differs.

Change of Chunk Description

As described above, this solves the shortcoming that PAR3 cannot check a whole file with common hash functions in a compatible way. When a Checksum packet protects the file data with cryptographic hash functions, the chunk checksum doesn't need such a strong hash. So I suggest using CRC-32 for the chunk checksum instead of the current BLAKE3 hash. Though I wanted to remove the chunk checksum entirely, this is a compromise for Michael Nahas's concern about internal calculation failures.

CRC-32 can serve as the minimal error check for both chunks and the entire file. When a user wants speed and doesn't create a Checksum packet, the CRC-32 is shown as the checksum of the whole file. (A PAR3 client can combine the CRC-32s of the chunks to produce the CRC-32 of the entire file data.) As a file checksum, 4 bytes is very small, but it's better than nothing. (Currently, PAR3 has no file checksum at all.)

Because 32 bits is too small to reliably indicate a non-protected chunk, a 1-byte flag may still be required. Even with a 1-byte flag and a 4-byte CRC-32 per chunk, the total of 5 bytes is less than the 16 bytes of a BLAKE3 hash. A small chunk description keeps the packet size small.

mdnahas commented 2 years ago

Par3 is about safety not speed. Saying "I can make this much faster by giving up safety" is going against the total point of Par3. Saying "safety is optional" is going against Par3.

I care about speed, but it is a far secondary consideration far below safety.

We should be discussing how to put per-file hash back in. Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

Yutaka-Sawada commented 2 years ago

Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

Yes, we can, though the "Par inside another file" feature needs a tweak. When a user cannot check the "per-file hash" with another tool, the hash value is impossible to confirm, just like the "per-block hash" or "per-chunk hash". If a PAR3 client shows the "per-file hash" to a user, it should also have a feature to reconstruct the original "outside" file.

For example, suppose there is a ZIP file and its hash value. A user can check the integrity of the ZIP file by recalculating the hash. When a PAR3 client creates "PAR inside ZIP", the hash value changes. When a PAR3 client removes the "inside PAR" data from the ZIP file, the hash value becomes the same again. So PAR3 clients would have two repair modes for "PAR inside ZIP": reconstruct the original ZIP file (by removing the PAR3 packets), or repair the modified ZIP file (by keeping the PAR3 packets).

To support such a "remove PAR from outside file" feature, the "PAR inside" feature needs some restrictions. First, the concatenation of the protected chunks must equal the original outside file. When a PAR3 client puts PAR3 packets or extra data between or after the original file's data, those additional bytes must belong to an unprotected chunk. That way, another PAR3 client can join the protected chunks to reconstruct the original file when a user wants to check it against the "per-file hash" manually.

Yutaka-Sawada commented 2 years ago

While making packets, I found a problem with the order of packets.

"Order of Packets" in PAR3 spec is written as following;

In order for a client to do single-pass recovery, the recommended order of packets is:

  • Creator
  • Start
  • Matrix
  • Data or External Data
  • Root
  • Recovery Data

The File and Directory packets can appear in any place.

To do single-pass recovery, the File, Directory, and Root packets must come before the Data packets. Also, External Data packets should come before Data packets. Data packets contain the raw data of input blocks. A PAR3 client needs to know which input file an input block belongs to at recovery time, and it needs to know which input files are damaged before writing the input block. So a PAR3 client wants to read the file and directory tree information first.

By the way, the order of the File, Directory, and Root packets seems to be decided at creation time. Because a Directory packet includes the checksums of its child files and directories, a PAR3 client needs to make the children's packets first. So the creation order of packets must be: File packets -> Directory packets -> Root packet. Normally, the creation order and the writing order are the same.

Then, the recommended order of packets would be:

  • Creator
  • Start
  • Matrix
  • File
  • Directory
  • Root
  • External Data
  • Data
  • Recovery Data
mdnahas commented 2 years ago

Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

Yes, we can, though the "Par inside another file" feature needs a tweak. When a user cannot check the "per-file hash" with another tool, the hash value is impossible to confirm.

I don't expect another program, like "b3sum", to be able to confirm the hashes for files with Par-inside. In most cases, we're modifying those files. It's a rare enough case that I'm not going to worry about it.

I'll try to spend some time working on putting the per-file hash back into the specification.

Yutaka-Sawada commented 2 years ago

I found a limitation in my support of the PAR3 spec: the "Link Packet". It seems that the Microsoft C runtime library cannot distinguish between a real file, a hard link, and a symbolic link on Windows. I don't know how they are searched or enumerated. Because creating/removing links seems to require administrator privileges on Windows, it's difficult to use from normal applications like PAR3 clients. So I won't support the Link packet in my PAR3 client for Windows. Implementing the Link packet for Linux/UNIX will be another programmer's task.

mdnahas commented 2 years ago

To do single-pass recovery, the File, Directory, and Root packets must come before the Data packets. Also, External Data packets should come before Data packets. Data packets contain the raw data of input blocks. A PAR3 client needs to know which input file an input block belongs to at recovery time, and it needs to know which input files are damaged before writing the input block. So a PAR3 client wants to read the file and directory tree information first.

Ok.

By the way, the order of the File, Directory, and Root packets seems to be decided at creation time. Because a Directory packet includes the checksums of its child files and directories, a PAR3 client needs to make the children's packets first. So the creation order of packets must be: File packets -> Directory packets -> Root packet. Normally, the creation order and the writing order are the same.

"Normally making order and writing order is the same." is true, but that doesn't need to be required by the specification. The specification is about enabling a feature. And that order is not required to enable the feature. In fact, if I was writing the client, I'd probably do a depth-first-search of the directory tree, which means File packets and Directory packets would be mixed together.

For example: /A/a.txt /A/b.txt /c.txt would probably be written out: File Packet "a.txt", File Packet "b.txt", Directory Packet "A", File Packet "c.txt", Root Packet

So, File packets would be present before and after Directory packets.

Then, the recommended order of packets would be: ...

We need to think about what we mean by "single pass recovery". That is, what algorithm do we expect the decoding client to run.

1) After reading each Matrix packet, use the hints to allocate N buffers for recovery data. Zero out those buffers.
2) Read Data and External Data packets. Use the data and the Matrix packets to update the buffers.
3) Read the Recovery Data packets. Use the contents and the Matrix packets to update the buffers.
4) Determine the missing input blocks.
5) Invert the recovery matrix.
6) Multiply the buffers by the inverted recovery matrix, to generate the recovered blocks.
7) Write out the recovered blocks.

We have a few requirements for the code.

First, before anything, we need the InputSetID. That's in the Start packet, so that has to be first.

Second, to determine the missing input blocks, we need the File, Directory, and Root packets to be available before step 4.

Third, for an External Data packet, we will need to know the files to search for the data. And that means we need the File, Directory, and Root packets to be available before step 2.

So the recommended order of packets would be:

And the File/Directory/Root packets need to arrive before the External Data or Recovery Data packets.

BTW, this ordering will require some storage if the encoding client wants to make a single pass over all files. As the client passes over the input files, it will have to store the External Data packets as it calculates the files' checksums. At the end of each file, it can write the File packet with the file's checksum. Only after all the File packets are written (and the Root packet) can it write out the External Data packets.

It may be simplest to just drop the section. It is complicated to describe. And I'm not sure we need it. In some cases, it will be easy for the client to read the Par3 files into memory and, only afterwards, do a single pass over the input files. In the other cases, using memory efficiently will be more important than speed.

mdnahas commented 2 years ago

I found a limitation in my support of the PAR3 spec: the "Link Packet". It seems that the Microsoft C runtime library cannot distinguish between a real file, a hard link, and a symbolic link on Windows. I don't know how they are searched or enumerated. Because creating/removing links seems to require administrator privileges on Windows, it's difficult to use from normal applications like PAR3 clients. So I won't support the Link packet in my PAR3 client for Windows. Implementing the Link packet for Linux/UNIX will be another programmer's task.

That's fine. I can help.

Yutaka-Sawada commented 2 years ago

I split input files and input directories for independent sorting. I sort input directories from children to parent. This is good for making Directory Packets. I sort input files by tail size for efficient tail packing.

For example, there are 5 files:
size1100.txt (1100 bytes)
size1200.txt (1200 bytes)
size1300.txt (1300 bytes)
size1600.txt (1600 bytes)
size1800.txt (1800 bytes)

Total file size = 7000 bytes
Block size = 1000 bytes

In PAR2, these 5 files consume 10 source blocks. Total block data = 1000 * 10 = 10000 bytes. So, the efficiency of block usage is only 70%.

In PAR3, these 5 files may consume fewer source blocks thanks to the tail packing feature.
After inputting "size1100.txt", there is 1 full block and 1 tail block [100].
After inputting "size1200.txt", there are 2 full blocks and 1 tail block [100+200].
After inputting "size1300.txt", there are 3 full blocks and 1 tail block [100+200+300].
After inputting "size1600.txt", there are 4 full blocks and 2 tail blocks [100+200+300] [600].
After inputting "size1800.txt", there are 5 full blocks and 3 tail blocks [100+200+300] [600] [800].
Total block data = 1000 * 8 = 8000 bytes. So the efficiency of block usage is 87.5%. This is better than PAR2.

When I sort the files by tail size, the order is from big to small:
size1800.txt (1800 bytes -> tail 800 bytes)
size1600.txt (1600 bytes -> tail 600 bytes)
size1300.txt (1300 bytes -> tail 300 bytes)
size1200.txt (1200 bytes -> tail 200 bytes)
size1100.txt (1100 bytes -> tail 100 bytes)
After inputting "size1800.txt", there is 1 full block and 1 tail block [800].
After inputting "size1600.txt", there are 2 full blocks and 2 tail blocks [800] [600].
After inputting "size1300.txt", there are 3 full blocks and 2 tail blocks [800] [600+300].
After inputting "size1200.txt", there are 4 full blocks and 2 tail blocks [800+200] [600+300].
After inputting "size1100.txt", there are 5 full blocks and 2 tail blocks [800+200] [600+300+100].
Total block data = 1000 * 7 = 7000 bytes. So the efficiency of block usage is 100%! (A small sketch of this packing order follows.)
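
A rough sketch of the greedy packing order used in the example above (a hypothetical helper, assuming the tails are already sorted from largest to smallest and are placed first-fit into partially filled blocks):

```c
#include <stddef.h>
#include <stdint.h>

/* Greedy tail packing: tails[] must be sorted from largest to smallest.
 * Each tail goes into the first partially-filled block with enough free
 * space, otherwise it opens a new block.  Returns the number of tail
 * blocks used (or (size_t)-1 if the sketch's fixed capacity is exceeded). */
size_t pack_tails(const uint64_t *tails, size_t count, uint64_t block_size)
{
    uint64_t used[1024];   /* bytes used in each open tail block (sketch limit) */
    size_t blocks = 0;

    for (size_t i = 0; i < count; i++) {
        size_t b = 0;
        while (b < blocks && used[b] + tails[i] > block_size)
            b++;                       /* first block with enough room */
        if (b == blocks) {             /* no room anywhere: open a new block */
            if (blocks == 1024)
                return (size_t)-1;
            used[blocks++] = 0;
        }
        used[b] += tails[i];
    }
    return blocks;
}
```

With a block size of 1000 and tails 800, 600, 300, 200, 100, this returns 2, matching the example above.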

Though I sometimes thought the PAR3 spec might be too complex, "tail packing" is a remarkably bright feature. I was impressed when I implemented the mechanism and saw the result. (I was doubtful such a complex system was possible and had given up on it before, hehe.) Small files won't cause efficiency problems any more. PAR3 will be worth using for input files of varied sizes. Thanks, Michael Nahas, you did great work!

Yutaka-Sawada commented 2 years ago

When I wrote the Root packet, I saw the attributes field that indicates an absolute path. I felt that this item might be worthless. PAR clients will need to check and sanitize the path anyway.

From reading the "Clarifications and Commentary: Security" section, using absolute paths seems risky. Even when it requires the user's confirmation, some users won't understand the risk.

Now, there is a flag for absolute paths in the Root packet. If the flag says absolute path, my PAR3 client will just ignore the flag and treat the path as relative. Even when the flag says relative path, my PAR3 client still needs to check the path. An attacker won't set the flag honestly, so there is no reason to trust it.

I don't oppose the use of absolute paths in a PAR3 file. But the decision should not rely on an unknown PAR3 file nor on a user's choice. When someone wants to use absolute paths in his PAR3 files, he should use a special PAR3 client. Only when the PAR3 files' source is reliable should the PAR3 client accept absolute paths. When the PAR3 files are known in advance to use absolute paths, the flag isn't needed.

Yutaka-Sawada commented 2 years ago

When generating multiple Par3 files, it is expected that one file be generated without any Recovery Data packets and containing all the packets needed to verify correct transmission of the files. That is, contain the Start, External Data, File, Directory, and Root packets.

This file is generally called an Index File. It may be good to state that name officially.

The other files should either (1) include duplicates of all those vital packets or (2) use "Par inside Par" to protect damage to them.

I tried making Data packets as a test. It's possible to see what the input blocks are by reading the Data packets. PAR3 files with input blocks should include multiple copies of the vital packets, just like normal PAR3 files with recovery blocks. So let me refer to how much duplication PAR2 files use now.

PAR2's redundancy of critical packets;
number of blocks = 0 ~ 1 : number of copies = 1
number of blocks = 2 ~ 3 : number of copies = 2
number of blocks = 4 ~ 7 : number of copies = 3
number of blocks = 8 ~ 15 : number of copies = 4
number of blocks = 16 ~ 31 : number of copies = 5
number of blocks = 32 ~ 63 : number of copies = 6
number of blocks = 64 ~ 127 : number of copies = 7
number of blocks = 128 ~ 255 : number of copies = 8
number of blocks = 256 ~ 511 : number of copies = 9
number of blocks = 512 ~ 1023 : number of copies = 10
number of blocks = 1024 ~ 2047 : number of copies = 11
number of blocks = 2048 ~ 4095 : number of copies = 12
number of blocks = 4096 ~ 8191 : number of copies = 13
number of blocks = 8192 ~ 16383 : number of copies = 14
number of blocks = 16384 ~ 32767 : number of copies = 15
number of blocks = 32768 ~ 65535 : number of copies = 16

"Par inside Par" feature is difficlut to implement.

There is a problem: duplicating vital packets takes up a lot of space. "Par inside Par" was made to solve this. But I think the feature may be too complex. Basically, it needs to create PAR3 files several times: first it creates multiple PAR3 files for the input set, and then it creates "Par inside Par" for each PAR3 file. Though a PAR3 client could create the additional "Par inside Par" automatically, the behavior is very complex and will be difficult to implement.

Also, it's difficult to create a readable "Par inside Par". If packet boundaries don't align with the block size, PAR3 clients cannot read the "outside Par" while it is protected. It's very complex to set padding and prepare space for the "inside Par". Though the idea of "Par inside Par" is interesting and it can protect a whole PAR3 file, the implementation requires high programming skill.

Is it OK to reduce the number of copies of vital packets?

Therefore, most PAR3 clients will just repeat vital packets the same way as PAR2. But it would be good to reduce the amount of duplication. PAR2's packet repetition is the power-of-2 exponent plus 1: when there are 256 blocks, 256 = 2^8, so 8 + 1 = 9 copies. I feel that the power-of-4 exponent plus 1 would be enough. What do others think of a smaller number of copies, like below? (A small sketch of both rules follows after the table.)

PAR3's redundancy of critical packets;
number of blocks = 0 ~ 3 : number of copies = 1
number of blocks = 4 ~ 15 : number of copies = 2
number of blocks = 16 ~ 63 : number of copies = 3
number of blocks = 64 ~ 255 : number of copies = 4
number of blocks = 256 ~ 1023 : number of copies = 5
number of blocks = 1024 ~ 4095 : number of copies = 6
number of blocks = 4096 ~ 16383 : number of copies = 7
number of blocks = 16384 ~ 65535 : number of copies = 8
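
Here is a tiny sketch comparing the two repetition rules (PAR2: floor(log2(blocks)) + 1 copies; the proposal above: floor(log4(blocks)) + 1 copies):

```c
/* Number of copies of each vital packet as a function of the block count:
 * base = 2 reproduces the PAR2 rule above, base = 4 the proposed PAR3 rule. */
unsigned copies_of_vital_packets(unsigned long blocks, unsigned base)
{
    unsigned copies = 1;
    while (blocks >= base) {   /* floor(log_base(blocks)) + 1 */
        blocks /= base;
        copies++;
    }
    return copies;
}
```

For example, copies_of_vital_packets(256, 2) gives 9 and copies_of_vital_packets(256, 4) gives 5, matching the two tables.
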
Yutaka-Sawada commented 2 years ago

After suggesting reduced duplication of vital packets, I evaluated the risk of having fewer copies.

For example, a PAR3 Index File may consist of 100 vital packets. These vital packets are repeated multiple times and scattered throughout every PAR3 recovery file. When a PAR3 file includes 10-fold duplication, there are 1000 packets in the file. Although there are 900 additional packets, losing just the 10 copies of a single packet type disables the PAR3 file. If all 10 Start packets are missing, the other 990 packets become useless. So duplicating 10 times can only defend against 9 losses in the worst case.

Halving the duplication then results in less protection: in the above example, duplicating 5 times can only defend against 4 losses in the worst case. By reducing duplication, a PAR3 file will be smaller than a PAR2 file, but a weaker recovery file is bad.

Idea of XOR Packet

So I came up with a simple method that gives the same protection. My idea is to put an "XOR Packet" in each PAR3 recovery file. The "XOR Packet" is the XOR of all vital packets: Start packet, External Data packet, Matrix packet, File packet, Directory packet, and Root packet. Because the starting 8 bytes of all PAR3 packets are the same, the Magic sequence is excluded. The "XOR Packet" contains the number of vital packets and the XOR of the vital packets' bytes excluding the Magic sequence. The packet body size matches the largest vital packet.

The XOR Packet has a type value of "PAR XOR\0" (ASCII). The packet's body contains the following:

*Table: XOR Packet Body Contents*

| Length (bytes) | Type | Description |
|---------------:|:-----|:------------|
| 8 | unsigned int | Number of vital packets |
| ? | byte[]       | XOR of vital packets except Magic sequence |

"XOR Packet" is a parity of vital packets. If all packets of one type are missing, the packet can be recovered by XORing vital packets of other types. This method can recover a packet type until two types were lost. In the above example of duplicating 5 times, when all 5 Start Packets are missing, Start Packet can be recovered by other packet types. If all 5 Start Packets and 5 Root Packets were lost, they cannot be recovered. So, duplicating 5 times with "XOR Packet" can defend from 9 damages in worst case. This is the same protection power as PAR2 style duplication.

Because "XOR Packet" is small as compared to repeating all packets, the additional size won't be problem. Duplicating 10 times of 100 vital packets results in 100 10 = 1000 packets. Duplicating 5 times with "XOR Packet" results in (100 + 1) 5 = 505 packets. While PAR3 files will be smaller than PAR2 by less number of duplicated packets, they can defend from same number of damages.

Yutaka-Sawada commented 2 years ago

I tried to implement deduplication with 2 methods. The first method compares checksums of input blocks at block-aligned offsets. It can detect duplicated files in different directories. It's fast and uses little memory. The second method compares checksums of input blocks at any offset. I use a slide search with a rolling hash, which is mentioned in the PAR3 spec. I confirmed that CRC-64 is usable as a rolling hash. It's slow and uses additional memory of 2 blocks. Though I don't write their packets yet, it seems to work. But I might have made a mistake somewhere. I need to implement the verify or repair command to check the correctness of the packets.
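
For reference, a minimal sketch of the slide search with a rolling CRC-64. The polynomial (reflected CRC-64/XZ) and the zero initial value are assumptions for illustration, not necessarily the spec's exact parameters; the rolling trick only relies on a zero-initialized CRC being linear over GF(2):

```python
POLY = 0xC96C5795D7870F42          # reflected CRC-64/XZ polynomial (illustrative choice)

# Byte-wise CRC table; with init = 0 and no final XOR the CRC is linear over GF(2).
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def crc64(data: bytes, crc: int = 0) -> int:
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc

def sliding_crcs(data: bytes, window: int):
    """Yield (offset, crc) for every window-sized slice, updated in O(1) per step."""
    # roll[b] = CRC of byte b followed by `window` zero bytes; XORing it out
    # cancels the byte that leaves the window (linearity of the CRC).
    roll = [crc64(bytes([b]) + bytes(window)) for b in range(256)]
    crc = crc64(data[:window])
    yield 0, crc
    for i in range(len(data) - window):
        out_byte, in_byte = data[i], data[i + window]
        crc = (crc >> 8) ^ TABLE[(crc ^ in_byte) & 0xFF]   # append the incoming byte
        crc ^= roll[out_byte]                               # remove the outgoing byte
        yield i + 1, crc

# Sanity check: the rolled CRC matches a CRC computed from scratch at every offset.
data = bytes(range(256)) * 4
assert all(crc == crc64(data[off:off + 64]) for off, crc in sliding_crcs(data, 64))
```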

mdnahas commented 2 years ago

I don't oppose the usage of absolute paths in a PAR3 file. But the decision should not rely on an unknown PAR3 file nor on a user's choice. When someone wants to use absolute paths in his PAR3 files, he should use a special PAR3 client. Only when the PAR3 files' source is reliable should the PAR3 client accept absolute paths. When PAR3 files are known to use absolute paths in the first place, no flag is required.

Certainly, the default behavior should be to ignore absolute paths (or fail if they are present). My thought was that, to use absolute paths, the user would have to add an extra flag "--allow-absolute-paths".

For the reference implementation, I believe we should support absolute paths. It is part of the specification.

mdnahas commented 2 years ago

This file is generally called an Index File. It may be good to state this naming officially.

I'll do that in the next rewrite of the spec.

mdnahas commented 2 years ago

To calculate the number of copies of vital packets, you want to make sure that all of them survive with very, very high probability. If there are N copies of a vital packet and each has a 5% chance of not arriving, the probability of that packet not arriving is 0.05^N. If there are K different vital packets, the chance of failure is K × 0.05^N. Let's now replace 0.05 by a per-packet failure rate F and we have the formula K × F^N.

If we want the chance of failing to receive all of the vital packets to be less than 1 in a million, we want:

K*F^N < 0.000001.

F^N < 0.000001/K

log(F^N) < log(0.000001/K)

N*log(F) < log(0.000001) - log(K)

N > (log(0.000001) - log(K))/log(F)   (the inequality flips because log(F) is negative)

You can guess the individual failure rate F from the user's parameters. If the number of recovery blocks is 5% of the number of input blocks, F should be 5%.

If K=100 and F=1%, the number of copies should be 4.5, which rounds up to 5. If K=100 and F=2%, the number of copies should be 5.3, which rounds up to 6. If K=100 and F=5%, the number of copies should be 6.9, which rounds up to 7. If K=100 and F=10%, the number of copies should be 9.

If K=10 and F=5%, the number of copies should be 6.1, which rounds up to 7. If K=100 and F=5%, the number of copies should be 6.9, which rounds up to 7. If K=1000 and F=5%, the number of copies should be 7.7, which rounds up to 8. If K=10000 and F=5%, the number of copies should be 8.5, which rounds up to 9.

The number increases with the number of recovery blocks, only because a user with more recovery blocks expects more failures. The number increases with the number of vital packets, because more things can go wrong.
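
A minimal sketch of that calculation, with the acceptable overall failure probability as an explicit parameter (the one-in-a-million target is just one choice):

```python
import math

def copies_needed(num_vital_packets: int, per_packet_loss: float,
                  target_failure: float = 0.000001) -> int:
    """Smallest N with K * F^N < target, i.e. N > (log(target) - log(K)) / log(F)."""
    k, f = num_vital_packets, per_packet_loss
    n = (math.log(target_failure) - math.log(k)) / math.log(f)
    return max(1, math.ceil(n))

# e.g. K = 100 vital packets, F = 5% per-packet loss  ->  7 copies
print(copies_needed(100, 0.05))           # 7
print(copies_needed(100, 0.05, 1e-7))     # still 7 with a one-in-ten-million target
```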

mdnahas commented 2 years ago

I don't like the XOR packet idea. I know Par-inside-Par is difficult to implement, but I think it is a better way to do recovery of vital packets. The alternative to Par-inside-Par is repeating vital packets. It doesn't take many repetitions to get very good assurances that the vital packets survive.

mdnahas commented 2 years ago

Have you added any code to GitHub? I might be able to take a look at it next weekend.

Yutaka-Sawada commented 2 years ago

Idea for "Par inside", setting attributes in Root Packet

When I read that a Chunk Description in the File Packet may indicate the usage of "Par inside", I found a problem in the "Par inside Par" feature. While a PAR3 recovery set is protecting another PAR3 file, there are two InputSetIDs in the protected PAR3 file. Though they are distinguishable by their InputSetIDs, a client needs to check all chunks in the File Packet to determine which one is "inside" and which is "outside". The searching order to distinguish the "side" is like below: InputSetID -> Start Packet -> Root Packet -> File Packet -> Chunk Descriptions

If a PAR3 client finds another InputSetID, it needs to check that Start Packet. By testing the InputSetID and the 16-byte fingerprint hash in the other Start Packet, it can determine whether the different recovery set is its own child or independent. So the searching order to distinguish an incremental backup is like below: InputSetID -> Start Packet -> Root Packet

Then, I think a similar method is possible for "Par inside Par". There is an attributes item in the Root Packet. By using 1 bit of the attributes to indicate the "Par inside" feature, a PAR3 client will be able to know the "side". This will be much faster than reading all Chunk Descriptions.
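
A tiny sketch of the proposed check; the bit position and the attributes layout are hypothetical here, since the spec does not define such a flag yet:

```python
# Hypothetical flag: one bit of the Root Packet's attributes marks an "inside" recovery set.
ATTR_PAR_INSIDE = 0x01

def is_inside_set(root_attributes: int) -> bool:
    """Proposed fast path: one attribute test instead of walking every Chunk Description."""
    return bool(root_attributes & ATTR_PAR_INSIDE)
```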

Have you added any code to GitHub?

At this time, I have implemented creation of the Index File. Though I need to implement the Verify command to check the packet data, progress is very slow. I find many bugs every day. I will put up the current source code for Windows OS later.

Yutaka-Sawada commented 2 years ago

Problem in "incremental backup" feature

For verification, PAR3 clients need to handle packets of multiple SetIDs at once. Though the "incremental backup" feature is interesting and will be useful, there seems to be a fault in the mechanism. (Or I might misunderstand something.) First, I describe the intended flow of an "incremental backup" below.

For example, a PAR3 file is made for (FileA and FileB).

Input files : FileA, FileB
PAR3 file : PARfileAB.par3 (SetID = AB)

After adding FileC, "incremental backup" makes an additional PAR3 file. The child's SetID depends on the parent's SetID and the checksum of the parent's Root Packet. So I write them as AB and ABchild here.

Input files : FileA, FileB, FileC
Parent PAR3 file : PARfileAB.par3 (SetID = AB)
Child PAR3 file : PARfileAB+C.par3 (SetID = ABchild)

The child PAR3 file includes the parent's SetID. When a PAR3 client reads the Start Packet in "PARfileAB+C.par3 (SetID = ABchild)", it knows that packets of (SetID = AB) are required, too. Then the PAR3 client will read packets of both SetIDs.

Next, after adding FileD, "incremental backup" makes an additional PAR3 file. The grandchild's SetID depends on the parent (and grandparent).

Input files : FileA, FileB, FileC, FileD
Parent PAR3 file : PARfileAB.par3 (SetID = AB)
Child PAR3 file : PARfileAB+C.par3 (SetID = ABchild)
Grandchild PAR3 file : PARfileAB+C+D.par3 (SetID = ABgrandchild)

Even when there are 3 PAR3 files with different SetIDs, a PAR3 client will understand their relation and read packets of all 3 SetIDs. This is the intended good flow. I describe a bad flow below.

A user may "incremental backup" for ancestor PAR3 file, instead of the last descendant. In the above example, he may add FileD to (FileA and FileB) independently.

Input files : FileA, FileB, FileC, FileD
Parent PAR3 file : PARfileAB.par3 (SetID = AB)
First child PAR3 file : PARfileAB+C.par3 (SetID = ABchild)
Second child PAR3 file : PARfileAB+D.par3 (SetID = ABchild)

Or, he may add (FileC and FileD) to (FileA and FileB) instead of adding FileD to (FileA and FileB and FileC).

Input files : FileA, FileB, FileC, FileD
Parent PAR3 file : PARfileAB.par3 (SetID = AB)
First child PAR3 file : PARfileAB+C.par3 (SetID = ABchild)
Second child PAR3 file : PARfileAB+CD.par3 (SetID = ABchild)

Though the contents of "PARfileAB+D.par3" or "PARfileAB+CD.par3" are different from "PARfileAB+C.par3", their SetIDs become the same (SetID = ABchild). Because a child's SetID depends only on the parent, different children get the same SetID. PAR3 clients cannot distinguish packets by their SetID when the SetIDs are the same. This problem will confuse PAR3 clients.

I thought that using the checksum of the parent's Root packet might be a mistake. By storing the parent's checksum, the child's SetID loses its randomness. If it used a unique number instead of the checksum, like a normal Start Packet does, the SetID could be unique. To distinguish children, each child's SetID should be unique as well.
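
To illustrate the collision, a small sketch. Deriving the child's SetID by hashing only the parent's SetID and Root checksum is a simplification of the spec's construction, and BLAKE2 stands in for whatever fingerprint hash the spec uses:

```python
import hashlib

def child_setid(parent_setid: bytes, parent_root_checksum: bytes) -> bytes:
    """Child SetID derived only from parent data: all children of one parent collide."""
    return hashlib.blake2b(parent_setid + parent_root_checksum, digest_size=8).digest()

parent = (b"SetID_AB", b"checksum_of_AB_root")
setid_of_AB_plus_D  = child_setid(*parent)   # child that added FileD
setid_of_AB_plus_CD = child_setid(*parent)   # child that added FileC and FileD
assert setid_of_AB_plus_D == setid_of_AB_plus_CD   # collision: nothing child-specific enters the hash

def child_setid_unique(parent_setid: bytes, parent_root_checksum: bytes, nonce: bytes) -> bytes:
    """Mixing in a per-set random value (as proposed below) makes each child's SetID unique."""
    return hashlib.blake2b(nonce + parent_setid + parent_root_checksum, digest_size=8).digest()
```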

Yutaka-Sawada commented 2 years ago

There is at most one Root packet with a given InputSetID. The packet can be duplicated.

I have a question about the Root Packet. Is it possible for a PAR3 set to have multiple different Root Packets? I thought that there was only one kind of Root Packet in a PAR3 set, just as there is only one kind of Start Packet in a PAR3 set, or only one kind of Main Packet in a PAR2 set. But an online translator renders it as "There are one or more Root Packets, which have the specific InputSetID." I may misunderstand the English meaning. Because my English skill is not so high, it would be good to use clear words.

I understand that there may be multiple copies of Root Packets in a PAR3 file. When there are multiple Root Packets with the same SetID, are they duplicated copies of one kind of Root Packet, or may they be multiple different kinds of Root Packets? If one PAR3 set may have several roots of directory trees, the construction will be more complex than I thought.

mdnahas commented 2 years ago

Though they are distinguishable by their InputSetIDs, a client needs to check all chunks in the File Packet to determine which one is "inside" and which is "outside".

Why? The "outside" one will be protecting the file containing the "inside" one. So, you only need to check the File attributes to see if the file named is the same as the file being operated on. I don't think you need to inspect the Chunk Descriptions.


Yutaka-Sawada commented 2 years ago

I don't think you need to inspect the Chunk Descriptions.

I thought that "Par inside" feature would be able to correct a misnamed file. When a PAR filename was not changed, the name is same as the "inside" file. So, it doesn't need to check Chunk Descriptions, as you mentioned. When a PAR filename was changed and there is only one File Packet under the Root Packet, it needs to check Chunk Descriptions in the File Packet. Because I made par3cmdline to keep all vital packets on memory at first, there is no problem technically. I will be tired to write complex code for all possible rare cases. At this time, priority of implementing "Par inside" feature is low.

mdnahas commented 2 years ago

Though the contents of "PARfileAB+D.par3" or "PARfileAB+CD.par3" are different from "PARfileAB+C.par3", their SetIDs become the same (SetID = ABchild). Because a child's SetID depends only on the parent, different children get the same SetID. PAR3 clients cannot distinguish packets by their SetID when the SetIDs are the same. This problem will confuse PAR3 clients.

I thought that using the checksum of the parent's Root packet might be a mistake. By storing the parent's checksum, the child's SetID loses its randomness. If it used a unique number instead of the checksum, like a normal Start Packet does, the SetID could be unique. To distinguish children, each child's SetID should be unique as well.

You're right. Good catch.

I had thought some of this through --- the recovery data packet refers to a specific Root packet --- but I forgot that there might be two different Root packets in different files with the same InputSetID. Yes, each Root packet should have a unique InputSetID.

I'll probably do it as a unique random number field in the Start Packet. So the Start packet will have:

8-byte Random value
8-byte Parent InputSetID (or zeros)
16-byte hash of Parent's Root packet (or zeroes)
1-byte Galois field length
?-byte Galois field generator

with the InputSetID being a hash of the contents.

I considered making InputSetID just a random value, but I like the chaining of hashes. That way, a client cannot generate a "child" Par3 file unless it has the hash of the Root packet of the "parent" Par3 file.
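
A minimal sketch of that proposal, packing the fixed fields with struct; BLAKE2 again stands in for the spec's fingerprint hash, and the Galois field values shown are placeholders:

```python
import os
import struct
import hashlib

def start_packet_body(parent_setid: bytes = bytes(8),
                      parent_root_hash: bytes = bytes(16),
                      gf_size: int = 2,
                      gf_generator: bytes = b"\x0B\x10") -> bytes:
    """Pack the proposed fields: random value, parent IDs, Galois field description.

    The default generator bytes (low 16 bits of PAR2's 0x1100B polynomial) are only a
    placeholder; zeroed parent fields mean "no parent", i.e. not an incremental backup.
    """
    random_value = os.urandom(8)
    return struct.pack("<8s8s16sB", random_value, parent_setid, parent_root_hash, gf_size) + gf_generator

def input_set_id(body: bytes) -> bytes:
    """InputSetID as a hash of the Start packet's contents (the hash choice is a placeholder)."""
    return hashlib.blake2b(body, digest_size=8).digest()

print(input_set_id(start_packet_body()).hex())
```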

mdnahas commented 2 years ago

There is at most one Root packet with a given InputSetID. The packet can be duplicated.

I have a question about the Root Packet. Is it possible for a PAR3 set to have multiple different Root Packets? I thought that there was only one kind of Root Packet in a PAR3 set, just as there is only one kind of Start Packet in a PAR3 set, or only one kind of Main Packet in a PAR2 set. But an online translator renders it as "There are one or more Root Packets, which have the specific InputSetID." I may misunderstand the English meaning. Because my English skill is not so high, it would be good to use clear words.

That is a bad translation. "at most one" means "<= 1" which is the same as "0 or 1". The translation is bad because it makes you think it is ">= 1".

I said "<= 1" because the Root packet may or may not be in a particular Par3 file. But, yes, there should be 1 Root packet for the InputSetID, just like the Main Packet in Par2.

I will try to find simpler language.

There is only 1 Root Packet for the InputSetID. And, yes, like Par 2, you can have duplicates. So, if you find multiple Root Packets with the same InputSetID, they should be byte-for-byte copies of each other.

mdnahas commented 2 years ago

I've updated the specification.

Parity_Volume_Set_Specification_v3.0.md

Yutaka-Sawada commented 2 years ago

There is only 1 Root Packet for the InputSetID.

Thank you for the answer. A single root is easier to implement. The combination of "InputSetID and checksum of Root Packet" seems to act like the Recovery Set ID in PAR2. Though the InputSetID (8 bytes) is shorter than the Recovery Set ID (16 bytes), this combination (24 bytes) is longer. It should be good for distinguishing different PAR3 sets from each other.

In the new PAR3 spec;

The second field is also used for an "incremental backup". It holds the checksum of the parent's Root packet, or, if there is none, all zeroes.

There is a typo: this line should say "third field".

The first field is a globally unique number. The globally unique number ensures the InputSetID (explained below) is unique.

I feel that there is no value in storing this number in the Start Packet. If the value is used only to calculate the InputSetID at creation time, there is no need to keep it for later use.

Also, I think the calculation method is complex. Before making PAR3 packets, a PAR3 client must first prepare a globally unique number. Instead of hashing the body of the Start Packet, just using this random number as the InputSetID is simpler. When a PAR3 client already has a unique number, its randomness would be enough for the InputSetID, too.

Or is there any usage of the globally unique number in the Start Packet at verification time? Will a PAR3 client need to confirm the origin of the InputSetID by hashing the body of the Start Packet? Because the Start Packet has its own checksum to detect damage, there may be no value in checking again.