Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Working on major Par2 changes. Name? #130

Open mdnahas opened 5 years ago

mdnahas commented 5 years ago

Hi everyone,

I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:

  1. Reed-Solomon encoding with James S. Plank's correction
  2. Tornado Codes by Luby

I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?

I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.

The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use some form of Linear Congruential Generator.
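For illustration only, here is a minimal sketch of that kind of predictable generator (the constants are Knuth's well-known MMIX example values, not anything decided for the spec):

```cpp
#include <cstdint>

// Minimal 64-bit linear congruential generator: state' = a*state + c (mod 2^64).
// Constants are Knuth's MMIX example values; the spec would have to fix its own.
struct Lcg64 {
    uint64_t state;

    explicit Lcg64(uint64_t seed) : state(seed) {}

    // Advance the state and return the next pseudo-random value.
    uint64_t next() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return state;
    }
};
```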

The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?

When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.

Mike

mdnahas commented 2 years ago

There are a bunch of new systems programming languages. Does anyone know about Rust or Zig or Nim?

The reference implementation is meant to be correct, accessible, and readable. C and C++ programs are definitely accessible (most programmers have C and C++ compilers installed on their systems) and most are readable (most programmers can follow most C and non-complicated C++ programs). But it is hard to make programs in C and C++ correct. It's sooo easy to make a mistake.

I have a feeling that we will use C++, but I thought I'd ask to see if anyone has an opinion.

And, if you have time, I'm still waiting on feedback for the latest Par3 spec.

Yutaka-Sawada commented 2 years ago

Thanks, Michael Nahas, for the PAR3 spec. When I read the text, I found 2 odd points (maybe typos).

Line 253 in "Table: Start Packet Body Contents"; 8 unsigned int The size of the Galois field in bytes.

An 8-byte integer seems oversized. It may be a typo for a 1-byte integer, which can represent a number up to 255.

Line 467 for "Directory packet"; The File packet has a type value of "PAR DIR\0" (ASCII).

This is a simple typo.

I came up with an idea about the content of the "Creator packet". It may be good to store the number of packets in the file, too. When a client reads a PAR3 file, it can determine that the file is damaged by finding fewer packets. Then the client may read another PAR3 file to find a lost packet. Though the number of packets doesn't confirm the integrity of the PAR3 file, it will be a hint when parsing fails.

animetosho commented 2 years ago

Congrats on getting so far!

A single file could reuse the same chunk. A file is a list of chunks, so there's no reason the chunks cannot be repeated

Looking at the specification for chunks, I'm trying to understand how it really differs from blocks. It seems like if you made one chunk = one block, the only downside, compared to choosing something larger, would be a larger File packet?
The only other thing I can see the chunking concept bring is the ability to cut a block midway?

(oh and the embedded Parchive thing, which I'm ignoring for now)

That's how we have the best assurance that the output data is the same as the input data

Simple specifications or code doesn't grant assurances - testing and validation do.

UNIX files often have a lot of permissions in common. But they rarely share atime, ctime, and mtime. Is it worth allowing UNIX Permissions packets to set permissions for multiple files/directories? Should I have unique atime/ctime/mtime for each?

File times are something that's commonly recognised across all file systems these days. I don't think times belong in permissions packets - for one, they aren't even related to permissions, and it goes against your aim of dedup'ing permission data, as you point out.

It would also remove the weird scheme for times on FAT file systems (there's no need to follow FAT specifications on that).

Is FAT + links okay for NTFS? If not, I'm going to ask you to design the NTFS Permissions packet.

I think what's already there is ambitious enough and I think most archiving programs don't even bother.

There are a bunch of new systems programming languages. Does anyone know about Rust or Zig or Nim?

I've only heard of them, not actually used any of them. Zig aims to be a C replacement, whilst Rust a C++ replacement. I think Nim somewhat aims at higher abstractions, given it has a GC and acts a little more like a scripting language, but can be used for systems programming.

Zig feels a little immature at this stage, but I think Rust has been around long enough to not be of any concern. I don't know enough about Nim to make any judgement, but I know it's been around for a while as well.

C/C++ is definitely more ubiquitous than any of these though. Having said that, I don't see language as that much of a concern (heck, my client is written in Javascript).

Personally, if I were writing a new client, I'd seriously look into Rust as I'm not a fan of C++, but that's only my personal opinion. If you're more comfortable with using C++, I see little reason to use anything else.


support any Galois field that is a multiple of 2^8

Should probably be "power of", as 2^9 is technically a multiple of 2^8

Every byte of a Par3 file is specified.

I think you forgot to take that out.

Sparse Random Matrix

Is the RNG specified?

length of string

For these fields scattered across the spec, it should clarify that this refers to the byte length of the string, as opposed to the character length (which is what I presume to be the case).

Some file systems, like EXT4 and NTFS, support a directed acyclic graph or "DAG"

"Hard links" is probably much easier to understand by most developers than compsci graph theory terminology.

Note: Windows has 2 forms of absolute paths: "C:\dir\file.txt" and "\dir\file.txt". The second one refers to a file on the current drive.

At first, I thought the second example was a mistyped UNC path. "\dir\file.txt" isn't really an absolute path as it's relative to the current drive (as opposed to the current directory).

? {UTF-8 string} path where NUL is the separator character

I think '/' is recognised as a separator character everywhere these days, so you could just use that instead of null bytes (which is a little unusual).
Qt uses / as the standard directory separator.

The fingerprint hash is the location of the link, either a Directory packet or a Root packet.

There's a bit of a problem if multiple directories have the same hash, as there's no way to discern which directory is the correct one.

Is there a way to distinguish between symbolic and hard links?

This packet represents a node in a tree containing all the file-system specific packets.

It's not clear to me what 'node' or 'tree' is referring to here.

Also of note is that the Unix permissions has a list of directories/files, whilst FAT permissions don't reference anything. I'm not sure if one of them is a mistake.

The lowest input block index that went unused by the parent is written into the parent's Root packet.

Do you mean the child's Root packet?

mdnahas commented 2 years ago

Thanks, Michael Nahas, for the PAR3 spec. When I read the text, I found 2 odd points (maybe typos).

Line 253 in "Table: Start Packet Body Contents"; 8 unsigned int The size of the Galois field in bytes.

An 8-byte integer seems oversized. It may be a typo for a 1-byte integer, which can represent a number up to 255.

Yes. It was a hold-over from 8-byte aligned values. It is now 2 bytes.

Line 467 for "Directory packet"; The File packet has a type value of "PAR DIR\0" (ASCII).

This is a simple typo.

Fixed.

I came up with an idea about the content of the "Creator packet". It may be good to store the number of packets in the file, too. When a client reads a PAR3 file, it can determine that the file is damaged by finding fewer packets. Then the client may read another PAR3 file to find a lost packet. Though the number of packets doesn't confirm the integrity of the PAR3 file, it will be a hint when parsing fails.

I don't think it is necessary. It also means the Creator packet has to be written last. And, being a hint, it isn't that useful.

mdnahas commented 2 years ago

Congrats on getting so far!

Thanks.

A single file could reuse the same chunk. A file is a list of chunks, so there's no reason the chunks cannot be repeated

Looking at the specification for chunks, I'm trying to understand how it really differs from blocks. It seems like if you made one chunk = one block, the only downside, compared to choosing something larger, would be a larger File packet? The only other thing I can see the chunking concept bring is the ability to cut a block midway?

The chunks are for recording overlapping data between files. Imagine we have two files A and B, where there is a long region of overlapping data. So A might be: "abc1234567890def" and B is "tuvw1234567890xyz". The overlapping data is "1234567890", but each file has a unique prefix and suffix. The overlapping data is at different locations in each file, because B's prefix is longer than A's. So, encoding the files as fixed-size blocks cannot deduplicate the overlapping data. To get the optimal storage, we need to store the prefixes in something that is variable length, that is, something that allows us to "cut a block midway". So, we'd end up with the chunk "1234567890" along with the chunks "abc", "def", "tuvw" and "xyz".

So, a "chunk" is meant to be a contiguous piece of unique data. And files are made by concatenating chunks together.

Chunking also lets us encode "incremental updates" more compactly. You can imagine that A was the first version of a file and B was the second version. A can be encoded in 1 chunk made of many blocks. B can then be encoded in 3 chunks, where the middle chunk reuses blocks from A's chunk. B's overlapping chunk will actually be easy to calculate, since a client can calculate a rolling hash over all of file B and find blocks that are identical to the ones in file A.

Yes, you could encode each block in its own chunk. But that really doesn't add much and it makes File packets huge. If you wanted to simplify the client, you could just make each File packet have 1 chunk description and never reuse blocks.

(oh and the embedded Parchive thing, which I'm ignoring for now)

That's how we have the best assurance that the output data is the same as the input data

Simple specifications or code doesn't grant assurances - testing and validation do.

... and it's much easier to write testing and validation code for a simple specification, isn't it? ;) Also, testing and validation test what the programmer thinks the specification is, and a simpler specification is easier to understand and harder to misunderstand. With simpler things, there are fewer things to go wrong in the first place. I always choose the simplest design.

UNIX files often have a lot of permissions in common. But they rarely share atime, ctime, and mtime. Is it worth allowing UNIX Permissions packets to set permissions for multiple files/directories? Should I have unique atime/ctime/mtime for each?

File times are something that's commonly recognised across all file systems these days. I don't think times belong in permissions packets - for one, they aren't even related to permissions, and it goes against your aim of dedup'ing permission data, as you point out.

Times are not universal. Linux has the option to disable "atime" and I do it on my "/tmp" partitions. Unix times are signed values since 1970. FAT uses unsigned values since 1980. NTFS uses unsigned values since 1601. So, times are weird.

It would also remove the weird scheme for times on FAT file systems (there's no need to follow FAT specifications on that).

Maybe. I think FAT's resolution is 2 seconds and exFAT supports 10 ms. The exFAT file system stores the times in separate fields. I could put it in a single field, but then the field might be 6 bytes long, which is weird.
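For a sense of the arithmetic involved, here is a rough sketch of the epoch conversions (illustration only, not spec layout; it ignores time zones and leap seconds):

```cpp
#include <cstdint>

// 11644473600 seconds lie between 1601-01-01 (Windows FILETIME epoch) and
// 1970-01-01 (Unix epoch); 315532800 seconds lie between 1970-01-01 and
// 1980-01-01 (FAT/DOS epoch).

// Unix seconds -> FILETIME (100-nanosecond ticks since 1601, unsigned).
uint64_t unix_to_filetime(int64_t unix_seconds) {
    return static_cast<uint64_t>(unix_seconds + 11644473600LL) * 10000000ULL;
}

// Unix seconds -> FAT-style count of 2-second units since 1980.
uint32_t unix_to_fat_2s(int64_t unix_seconds) {
    return static_cast<uint32_t>((unix_seconds - 315532800LL) / 2);
}
```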

Is FAT + links okay for NTFS? If not, I'm going to ask you to design the NTFS Permissions packet.

I think what's already there is ambitious enough and I think most archiving programs don't even bother.

Ok. It's good to hear that.

In answering your previous question, I got reminded that NTFS's file times are different, but I think I can accept that corner case.

There are a bunch of new systems programming languages. Does anyone know about Rust or Zig or Nim?

I've only heard of them, not actually used any of them. Zig aims to be a C replacement, whilst Rust a C++ replacement. I think Nim somewhat aims at higher abstractions, given it has a GC and acts a little more like a scripting language, but can be used for systems programming.

Zig feels a little immature at this stage, but I think Rust has been around long enough to not be of any concern. I don't know enough about Nim to make any judgement, but I know it's been around for a while as well.

C/C++ is definitely more ubiquitous than any of these though. Having said that, I don't see language as that much of a concern (heck, my client is written in Javascript).

Personally, if I were writing a new client, I'd seriously look into Rust as I'm not a fan of C++, but that's only my personal opinion. If you're more comfortable with using C++, I see little reason to use anything else.

I'm thinking of two programs. One is the "reference client". It should be easy to understand and correct. I don't expect it to be very fast. The other program is an open-source library and command-line program that tries to be "fast". That is, it includes optimizations like multiple threads, processor-specific instructions, etc. I definitely think we should consider Rust for the high-performance open-source program. But it might be better to use C++ for the reference implementation.

support any Galois field that is a multiple of 2^8

Should probably be "power of", as 2^9 is technically a multiple of 2^8

Yes, you're right. Changed.

Every byte of a Par3 file is specified.

I think you forgot to take that out.

Sorry. It's removed now.

Sparse Random Matrix

Is the RNG specified?

Not yet. I'm looking at 4 possibilities: Mersenne Twister, XorShift, PCG, and xoroshiro128++.

The Mersenne Twister generator is old, well known, and well tested. There are many libraries implementing it. It's slow and large and not especially great. But it is standard and respected.

XorShift generators are fast and pretty good.

PCG generators got a lot of press. The creators made it easy to understand. It's pretty fast. The creators went a weird route and tried to sell it to programmers before publishing a research paper. It has some opponents.

xoroshiro128++ and its variants come from the opponents. It is based on the XorShift generators. Obviously, the PCG creators attack it in return.

I'm considering the specific PCG generator "PCG-XSL-RR", which has a 16-byte state and an 8-byte output. The PCG generators have a feature that you can "fast forward" in them easily. That is, if you want the 4000th random number, you don't have to generate the 3999 before it. You can go straight to number 4000. This particular one has a very long period (2^128) and generates 8-byte values at a time.

I don't think we need a 16-byte seed value in the packet. I'd like to just stick with 4 or 8 bytes. There is another PCG generator called "PCG-RXS-M-XS" that has an 8-byte state, but that also means it repeats after 2^64 values. And, given that a random matrix could contain 2^128 values, that seems small. So, I may use the "PCG-XSL-RR" generator, but only seed the generator's state's lowest 8 bytes.

If you have any expertise in this area, I'm open to changing my mind.
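To show what "fast forward" means, here is a sketch for a plain 64-bit LCG, since PCG's state transition is the same kind of LCG underneath (the output permutation is applied on top). The constants here are caller-supplied placeholders, not spec values:

```cpp
#include <cstdint>

// Jump an LCG (state' = a*state + c, mod 2^64) ahead by n steps in O(log n)
// using square-and-multiply on the affine transform (Brown's "Random Number
// Generation with Arbitrary Strides" technique, also used by the PCG library).
uint64_t lcg_jump_ahead(uint64_t state, uint64_t a, uint64_t c, uint64_t n) {
    uint64_t acc_mult = 1, acc_plus = 0;   // accumulated transform (identity)
    uint64_t cur_mult = a, cur_plus = c;   // transform for the current power of two
    while (n > 0) {
        if (n & 1) {
            acc_mult = acc_mult * cur_mult;
            acc_plus = acc_plus * cur_mult + cur_plus;
        }
        cur_plus = (cur_mult + 1) * cur_plus;  // compose the transform with itself
        cur_mult = cur_mult * cur_mult;
        n >>= 1;
    }
    return acc_mult * state + acc_plus;        // state after n steps
}
```

So a client that needs the 4000th random value can compute the state directly instead of stepping 3999 times.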

length of string

For these fields scattered across the spec, it should clarify that this refers to the byte length of the string, as opposed to the character length (which is what I presume to be the case).

Great suggestion! Done.

Some file systems, like EXT4 and NTFS, support a directed acyclic graph or "DAG"

"Hard links" is probably much easier to understand by most developers than compsci graph theory terminology.

Yes.

Note: Windows has 2 forms of absolute paths: "C:\dir\file.txt" and "\dir\file.txt". The second one refers to a file on the current drive.

At first, I thought the second example was a mistyped UNC path. "\dir\file.txt" isn't really an absolute path as it's relative to the current drive (as opposed to the current directory).

Windows documentation calls it an absolute path.

? {UTF-8 string} path where NUL is the separator character

I think '/' is recognised as a separator character everywhere these days, so you could just use that instead of null bytes (which is a little unusual). Qt uses / as the standard directory separator.

We had trouble in Par2 with "/". par2cmdline got it wrong and used "\" instead some of the time.

The fingerprint hash is the location of the link, either a Directory packet or a Root packet.

There's a bit of a problem if multiple directories have the same hash, as there's no way to discern which directory is the correct one.

Nice catch!!! And crap, that will be hard to fix.

The problem also affects permissions. There could be two empty directories with the same name but different permissions. Crap!

Is there a way to distinguish between symbolic and hard links?

Ooops! When I changed the packets, I dropped the byte that said soft/hard. Good eyes!

This packet represents a node in a tree containing all the file-system specific packets.

It's not clear to me what 'node' or 'tree' is referring to here.

Got it. I'll rewrite.

Also of note is that the Unix permissions has a list of directories/files, whilst FAT permissions don't reference anything. I'm not sure if one of them is a mistake.

FAT is the mistake. Fixed.

The lowest input block index that went unused by the parent is written into the parent's Root packet.

Do you mean the child's Root packet?

No, I mean parent's. But "is written into" is definitely the wrong way to express what I meant. I'll change it to "can be found in"

mdnahas commented 2 years ago

So, I'm thinking of how to solve the tree-of-checksums problem. That is, if we have an empty directory or empty file with the same name in two different locations, the File or Directory packets will be the same and we cannot tell them apart. That doesn't work if the files/directories have different permissions or if one directory holds a hard/symbolic link.

I think the best solution is to include the hash of the Permissions packets and/or Link packets in the File/Directory packets. That way, the File/Directory packets will be different. It means adding a counter to the File packet for "options", but it seems like the easiest solution. Otherwise, we're adding complication either to make every File/Directory packet unique or to use hash-of-paths to identify places in the directory tree.

mdnahas commented 2 years ago

I've also thought a bit about permissions. I only wanted to add basic support for those kinds of file-system-specific features. There are other programs that are better suited to encoding them (like "tar"). FAT and UNIX both store create/modify/access timestamps, and using 64-bit nanosecond timestamps is probably "good enough". FAT uses 4 permission bits; UNIX uses 12 bits. We could store that pretty simply in each File packet.... But I'm worried about violating users' expectations. The FAT and UNIX permission bits don't really overlap. And FAT has a unique aspect: timezones. And UNIX's permission bits only make sense if we also store its unique features: owner, group, and the extended attributes, "xattr". I thought "xattr" were rare, but I checked and they are used by DropBox, Chromium and other non-rare programs. So, while we could try to do less (just times and permission bits) and use much less storage, I think it might violate users' expectations. So, I think we'll keep the current design.

Yutaka-Sawada commented 2 years ago

That is, if we have an empty directory or empty file with the same name in two different locations, the File or Directory packets will be the same and we cannot tell them apart. Otherwise, we're adding complication either to make every File/Directory packet unique or to use hash-of-paths to identify places in the directory tree.

How about setting a unique index number (8-byte integer) for each file and directory? The index would be a serial number assigned to input files as they are scanned while creating recovery data. Because the number is used only to distinguish them from each other, the order doesn't matter. The indexing isn't a complex task. Each File Packet and Directory Packet contains its own index number. Then files with the same name (or content) are distinguishable by the index, like: File[0], File[1], File[2], ...

Also, the Directory Packet may contain the index numbers of child files and sub-directories. Because the index is an 8-byte integer, it consumes less space than a 16-byte fingerprint hash. Even when a user later changes a filename or permission of an input file, it won't affect the index number. The Root Packet, Link Packet, and some Permissions Packets may contain unique index numbers instead of checksums of files, too.

By using index numbers, incremental backup will become easier. When I change the filename in a File Packet, the checksum of the File Packet changes, too. If the Directory Packet contains unique index numbers instead of checksums (fingerprint hashes), I don't need to update the Directory Packet, which is the parent of the renamed file. But I need to change the Root Packet to record the change anyway. Hmm, this idea needs refinement.

mdnahas commented 2 years ago

How about setting a unique index number (8-byte integer) for each file and directory? ...

The tree-of-checksums is important. We want the checksum in the root packet to be a checksum of all the data sent.

It also forces correct calculation. As your example showed, without it, a directory could change without the root changing.

animetosho commented 2 years ago

So, a "chunk" is meant to be a contiguous piece of unique data. And files are made by concatenating chunks together.

I think I get it - it sounds more like they're basically virtual files. They're treated similar to how files are treated in PAR2.

Chunking also lets us encode "incremental updates" more compactly.

Your example sounds like it'd only work if B completely contains A's contents. If there's changes in the middle of the chunk, it doesn't sound like you can easily split a chunk into multiple.

Perhaps you could try splitting the single chunk on block boundaries, and accept some inefficiency (which also eliminates some benefits of chunking). The chunk hash might cause some trouble, but perhaps there's ways to work around it (since you're already scanning the original file).

With simpler things, there are fewer things to go wrong in the first place. I always choose the simplest design.

So... PAR1? With how simple the specification is, you'd be fairly confident that it'd be nigh impossible for anything to go wrong? =P

Linux has the option to disable "atime" and I do it on my "/tmp" partitions

I haven't heard of the ability to disable atime. You can disable updating atime (noatime mount option on Linux, fsutil option on Windows), but the functionality is still otherwise there.
The POSIX stat call and Windows GetFileTime call are both defined to always return an atime, so no, you can't disable it.

Unix times are signed values since 1970. FAT uses unsigned values since 1980. NTFS uses unsigned values since 1601.
I think FAT's resolution is 2 seconds and exFAT supports 10 ms. The exFAT file system stores the times in separate fields

Most programmers won't care about such details, as that's the job of the filesystem driver.
Unless the goal is to have PAR3 clients bundle their own filesystem drivers, you'd generally expect clients to use the OS interfaces to retrieve times, which typically abstract away such details.

One is the "reference client". It should be easy to understand and correct. I don't expect it to be very fast.

If the aim is two implementations, perhaps a high level scripting language like Python suits this one here.
Not having to worry about memory management, like you would in C++, can reduce complexity and make understanding it easier.

Windows documentation calls it an absolute path.

"absolute path from the root of the current drive" or "path relative to the root of the current drive" - just seems like word play to me.
In either case, it's not a fully absolute path, as you can't reference it without knowing what the current drive is.

We had trouble in Par2 with "/". par2cmdline got it wrong and used "\" instead some of the time.

That sounds more like a bug in par2cmdline than an issue in the specification?

mkruer commented 2 years ago

Can we extrapolate file names and permissions a bit further? The names of the files and the permissions on the files should be a separate DataStream. Sometimes users change the filename or permissions, and while it is nice to be able to restore those names and permissions, it is not always necessary or desirable. Anyone that has had to deal with RoboCopy or RSync can tell you that moving files around while stripping out permissions and other attributes becomes necessary at some point. One issue that has popped up from time to time is that I would create a PAR set and then later rename the files. With the PAR2 implementation, the file was considered missing and I was left with two options: have par rebuild the file, or keep the original name.

In a perfect world, the names would be irrelevant, and when reconstructing data it would look for matching files first regardless of name.

The method to speed up the process:

Match by name > Match by file size > Match additional select files (broken into blocks)

If the name for the file has changed then just allow for an update to the metadata stream or keep that information and add it to a history of some sort.

mdnahas commented 2 years ago

If the name for the file has changed then just allow for an update to the metadata stream or keep that information and add it to a history of some sort.

Yes. That is part of the Par3 design. It is described as an "incremental backup" in the draft specification. It allows a Par3 file to reuse the data and metadata from an existing Par3 file. The new Par3 file would contain the new file names, but reuse the recovery data from the old Par3 file.

mdnahas commented 2 years ago

Your example sounds like it'd only work if B completely contains A's contents. If there's changes in the middle of the chunk, it doesn't sound like you can easily split a chunk into multiple.

So, let version 1 of a file be: "abcdefghij". If the blocksize is 2 bytes, it would be encoded as a single chunk with length=10 and firstblockindex=0. The blocks would be: 0: "ab", 1:"cd", 2:"ef", 3:"gh", 4:"ij"

Then version 2 of a file inserts data in the center: "abcde0123fghij". It would be encoded as 3 chunks:
{length=5, firstblockindex=0, tailblockindex=2, tailoffset=0} {length=5,firstblockindex=5, tailblockindex=2, tailoffset=1} {length=4, firstblockindex=3} The blocks would be: 0: "ab", 1:"cd", 2:"ef", 3:"gh", 4:"ij", 5: "01", 6: "23".

That worked out better than I ever expected! I hadn't realized that the "tail"s could both reuse the data in block 2!
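For illustration, those descriptors could look something like this as a data structure (field names follow this discussion, not the exact packet layout in the spec):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical chunk descriptor, using the fields from the example above.
// A chunk whose length is not a multiple of the block size ends in a "tail"
// that points at some offset inside an existing block; the tail fields are
// unused when the length is an exact multiple of the block size.
struct ChunkDescriptor {
    uint64_t length;            // bytes of file data covered by this chunk
    uint64_t first_block_index; // index of the first full input block
    uint64_t tail_block_index;  // block holding the tail bytes
    uint64_t tail_offset;       // byte offset of the tail within that block
};

// Version 2 of the file, "abcde0123fghij", with a 2-byte block size:
const std::vector<ChunkDescriptor> version2 = {
    {5, 0, 2, 0},  // "abcde": blocks 0-1, plus tail "e" at offset 0 of block 2
    {5, 5, 2, 1},  // "0123f": blocks 5-6, plus tail "f" at offset 1 of block 2
    {4, 3, 0, 0},  // "ghij":  blocks 3-4, no tail
};
```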

With simpler things, there are fewer things to go wrong in the first place. I always choose the simplest design.

So... PAR1? With how simple the specification is, you'd be fairly confident that it'd be nigh impossible for anything to go wrong? =P

Exactly. I wouldn't have done Par2, except we wanted more capabilities. The trick is to add the features we need without adding (too much) complexity.

I haven't heard of the ability to disable atime. You can disable updating atime

Good point.

Unix times are signed values since 1970. FAT uses unsigned values since 1980. NTFS uses unsigned values since 1601. I think FAT's resolution is 2 seconds and exFAT supports 10 ms. The exFAT file system stores the times in separate fields

Most programmers won't care about such details, ...

Until the data in their filesystem cannot be stored inside a Par3 file. Then they care a lot.

I agree with you that it is extremely unlikely that we'll hit a case where a 64-bit signed UNIX time would cause a problem on a FAT or NTFS filesystem. I thought about making one structure for all file permissions, but then I ran into all the details for overlaps. I think it is easiest to have two different structures for UNIX and FAT and to keep those structures as close to the target file systems as possible.

One is the "reference client". It should be easy to understand and correct. I don't expect it to be very fast.

If the aim is two implementations, perhaps a high level scripting language like Python suits this one here. Not having to worry about memory management, like you would in C++, can reduce complexity and make understanding it easier.

A good idea. I'd have to check Python3's static typing features. When caring about correctness, I like static typing.

We had trouble in Par2 with "/". par2cmdline got it wrong and used "\" instead some of the time.

That sounds more like a bug in par2cmdline than an issue in the specification?

Yes. But if someone screwed it up, we can try to make the next specification harder to screw up. There are quite a few features in the specification designed to prevent screw ups. (Like strings not being NUL terminated.)

Yutaka-Sawada commented 2 years ago

I feel that storing Last Access Time will be worthless. Because a PAR client accesses input files to make PAR3 files, their stored Last Access Time may be the same as the creation time of the PAR3 files. When input files are verified by using the PAR3 files, a PAR client accesses them and updates the Last Access Time every time. So, it's useless to recover their Last Access Time anyway. Is there a good use for the stored Last Access Time?

Also, Last Access Time seems to be very low resolution on Windows OS from Microsoft's document;

For example, on NT FAT, create time has a resolution of 10 milliseconds, write time has a resolution of 2 seconds, and access time has a resolution of 1 day (really, the access date). On NTFS, access time has a resolution of 1 hour.

I found a typo in a table title.

Line 613 for "FAT Permissions Packet"; Table: UNIX File Packet Body Contents

mdnahas commented 2 years ago

I agree that saving last access time is strange. But it seems that some people care about this when using "tar": https://stackoverflow.com/questions/38248993/why-atime-is-not-preserved-in-tar

I think we should have a place to store the value. I'll add a comment about how clients can use it (or not).

mdnahas commented 2 years ago

I've got a new draft of the specification.

Par3_spec.txt

animetosho commented 2 years ago

That worked out better than I ever expected! I hadn't realized that the "tail"s could both reuse the data in block 2!

But how could an application detect such? I don't see how it could figure that the 'e' is identical, because the lowest granularity hash you have is the block hash.

Exactly. I wouldn't have done Par2, except we wanted more capabilities

...well I'm not sure I'd call PAR1 flawless, despite its simplicity.

Anyway, I'd argue that removing hashes actually simplifies the specification and the code. I think the problem here is that you don't feel it's safe, which seems to be more a case of 'simple for a designer to think about' as opposed to 'simple specification'.

I think it is easiest to have two different structures for UNIX and FAT and to keep those structures as close to the target file systems as possible.

My opinion is actually the exact opposite. I don't think it makes much sense to tailor specifically to filesystems - the fact that you've walked back on NTFS is a good example of why. Similarly, that you have a generic Unix "filesystem" shows that you really don't want to target specific filesystems.

The vast majority of Windows users will be using NTFS. If you were to take FAT literally, you'd be excluding most Windows users.

The whole point of filesystem abstractions that OSes give to applications, is to ensure that developers don't have to care about the underlying filesystem. Going against these abstractions generally isn't fun.

Tailoring to specific filesystems also means that when a new filesystem comes out, your design may no longer fit well with it. It also has problems where the application may not be able to actually get such information (e.g. network mounted filesystems, FUSE/virtual filesystems etc), and such tight coupling with underlying details is often considered poor design.

What would make more sense would be to target OS APIs instead of filesystem implementations. Make the distinction Unix/Windows, as opposed to FAT.
Still, I think it makes the most sense to unify times - cross platform applications only have to deal with one time format (instead of two), and many higher level languages don't directly expose the OS APIs.

Yes. But if someone screwed it up, we can try to make the next specification harder to screw up. There are quite a few features in the specification designed to prevent screw ups. (Like strings not being NUL terminated.)

NULL terminated strings are just a bad idea in general - not using them makes a lot of sense, and they're rightfully avoided in many places.

I get the attraction with avoiding bugs, but null separators for paths is just a weird approach that's not being done elsewhere, and has little tangible benefit outside of a specific implementation. I'm also not sure it really makes bugs harder - it just feels like kicking the can down the road.
For example, if a decoder gets the path "dir_1\0dir_2/dir_3", it's not clear to me exactly how that should be handled (the spec doesn't seem to say it's invalid), and I could easily see different applications doing different things.

mdnahas commented 2 years ago

Tailoring to specific filesystems also means that when a new filesystem comes out, your design may no longer fit well with it. .... What would make more sense would be to target OS APIs instead of filesystem implementations.

But APIs change too. Windows didn't have symbolic links, but then it added them.

Also, APIs behave differently based on the filesystem underneath. Windows supports FAT, FAT32, exFAT and NTFS and ReFS (whatever that is). So, we'd have to encode every possible return value from the API. And that can change when a new filesystem is introduced.

I have a very good idea what is in FAT, exFAT, and EXT4. I have documents that tell me the fields and their formats. I don't have a good idea of how Linux's low-level file system API works, let alone how its return values change based on every file system that could be mounted.

The good news is, if we decide that APIs are a better way to go, it will be easy to add. :)

I get the attraction with avoiding bugs, but null separators for paths is just a weird approach that's not being done elsewhere, and has little tangible benefit outside of a specific implementation. I'm also not sure it really makes bugs harder - it just feels like kicking the can down the road. For example, if a decoder gets the path "dir_1\0dir_2/dir_3", it's not clear to me exactly how that should be handled (the spec doesn't seem to say it's invalid), and I could easily see different applications doing different things.

Well, the spec will have a list of file/directory names that are not portable and should be avoided. "dir_2/dir_3" is not portable. It is an invalid file/directory name on Windows, Linux, and MacOS. "dir_2\dir_3" is also not portable. It is valid on Linux and MacOS, but not on Windows.

Remember, it will be easy to copy "dir\0dir1\0dir2\0foo.txt" to a char buffer and then replace the '\0' with the appropriate slash. And because the '\0' has to be replaced, people will need to think about whether the code is running on UNIX or Windows.
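A minimal sketch of that replacement, assuming the packed path is held with an explicit length so the embedded '\0' separators survive (the function name is made up for illustration):

```cpp
#include <string>

// Convert a packet path like "dir\0dir1\0dir2\0foo.txt" (stored with an
// explicit length, not NUL-terminated) into a native path by replacing each
// '\0' separator with the platform's slash.
std::string to_native_path(const std::string& packed_path) {
    std::string out = packed_path;   // std::string may hold embedded NULs
#ifdef _WIN32
    const char sep = '\\';
#else
    const char sep = '/';
#endif
    for (char& c : out) {
        if (c == '\0') c = sep;
    }
    return out;
}
```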

mdnahas commented 2 years ago

But how could an application detect such? I don't see how it could figure that the 'e' is identical, because the lowest granularity hash you have is the block hash.

So, let's assume the blocksize is 2 bytes and the encoding client's first file has the contents "abcdefghij". It will encode it as a single chunk with length=10 and firstblockindex=0. The blocks would be: 0: "ab", 1:"cd", 2:"ef", 3:"gh", 4:"ij"

What should the client do when the second file has contents: "abcde0123fghij"?

The client can do a rolling hash of the entire new file and identify duplicate input blocks from the first file. The client would find that "ab", "cd", "gh" and "ij" have already been assigned to block indices 0, 1, 3 and 4. When the client tries to package those into chunks, it is easy to see that "abcd" is the same in both files, so the client can check if the chunk continues into the next block. So, identifying "e" at the start of block 2 is easy. That makes the first chunk.

The client can also see that "ghij" is the same in both files. It can check if the data before "g" is also the same in both files. In this case, it is. Now, if that data is long enough, it can be packaged as the tail of the chunk preceding "ghij". In this case, it is long enough, so "f" becomes the tail for the new data in blocks 5 and 6.

I think that approach will work.

To perform the above algorithm, a client would have to keep around the rolling hash and fingerprints of all the blocks it has already seen. That doesn't seem like too much memory. When processing a new file, I don't think the client needs to run a rolling hash over the whole file before matching data with existing blocks. I think it only has to keep 2 blocks-worth of data in memory at any time --- enough space to find a duplicate block at any offset. When a sequence of duplicate blocks is discovered, the client will have to compare the data preceding and following it for the "tails". That would require reloading the older blocks back into memory for the comparison. That is the part that could consume a lot of time, if there are a lot of duplicate blocks.

So, it might be a little costly to do, but I think it is possible. The good news is that the runtime is slow only when there is a lot of compression due to duplication.
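For reference, here is a minimal Rabin-Karp-style rolling hash of the kind a client could slide over the new file; the actual rolling hash and fingerprint hash the spec settles on may differ:

```cpp
#include <cstdint>
#include <cstddef>

// Polynomial rolling hash (mod 2^64) over a window of `blocksize` bytes.
// Illustration only: it shows how the window can advance one byte at a time
// while candidate duplicate blocks are looked up by hash.
struct RollingHash {
    static constexpr uint64_t BASE = 257;
    uint64_t hash = 0;
    uint64_t top_power = 1;  // BASE^(blocksize-1), weight of the oldest byte
    size_t blocksize;

    explicit RollingHash(size_t bs) : blocksize(bs) {
        for (size_t i = 1; i < bs; i++) top_power *= BASE;
    }

    // Hash the first `blocksize` bytes of the window.
    void init(const uint8_t* data) {
        hash = 0;
        for (size_t i = 0; i < blocksize; i++) hash = hash * BASE + data[i];
    }

    // Slide the window one byte: drop `outgoing`, append `incoming`.
    void roll(uint8_t outgoing, uint8_t incoming) {
        hash = (hash - outgoing * top_power) * BASE + incoming;
    }
};
```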

Yutaka-Sawada commented 2 years ago

About Directory Packet in the latest PAR3 spec;

4 | unsigned int | number of options (a.k.a. permissions and links)
16*? | fingerprint hash | checksums of packets for options

The next two fields hold the number of options and the checksums of the packets holding those options.

I posted an idea of a unique index number (counter) for input files a while ago. My idea might be simpler than these complex items in the new Directory Packet, though each File Packet becomes 8 bytes larger by storing a counter.

File tree design decision.

  1. Force uniqueness by adding a counter, so that Files and Directories are never duplicates.
    • "counter" cannot be a hash of the path, because we'd need to deal with changing of directory names.

Thank you for considering my idea. I have rethought it. While a counter cannot replace a hash, the checksums of File Packets would become distinguishable. By putting a counter in every File Packet, their checksums will become different from each other. A 1-bit difference in a File Packet's content results in a completely different 16-byte fingerprint hash. Even when their filenames and data bytes are the same, the unique index number produces a different hash value. Thus, the file-tree problem will disappear. It is enough for the Directory Packet to store checksums (16-byte fingerprint hashes) of child File Packets and Directory Packets, because these checksums must differ thanks to the included counter. Even if two packet checksums happen to be the same, the hash collision can be solved by exchanging the counters. When a client can distinguish input files by their checksums, the construction of a PAR3 file becomes simpler.

About Mr. Anime Tosho's question;

But how could an application detect such? I don't see how it could figure that the 'e' is identical, because the lowest granularity hash you have is the block hash.

I agree with Mr. Anime Tosho. Unless I can compare bytes of both the old and new files, it's difficult to find such split fragments: 'e' and 'f' from "abcde0123fghij". As Michael Nahas wrote, I can find "ab" and "cd" by using the slices' checksums. Then I can suspect the starting offset of "ef". But it's impossible to determine the boundary of 'e' and "0123f" in "e0123f", because the block size is normally larger than 2 bytes. They will be treated as 3 inserted blocks: "e0", "12", and "3f".

For example, block size may be 6-bytes like below; "aaabbbcccdddeeefffggghhhiiijjj" It would be encoded as a single chunk with length=30 and firstblockindex=0. The blocks would be: 0: "aaabbb", 1:"cccddd", 2:"eeefff", 3:"ggghhh", 4:"iiijjj"

Then, 12 bytes are inserted in the center: "aaabbbcccdddeee000111222333fffggghhhiiijjj". The first 2 blocks can be found: 0: "aaabbb", 1: "cccddd". The last 2 blocks can be found: 3: "ggghhh", 4: "iiijjj". But it's impossible to pick "eeefff" out of "eee000111222333fff".

Case of tail length = 1: "e", "ee0001", "112223", "33fff"
Case of tail length = 2: "ee", "e00011", "122233", "3fff"
Case of tail length = 3: "eee", "000111", "222333", "fff"
Case of tail length = 4: "eee0", "001112", "22333f", "ff"
Case of tail length = 5: "eee00", "011122", "2333ff", "f"

How to determine which tail length is correct? Or else, it will think that 3 blocks were inserted: "eee000", "111222", "333fff"

It would be encoded as 3 chunks: {length=12, firstblockindex=0} {length=18, firstblockindex=5} {length=12, firstblockindex=3} The blocks would be: 0: "aaabbb", 1:"cccddd", 5:"eee000", 6:"111222", 7:"333fff", 3:"ggghhh", 4:"iiijjj" (block no.2 in old file is missing.)

mdnahas commented 2 years ago

About Directory Packet in the latest PAR3 spec;

Thank you for considering my idea. I have rethought it. While a counter cannot replace a hash, the checksums of File Packets would become distinguishable. ....

Yes, if we include a counter, then the hashes become unique. That means the "tree of hashes" would be an actual tree and not share any branches. But I thought the point of the counter was to save space. That is, the Directory and Root packets would store counters that are 4-bytes or 8-bytes in size, rather than 16-byte (or more) packet hashes.

When I chose the alternative, with options counters + options hashes in the File/Directory/Root packets, it fixed two problems. One was that it got rid of the awkward "Permissions Grouping packet", because the permissions were now part of the tree-(not-actually-a-tree)-of-hashes. The other is that it was okay for the "tree of hashes" to reuse branches of the tree, because we weren't using packet hashes to uniquely identify nodes in the tree. Personally, I think getting rid of the "Permissions Grouping packets" was a big win.

About Mr. Anime Tosho's question;

But how could an application detect such? I don't see how it could figure that the 'e' is identical, because the lowest granularity hash you have is the block hash.

I agree with Mr. Anime Tosho. Unless I can compare bytes of both the old and new files, it's difficult to find such split fragments: 'e' and 'f' from "abcde0123fghij". As Michael Nahas wrote, I can find "ab" and "cd" by using the slices' checksums. Then I can suspect the starting offset of "ef". But it's impossible to determine the boundary of 'e' and "0123f" in "e0123f", because the block size is normally larger than 2 bytes. They will be treated as 3 inserted blocks: "e0", "12", and "3f".

For example, block size may be 6-bytes like below; "aaabbbcccdddeeefffggghhhiiijjj" It would be encoded as a single chunk with length=30 and firstblockindex=0. The blocks would be: 0: "aaabbb", 1:"cccddd", 2:"eeefff", 3:"ggghhh", 4:"iiijjj"

Then, 12 bytes are inserted in the center: "aaabbbcccdddeee000111222333fffggghhhiiijjj". ....

Yes, that is correct. You cannot identify fractions-of-blocks without reloading the block data. I think you can identify "tails" by recognizing that they are a continuation of blocks that you've already recognized.

I said before that I think there is an algorithm that works with a 2*blocksize window. The window actually has to be 2*blocksize-1 bytes long. The algorithm is hard to describe, so I'll work through your example with it. I'm going to take this slowly, so sorry if it feels tedious at any point.

I think we can make it work with a 2*blocksize-1 window that scans the new file. So assume we've seen the first file and created the blocks as you described above. Since the blocksize is 6, the window size is 2*6-1 = 11 bytes. When we start to slide the 11-byte window over the new file, its contents are:

W=[aaabbbcccdd]
The client scans the window using the rolling hashes of all the existing blocks (0 to 4). For any matches, it confirms that it is the existing block using the existing fingerprint hashes. When the client does that on this window, it finds that "aaabbb" matches block 0. So that becomes the start of a preliminary chunk {length=6+?, firstblockindex=0, tail?} and the client moves the window beyond that block.

W=[cccdddeee00] The client scans the new window with the existing hashes. The first complete block it finds is "cccddd". That is a continuation of the existing preliminary chunk, so it updates the preliminary chunk to {length=12+?, firstblockindex=0, tail?} and moves the window.

W=[eee00011122] The client scans this window with the existing hashes. It finds no matches. There is a preliminary chunk that it has been working on, so the client checks if it has a tail. The last block in the preliminary chunk was block 1, so the client checks if the start of the window matches the start of the next block, which is block 2. This requires loading block 2 back into memory, which is expensive, but is necessary to find the tail. When it compares the start of the window to the start of block 2, it finds that "eee" is common to both. The client declares that that is the tail of the preliminary chunk. The client finalizes the chunk as {length=15, firstblockindex=0, tailblockindex=2, tailoffset=0}. The client moves the window past the tail.

W=[00011122233] The client scans and finds no matches with existing blocks. It has no preliminary chunk, so it starts a new preliminary chunk with a new input block: {length=6+?, firstblockindex=5, tail?} The client stores the rolling hash and fingerprint of the new block. The client moves the window past the new block.

W=[222333fffgg] The client scans and finds no match. The client declares this a new block. It adds it to the end of the preliminary chunk and gets {length=12+?, firstblockindex=5, tail?}. It moves the window.

W=[fffggghhhii] The client scans and finds an existing block, block 3, at offset 3. That existing block will become part of a new chunk, so the client has to finalize the existing preliminary chunk. The question is, what does the client do with the "fff"? Before writing "fff" into a new block, the client can recognize that "ggghhh" is in block 3 and that "fff" might be present at the end of block 2. So, it loads block 2 into memory and compares the end of block 2 to "fff". When it finds a match, the client finalizes the existing preliminary chunk as {length=15, firstblockindex=5, tailblockindex=2, tailoffset=3}. (If the end of block 2 did not match "fff", the tail would have been written into the first 3 bytes of block 7 and the finalized chunk would be {length=15, firstblockindex=5, tailblockindex=7, tailoffset=0}.) The client can now start a new preliminary chunk with the block it found: {length=6+?, firstblockindex=3, tail?}. It moves the window after the found block.

W=[iiijjj] The scan says this is block 4. It is a continuation of the existing preliminary chunk. That becomes {length=12+?, firstblockindex=3, tail?}. The client moves the window.

W=[] This is the end of the file. The client finalizes the preliminary chunk as {length=12, firstblockindex=3}

So the chunks are: {length=15, firstblockindex=0, tailblockindex=2, tailoffset=0} {length=15, firstblockindex=5, tailblockindex=2, tailoffset=3} {length=12, firstblockindex=3}

I'm sorry if that was slow and repetitive, but I hope it was clear. I think that that is an acceptable algorithm for deduplication and chunking. It does require reloading some blocks into memory after they've been processed. I think that has to happen with any good algorithm. The good news is that reloading only happens after an existing block has been identified, so if the process is slow, it is because deduplication is actually happening. That is, the user gets something for the delay.
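As a small illustration of the tail-detection step in that walk-through (the W=[eee00011122] case), here is a sketch of the prefix comparison; block loading and hashing are outside this snippet:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// After a run of matched blocks, compare the bytes that follow in the new file
// against the next old block. The length of the common prefix is the tail of
// the chunk being finalized (e.g. 3 for "eee000..." vs. old block 2 "eeefff").
size_t tail_length(const uint8_t* new_data, size_t new_len,
                   const uint8_t* old_block, size_t blocksize) {
    size_t limit = std::min(new_len, blocksize);
    size_t i = 0;
    while (i < limit && new_data[i] == old_block[i]) i++;
    return i;
}
```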

The above algorithm would not work in parallel. There is a parallel algorithm, but it requires three passes over each file. In the first pass, for every file, you compute the rolling hash and fingerprint hash at blocksize offsets. (That is, as if the files were each a single chunk and chopped into blocksize pieces.) The second pass would be similar to the single-threaded algorithm: running a rolling hash over every file and identifying the matches to the blocks found in the first pass. The second pass would also identify the "tails" before and after the matching blocks. After the second pass, the client would run a fast single-threaded calculation where it decided what data would be put into which input blocks. After that, there would need to be a third pass, where the recovery data was calculated using the input blocks and their indexes.

Does that make sense?

Rhialto commented 2 years ago

On Tue 07 Dec 2021 at 02:19:41 -0800, Yutaka-Sawada wrote:

I agree with Mr. Anime Tosho. Unless I can compare bytes of both the old and new files, it's difficult to find such split fragments: 'e' and 'f' from "abcde0123fghij". As Michael Nahas wrote, I can find "ab" and "cd" by using the slices' checksums. Then I can suspect the starting offset of "ef". But it's impossible to determine the boundary of 'e' and "0123f" in "e0123f", because the block size is normally larger than 2 bytes. They will be treated as 3 inserted blocks: "e0", "12", and "3f".

This is true, but the fact that finding this is difficult may not be so important. The specification can work like it is with some video compression algorithms: they specify the bitstream, and how to decompress it. They have no doubt thoughts about how to create the bitstream, but this is not part of the specs. If somebody thinks of a smarter way to generate a bitstream, they can do that.

-Olaf.

animetosho commented 2 years ago

But APIs change too. Windows didn't have symbolic links, but then it added them.

Filesystems change over time too - NTFS didn't support symbolic links prior to version 3.1.

As for APIs, they actually generally don't change, as doing so would break backwards compatibility. Windows is big on backwards compatibility, and Linus is big on never breaking userspace.
Of course, new APIs come out all the time, but you can never design anything that'll somehow take advantage of every possible new future change.

Windows supports FAT, FAT32, exFAT and NTFS and ReFS

Don't forget ISO9660 and UDF!

So, we'd have to encode every possible return value from the API

Fortunately, every possible return value is defined and well documented, so that shouldn't be an issue.
For example, Windows' GetFileTime documents that times are returned as 64-bit integers - this is something that can never change, otherwise it would break compatibility with a lot of Windows applications.

I don't have a good idea of how Linux's low-level file system API works

I'm not sure what your point is here, but generally APIs are much better documented than filesystems, so it doesn't sound like a common thing amongst developers (particularly since filesystem internals are typically of interest to a very small audience, whilst APIs are more relevant to the majority of developers).

Perhaps a good place to start on the API would be the POSIX stat call.

Also, APIs behave differently based on the filesystem underneath

Generally this shouldn't happen, assuming the underlying filesystem supports the requirements of the API. If not, it's the job of the abstraction layer to supply sane defaults.
None of this is typically a concern of userspace applications. If the filesystem driver happens to be buggy, it's expected that all userspace applications inherit this bug.

At the end of the day, even if you come up with some perfect representation of all filesystems, I don't see how you could expect anyone to actually implement it. Applications must go through the OS API regardless, so if that API is mangling the results or being unpredictable, it's not like any program can actually do anything about it.

Well, the spec will have a list of file/directory names that are not portable and should be avoided

Unfortunately, 'avoided' doesn't mean it can't happen. A decoder still has to do something about it when it encounters it.

"dir_2/dir_3" is not portable. It is an invalid file/directory name on Windows, Linux, and MacOS

Actually, it's considered valid under all three OSes. It does refer to two objects as opposed to one, but that doesn't stop it being valid and accepted by all the APIs.

And because the '\0' has to be replaced, people will need to think about whether the code is running on UNIX or Windows.

Windows aliases '/' to '\', so just blindly replacing '\0' with '/' actually works just fine - no thinking required.

It does require reloading some blocks into memory after they've been processed

But where would you reload the blocks from?

If it's from disk, that would require both the old and new files to actually be present. If we're talking about updates, there's a good chance that the old file is no longer available, because the new file would've overwritten the old.
You could possibly limit this to just incremental backup scenarios, where the old data must be present. Whilst it sounds doable now, I'm not sure how a client would actually implement it, as it sounds like it'd need to have intricate knowledge of how the incremental backup system being used, actually works.

The only other option would be to recompute the old data from the recovery that's already present. But this is both slow and only works if there aren't many changes.

keep around the rolling hash and fingerprints of all the blocks it has already seen. That doesn't seem like too much memory
I think it only has to keep 2 blocks-worth of data in memory at any time

Depends if the aim is still to design for 2^64 blocks + 2^64 bytes/block.
I guess I wouldn't be too concerned here, as there may be workarounds.

I think that that is an acceptable algorithm for deduplication and chunking

It feels a little too simplistic for what it may encounter, for example:

But it's the creator's job to find a good algorithm to handle these in a desirable manner (or choose not to handle some cases), so not really an issue with the spec.
It may be that some cases just aren't supportable, which I think is fine as I don't think PAR should really aim to be a powerhouse at dedupe.

Yutaka-Sawada commented 2 years ago

The other is that it was okay for the "tree of hashes" to reuse branches of the tree, because we weren't using packet hashes to uniquely identify nodes in the tree.

Oh, I see.

Does that make sense?

Thank you for the answer. By temporarily restoring a split (or missing) block, it will be possible.

mdnahas commented 2 years ago

The specification can work like it is with some video compression algorithms: they specify the bitstream, and how to decompress it. They have no doubt thoughts about how to create the bitstream, but this is not part of the specs. If somebody thinks of a smarter way to generate a bitstream, they can do that.

Exactly. Great comment, @Rhialto

mdnahas commented 2 years ago

So, we'd have to encode every possible return value from the API

Fortunately, every possible return value is defined and well documented, so that shouldn't be an issue. For example, Windows' GetFileTime documents that times are returned as 64-bit integers - this is something that can never change, otherwise it would break compatibility with a lot of Windows applications.

I don't have a good idea of how Linux's low-level file system API works

I'm not sure what your point is here, but generally APIs are much better documented than filesystems, so it doesn't sound like a common thing amongst developers (particularly since filesystem internals are typically of interest to a very small audience, whilst APIs are more relevant to the majority of developers).

I know how to read filesystem documentation and it tends to have everything in one place and says "this is the disk layout of the data", which makes it easy for the specification to copy their layout.

I have not used the low-level file system API of Linux, MacOS, nor Windows. And understanding them is not just a "copy this data structure", but also dealing with all the possible ways to call the function and its possible return values. We can definitely make the specification work with an API's interface, but I feel like there will be a lot more learning and more union data structures and variable length storage.

Perhaps a good place to start on the API would be the POSIX stat call.

So, that man page describes 4 different functions, one of which takes 3 flags, and 13 possible error values. This is the complication that I'm talking about. I'm not saying it is impossible to understand and encode in the specification, just that it is complicated.
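For concreteness, here is a minimal sketch of the subset of stat() data an encoding client might gather (Linux field names; macOS spells the timestamp fields differently):

```cpp
#include <sys/stat.h>
#include <cstdio>

// Minimal sketch: the metadata a Par3 encoder might collect for one file,
// using only the POSIX stat interface discussed above.
int print_posix_metadata(const char *path) {
    struct stat st;
    if (lstat(path, &st) != 0) {          // lstat() so symlinks are not followed
        perror("lstat");
        return -1;
    }
    std::printf("mode:  %o\n", (unsigned)(st.st_mode & 07777));  // permission bits
    std::printf("uid:   %u\n", (unsigned)st.st_uid);
    std::printf("gid:   %u\n", (unsigned)st.st_gid);
    std::printf("mtime: %lld.%09ld\n",
                (long long)st.st_mtim.tv_sec, st.st_mtim.tv_nsec);
    return 0;
}
```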

Also, APIs behave differently based on the filesystem underneath

Generally this shouldn't happen, assuming the underlying filesystem supports the requirements of the API. If not, it's the job of the abstraction layer to supply sane defaults. None of this is typically a concern of userspace applications. If the filesystem driver happens to be buggy, it's expected that all userspace applications inherit this bug.

Yes, but what happens when I mount an exFAT filesystem under Linux? For example, when someone puts their backup on an exFAT formatted USB drive. They may care about permissions --- especially that the files are marked "read-only". Do we store the file attributes that Linux makes up because the exFAT filesystem doesn't support all the features of the Linux file system interface?

At the end of the day, even if you come up with some perfect representation of all filesystems, I don't see how you could expect anyone to actually implement it. Applications must go through the OS API regardless, so if that API is mangling the results or being unpredictable, it's not like any program can actually do anything about it.

I think this is the best argument for using an API-based layout. And it's a really good one.

The counterargument is what to do if the same filesystem is mounted under different APIs? If someone puts their backup on an exFAT formatted drive, which could be mounted under Windows or Linux, what should we do? Do we say "the Par3 file was created on Windows, so we stored the Windows permissions" or say "the Par3 file was created for an exFAT filesystem and we stored the exFAT permissions"?

Since there are good arguments on both sides, what's the right answer?

Well, each client will be running with an API and a file system. The encoding client will use its API to access the files. So, the information it has available to write in the Par3 file is the intersection of the information available from the file system with the information passed through the API. The decoding client is in a similar situation with its own API, but it will be using the same file system as the encoding client. Yes, the input files could be on a different file system, but if the user cares about specific permissions, the decoding client will be run on a file system that can encode those permissions.

So, the information transmitted from encoding client to decoding client is the intersection of the information in the file system with the information readable by the encoding client's API and the information readable and writable by the decoding client's API. Since the APIs can have more information than what is stored in the file system (e.g., Linux making up permissions for files on exFAT) and there are two different APIs that would need to be translated to each other, I think the best approach is to use the file system's format to store information in the Par3 file.

Does everyone accept that argument?

It will mean that encoding clients will have to determine what the file system is. It might also mean that we need to include a mask or state default values in the Permissions packets spec, in case some data stored in the file system is not visible via an API. E.g., I doubt the "system" and "archive" bits of the FAT file system are visible via the Linux API.

Well, the spec will have a list of file/directory names that are not portable and should be avoided

Unfortunately, 'avoided' doesn't mean it can't happen. A decoder still has to do something about it when it encounters it.

"dir_2/dir_3" is not portable. It is an invalid file/directory name on Windows, Linux, and MacOS

Actually, it's considered valid under all three OSes. It does refer to two objects as opposed to one, but that doesn't stop it being valid and accepted by all the APIs.

I said it is not a valid filename / directoryname. And I was right. What you mean to say is that it is a valid path. A file/directoryname and a path are two different things.

The correct behavior would be to change the filename/directoryname to one that is valid on the decoding client's system. E.g., change "dir_2/dir_3" to "dir_2_dir_3" or something else.
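As a rough illustration only (the replacement character and the character set below are one possible choice, not something the spec would mandate):

```cpp
#include <string>

// Illustrative only: map characters that are invalid in a single Windows
// file/directory name to '_', e.g. "dir_2/dir_3" -> "dir_2_dir_3".
// A real client would also handle reserved names (CON, NUL, ...) and
// trailing dots/spaces; this sketch covers just the character set.
std::string sanitize_name_for_windows(const std::string &name) {
    static const std::string invalid = "<>:\"/\\|?*";
    std::string out = name;
    for (char &c : out) {
        if (invalid.find(c) != std::string::npos ||
            static_cast<unsigned char>(c) < 0x20)   // control characters
            c = '_';
    }
    return out;
}
```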

And because the '\0' has to be replaced, people will need to think about if the code is running on UNIX or Windows.

Windows aliases '/' to '\', so just blindly replacing '\0' with '/' actually works just fine - no thinking required.

That wasn't where the no thinking happened. It was writing a valid Windows path into a Par2 packet.

It does require reloading some blocks into memory after they've been processed

But where would you reload the blocks from?

If it's from disk, that would require both the old and new files to actually be present. If we're talking about updates, there's a good chance that the old file is no longer available, because the new file would've overwritten the old. You could possibly limit this to just incremental backup scenarios, where the old data must be present. Whilst it sounds doable now, I'm not sure how a client would actually implement it, as it sounds like it'd need to have intricate knowledge of how the incremental backup system in use actually works.

Yes, the algorithm that I stated will only work when the input blocks can be read into memory. That is the case when you're making a new Par3 file. You are correct that when doing an incremental backup, the parent's input blocks may not be available to be loaded into memory. In that case, my algorithm will not be able to match "tails" with portions of the parent's input blocks. That will reduce the amount of data reuse. I don't see a good way around that.

The only other option would be to recompute the old data from the recovery that's already present. But this is both slow and only works if there aren't many changes.

You're right. And I certainly don't think most client writers would bother to do that.

keep around the rolling hash and fingerprints of all the blocks it has already seen. That doesn't seem like too much memory

I think it only has to keep 2 blocks-worth of data in memory at any time

Depends if the aim is still to design for 2^64 blocks + 2^64 bytes/block. I guess I wouldn't be too concerned here, as there may be workarounds.

I think that that is an acceptable algorithm for deduplication and chunking

It feels a little too simplistic for what it may encounter, for example:

* two insertions in one block

* insertion + deletion + changed bytes

* portions of blocks being moved around

* portions of blocks being duplicated

* or some combination of the above

I agree that it isn't perfect. The algorithm requires that every duplication include at least 1 complete input block. Moreover, it only identifies "tails" when they are in the input blocks before or after the sequence of repeated input blocks. But I think the algorithm is easy to implement and does a lot with a window of just 2*blocksize-1 bytes.
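To sketch the matching loop, here is a rough illustration that uses a simple Rabin-Karp style rolling hash as a stand-in for whatever 64-bit rolling CRC the spec settles on; all names are illustrative, and a real client would confirm every candidate with the fingerprint hash:

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Illustrative Rabin-Karp style rolling hash over a fixed window.
// (A stand-in for the 64-bit rolling CRC the spec actually calls for.)
struct RollingHash {
    static constexpr uint64_t B = 0x100000001b3ULL;   // arbitrary odd multiplier
    uint64_t h = 0, top = 1;                           // top = B^(window-1)
    RollingHash(const uint8_t *p, size_t window) {
        for (size_t i = 0; i < window; ++i) h = h * B + p[i];
        for (size_t i = 1; i < window; ++i) top *= B;
    }
    void roll(uint8_t out, uint8_t in) { h = (h - out * top) * B + in; }
};

// Slide the hash over the new data and report positions whose rolling hash
// matches a block already seen; the caller would confirm each candidate with
// the fingerprint hash before recording it as a duplicate.
std::vector<size_t> find_candidates(const uint8_t *data, size_t size, size_t block,
                                    const std::unordered_map<uint64_t, size_t> &seen) {
    std::vector<size_t> hits;
    if (size < block) return hits;
    RollingHash rh(data, block);
    for (size_t pos = 0;; ++pos) {
        if (seen.count(rh.h)) hits.push_back(pos);     // candidate duplicate at 'pos'
        if (pos + block >= size) break;
        rh.roll(data[pos], data[pos + block]);
    }
    return hits;
}
```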

But it's the creator's job to find a good algorithm to handle these in a desirable manner (or choose not to handle some cases), so not really an issue with the spec. It may be that some cases just aren't supportable, which I think is fine as I don't think PAR should really aim to be a powerhouse at dedupe.

The specification says that we try to be the best in redundancy and splitting. Anything else --- permissions, deduplication, compression, etc. --- we're just trying to have something that works most of the time for the ease of our users. If they really care about shrinking the size of things, there are probably much better deduplication (and compression) programs out there.

animetosho commented 2 years ago

I have not used the low-level file system API of Linux, MacOS, nor Windows

What API have you used to interact with files before? I presume it's a higher level one, which I think is fine to use as a basis, as they're ultimately built on top of OS APIs and will share many aspects of them.
I also don't think low level APIs should be the only consideration, as not everyone deals with low level APIs (such as yourself). The low level APIs are fine as a baseline though.

And understanding them is not just a "copy this data structure", but also dealing with all the possible ways to call the function and its possible return values

I can't see any case where the latter two actually matter to the design of a spec though. Whilst I'm not sure copying the structure verbatim is necessarily what you want, it alone should be a good basis.

So, that man page describes 4 different functions, one of which takes 3 flags, and 13 possible error values

The different functions do pretty much the same thing. The distinction is mostly for a client to consider - for example, whether it should get info from a file path, or from a file handle, or whether they wish to follow symlinks.
Similarly, the errors are for a client to be concerned about - for example, how to handle an 'access denied' error is not a detail that the spec needs to consider.

I get that you may be unfamiliar with it, but I think it's much more relevant than filesystem specifics. For one, most programmers are more familiar with the former, and for another, unifying filesystems under a consistent interface is already a solved problem - there's no need to invent some new way to handle multiple filesystems (well, unless you believe you can do a much better job).

what happens when I mount an exFAT filesystem under Linux? For example, when someone puts their backup on an exFAT formatted USB drive. They may care about permissions --- especially that the files are marked "read-only". Do we store the file attributes that Linux makes up because the exFAT filesystem doesn't support all the features of the Linux file system interface?

Yes - I would expect most applications to do this. An ambitious application can try to decode the FAT properties and use that instead, or maybe just detect that it isn't a Unix native system and not store Unix properties (assuming it can query the filesystem type).

The spec doesn't have to strictly adhere to Unix/Windows APIs, but rather be appropriate for them, which means that a client could choose to use the Windows form even when running under Linux, for example.

Do we say "the Par3 file was created on Windows, so we stored the Windows permissions" or say "the Par3 file was created for an exFAT filesystem and we stored the exFAT permissions"?

Why not allow both? The client can decide what set of permissions it wishes to store. The spec already needs to allow for files spread across different filesystem types anyway.

I said it is not a valid filename / directoryname. And I was right. What you mean to say is that it is a valid path. A file/directoryname and a path are two different things.

That's true, but I don't believe APIs make that distinction.
Your point of using a NULL separator was that par2cmdline forgot to fix the path, so this approach makes it less likely. Unfortunately, all it really does is move the potential mistake from the encode side to the decode side.

In my opinion, the best way to avoid these sorts of bugs is to develop test scripts which check all these corner cases, that can be used against any client to verify correctness.

mdnahas commented 2 years ago

I think we agree that a common usage will be Par3 protecting data on a USB drive or SD card formatted with exFAT. I don't know how Windows handles exFAT but, on Linux, every file has to have all the Linux file permissions. So, with files on an exFAT file system, the OS manufactures most of the permissions. They are set as an option when mounting the drive. So, Linux's normal file API will return permissions that are not stored with the files. If that same USB drive is mounted on another Linux system with different mounting options, those permissions are going to be different and the Linux system cannot change them, because they cannot be written to the disk --- they come from the mounting options for the disk. So, in what we agree will be a common example, we have an encoding client writing permissions that don't exist on disk into a Par3 file and a decoding client unable to set them on the recovered files.

There is a fix. Linux does offer a special API for dealing with FAT filesystems. So, it would make sense for a Linux client working with an exFAT filesystem to read the permissions through that API. Those would actually be present on the files. Any other system mounting the filesystem would be able to read and write them.
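For illustration, reading those on-disk FAT attribute bits on Linux might look something like this (using the FAT_IOCTL_GET_ATTRIBUTES ioctl from &lt;linux/msdos_fs.h&gt;; whether a particular exFAT driver supports the same ioctl would need to be checked):

```cpp
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/msdos_fs.h>   // FAT_IOCTL_GET_ATTRIBUTES, ATTR_RO, ATTR_HIDDEN, ...
#include <unistd.h>
#include <cstdio>

// Sketch: read the FAT attribute bits actually stored on disk for a file on a
// vfat-mounted volume, instead of the permissions Linux makes up from the
// mount options.
int print_fat_attributes(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }
    __u32 attrs = 0;
    if (ioctl(fd, FAT_IOCTL_GET_ATTRIBUTES, &attrs) != 0) {
        perror("ioctl");      // e.g. ENOTTY if the filesystem doesn't support it
        close(fd);
        return -1;
    }
    close(fd);
    std::printf("read-only: %d  hidden: %d  system: %d  archive: %d\n",
                !!(attrs & ATTR_RO), !!(attrs & ATTR_HIDDEN),
                !!(attrs & ATTR_SYS), !!(attrs & ATTR_ARCH));
    return 0;
}
```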

If you want an API-based packet format, I think you'd probably agree that Linux's normal file API and Linux's special FAT filesystem API are different APIs with different permissions. So, an API-centric design would have two different packet formats, one for each API.

That's under Linux, but we should also consider what to do with the exFAT drive under Windows. Windows has lots of API calls to get all file attributes. But those are going to return different values for files on an exFAT drive vs. an NTFS drive. The exFAT files don't have multiple names, owners, access control lists, alternative data streams, etc.. So, if we use a single packet type to record the permissions returned by Window's API calls, it will need optional fields. That is, it will need a way to encode the NTFS value and another way to say "not present" for all the attributes not stored on an exFAT file system. Since those optional values are all determined by the filesystem, it makes sense to just have 1 bit at the front of the packet that indicates if all the NTFS values are present or not. We can go a step further because the timestamps, which are common to both exFAT and NTFS, are lower resolution on the exFAT file system. So, we can save space if we let that 1 bit also decide if we store the high-resolution NTFS timestamps or the low-resolution exFAT timestamps. So, really, the packet format will have 1 bit at the front which decides if we store NTFS permissions or exFAT permissions.

So, for this API-centric design, we have 3 permission packet types: Linux's default API, Linux's special FAT API, and Windows' API. And the Windows API packet has two formats, holding the API responses for either the NTFS or the exFAT permissions.

It's worth talking about the difference between Linux's special FAT API packet and the Windows API packet with the exFAT permissions. They will be holding the same data, because it is just the information stored by a FAT file system. They will not be identical, though, because the timestamps will be stored differently: Linux's API uses signed nanoseconds since Jan. 1, 1970, and Microsoft's API encodes times as 100 ns intervals since Jan. 1, 1601. That is the only difference.
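To make that difference concrete, converting between the two encodings is just an epoch shift and a change of units; a minimal sketch (assuming non-negative values):

```cpp
#include <cstdint>

// The two epochs differ by 11,644,473,600 seconds (1601-01-01 to 1970-01-01).
// FILETIME counts 100 ns ticks; Unix timestamps here are seconds + nanoseconds.
constexpr int64_t EPOCH_DIFF_SECS = 11644473600LL;

// Unix (seconds, nanoseconds since 1970) -> Windows FILETIME ticks since 1601.
int64_t unix_to_filetime(int64_t secs, int64_t nsecs) {
    return (secs + EPOCH_DIFF_SECS) * 10000000LL + nsecs / 100;
}

// Windows FILETIME ticks since 1601 -> Unix seconds + nanoseconds since 1970.
void filetime_to_unix(int64_t ticks, int64_t &secs, int64_t &nsecs) {
    secs  = ticks / 10000000LL - EPOCH_DIFF_SECS;
    nsecs = (ticks % 10000000LL) * 100;
}
```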

But, if we want an API-centric design, we should keep the timestamps in the format of their respective APIs. It will be up to clients to translate from every API to their native API. If we add a new API, like MacOS or Fuchsia or BeOS or AmigaOS or anything else, we will have to add a new permissions packet to the specification for each API, even if all are using the same exFAT filesystem. And it will be up to every client to translate the permissions packets from all those APIs to their native API.

But I think storing permissions based on the filesystem is better. The format we store in our packets can just copy the file system. Every encoding client can translate their data (back) to the filesystem's format. Every decoding client can then transform (again) the file system's permissions to their own OS's. The Par3 files on an exFAT drive will be the same for a Linux system, a Windows system, and for any other OS that runs a Par3 client.

FWIW, I wasn't sure whether permissions should be API-centric or filesystem-centric before I wrote "what's the right answer? ..." in the previous post. And, after writing this post, I've wholeheartedly convinced myself that filesystem-centric is the right way to go. Have I convinced you too, @animetosho ?

animetosho commented 2 years ago

Windows has lots of API calls to get all file attributes

The response you can get is described in WIN32_FILE_ATTRIBUTE_DATA. The GetFileAttributesEx call is all one really needs.
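For illustration, a minimal sketch of that call and the fields it returns (error handling kept minimal):

```cpp
#include <windows.h>
#include <cstdio>

// Minimal sketch: everything GetFileAttributesEx returns for one file --
// the attribute bits, the three timestamps, and the size.
int print_windows_metadata(const wchar_t *path) {
    WIN32_FILE_ATTRIBUTE_DATA info;
    if (!GetFileAttributesExW(path, GetFileExInfoStandard, &info)) {
        std::printf("GetFileAttributesEx failed: %lu\n", GetLastError());
        return -1;
    }
    ULARGE_INTEGER mtime;
    mtime.LowPart  = info.ftLastWriteTime.dwLowDateTime;
    mtime.HighPart = info.ftLastWriteTime.dwHighDateTime;
    std::printf("attributes: 0x%08lx\n", info.dwFileAttributes);
    std::printf("read-only:  %d\n", !!(info.dwFileAttributes & FILE_ATTRIBUTE_READONLY));
    std::printf("mtime:      %llu (100 ns ticks since 1601)\n",
                (unsigned long long)mtime.QuadPart);
    return 0;
}
```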

But those are going to return different values for files on an exFAT drive vs. an NTFS drive. The exFAT files don't have multiple names, owners, access control lists, alternative data streams, etc..

Actually, they should be the same, because the API is supposed to be filesystem agnostic. (also note that Windows considers file attributes to be different from ACLs) It's the same thing as Linux making up permissions on Windows drives - the API has to act the same way regardless of the underlying filesystem.
Some attributes will be unavailable on certain filesystems (e.g. FILE_ATTRIBUTE_ENCRYPTED won't be supported on FAT volumes).

That is, it will need a way to encode the NTFS value and another way to say "not present" for all the attributes not stored on an exFAT file system

If you're strictly designing to an API, such filesystem-level details shouldn't be relevant (which is the whole point of a unified API).
Of course, I don't think being strictly tied to the API makes sense, just like one shouldn't be strictly tied to any particular filesystem. I'm just saying that APIs make a better basis to start from than filesystems do, because they're closer to the reality that an application/developer will see.

So, for this API-centric design, we have 3 permission packet types: Linux's default API, Linux's special FAT API, and Windows' API

I don't see why 3 types are needed - Unix and Windows is sufficient. A client running on Linux that wants to obtain Windows-specific values will need to translate them to/from the Windows format.

This will also cater for potential differences in filesystem drivers (e.g. FUSE-mounted ntfs-3g driver vs Linux in-kernel NTFS driver).

But, if we want an API-centric design, we should keep the timestamps in the format of their respective APIs

My recommendation would be to use a single timestamp format which works reasonably well with all APIs.
Having different formats can get messy with the number of cross translations required, particularly for clients which don't even use a low level API (meaning that there's no "native" format for them).

The Par3 files on an exFAT drive will be the same for a Linux system, a Windows system, and for any other OS that runs a Par3 client.

If you go with two types, Unix and Windows, then your example should end up being the same across all platforms, provided the client can retrieve Windows details in non-Windows environments - all the info would get encoded into the Windows packet.

mdnahas commented 2 years ago

@animetosho, we've been discussing this for a few pages now. I've convinced myself that I've got the right answer. You keep arguing and I don't think you're dumb, so either I'm not saying the right things to convince you or there's some sticking point that you have that I'm not seeing.

I think talking in principles and concepts has not moved us along. I think we need a concrete counter-proposal.

I believe you want something like:

Windows-API Permissions packet:

Linux-API Permissions packet:

OR

POSIX-API packet ... POSIX doesn't specify the size of times (32-bit or 64-bit) nor user/groupIDs, so we'd have to pick some values ... POSIX doesn't support "xattrs" (which is supported by Linux, MacOS, FreeBSD, ...)

Is this the counter-proposal you are advocating, @animetosho ? You have also mentioned putting the timestamps in a single common format in one place, so, if you want that, feel free to change this counter-proposal. But, whatever you decide, please make it one single counter-proposal and be specific in every detail.

For whatever packet structure your counter-proposal uses, can you please answer the following questions? Please be specific.

If a non-Windows machine accesses a FAT/FAT32/exFAT drive, would your counter-proposal expect the client to store "Windows-API Permissions" packets, even though they're using a non-Windows API? Should a UNIX client also write the Linux/POSIX-API Permissions packets?

If a non-Windows machine accesses an NTFS drive, would your counter-proposal expect the client to store "Windows-API Permissions" packets, even though they're using a non-Windows API? Should a UNIX client also write the Linux/POSIX-API Permissions packets?

If a Par3 file is allowed to store both Windows-API Permission packets and Linux/POSIX-API Permissions packets for the same file/directory, how does the client decide which to use?

If a Windows client encounters a Par3 file with only Linux/POSIX-API Permissions packets, what would you recommend the client do?

If a UNIX client encounters a Par3 file with only Windows-API Permissions packets, what would you recommend the client do?

When using a FAT file system, Windows manufactures metadata, like "FILE_ATTRIBUTE_NOT_CONTENT_INDEXED", etc.. That data is not stored on the drive. If the encoding Par3 client reads the manufactured metadata and stores it in the Par3 file, do we expect any problems when the decoding Par3 client uses it and calls SetFileAttributesA()?

Similarly, for FAT and NTFS file systems, Linux manufactures metadata for its usual system calls. For example, execute permission bits. That data is not stored on the drive. If the encoding Par3 client reads that manufactured metadata and stores it in the Par3 file, do we expect any problems when the decoding Par3 client uses it and calls utimensat()?

If we want to add support to Par3 for other NTFS metadata like owner/access-control-lists, alternative filenames, additional data streams, etc., should the new packet type completely replace the Windows-API Permissions packet or do we send the new packets in addition to the Windows-API Permissions packet?

If another OS becomes popular (Fuchsia, etc.), do we add a new packet type?

If a new file-storage system becomes popular (e.g., key-object stores, local versions of Amazon S3, etc.) with different permissions, do we still use the Windows-API and Linux/POSIX API?


In the goals for Par3, I said that we wanted to do permissions, but we didn't have to do them perfectly. Just good enough for users. But permissions are also something where, if we mess them up, it will break users' expectations and, possibly, security. So I'm willing to spend some time getting this right. (But we've spent a lot of time.)

If, @animetosho, you're worried about the details of implementing the current draft specification --- like how does a Windows client determine if it's on a FAT or NTFS filesystem --- we can wait until someone starts writing a reference implementation of the client. There are going to be details in the specification that will change when we actually write code, because there are always details we didn't foresee or didn't put enough importance on. If you think this discussion will be clearer at that point, we can keep the current draft Permissions packets and restart the discussion then.

mdnahas commented 2 years ago

I tried to lock down some of the remaining details in the specification:

For the license, I have contacted the Software Freedom Law Center to help with our license. They are a non-profit legal group that helps open-source projects. I believe we will want to claim the trademarks/service marks of "Parchive", "Par3", ".par3", etc., so that no commercial company can claim them or say "Par3 compatible" without following the specification. The Software Freedom Law Center currently takes care of the trademarks for Git, Wine, and Inkscape, so they seemed like the right place to contact. I'll let you know what they say.

For a 64-bit rolling-hash, I suggest we use CRC-64-XZ
It is defined here: https://tukaani.org/xz/xz-file-format.txt

There are not many 64-bit CRCs, but this appears to be the most common and best specified. I didn't see rolling hash implementations of it, but it cannot be too hard to modify to make it a rolling hash. (We'll see when we try to implement it.)
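For what it's worth, here is one way the rolling variant might be built, relying on the linearity of a zero-initialized CRC. This sketch uses the reflected CRC-64-XZ polynomial but omits the initial/final inversion that CRC-64-XZ specifies; a client would apply those around the raw value:

```cpp
#include <cstdint>
#include <cstddef>

// Reflected CRC-64-XZ polynomial. The rolling trick below works on the "raw"
// CRC (zero init, zero xor-out); CRC-64-XZ itself also inverts the register
// before and after, which can be layered on top of this value.
constexpr uint64_t POLY = 0xC96C5795D7870F42ULL;

// Normal bitwise update of the raw CRC with one byte.
uint64_t crc_byte(uint64_t crc, uint8_t b) {
    crc ^= b;
    for (int i = 0; i < 8; ++i)
        crc = (crc & 1) ? (crc >> 1) ^ POLY : (crc >> 1);
    return crc;
}

// out_table[b] = raw CRC of the byte b followed by 'window' zero bytes.
// Because the raw CRC is linear over GF(2), sliding the window one byte is:
//   new = crc_byte(old, incoming) ^ out_table[outgoing]
void build_out_table(uint64_t out_table[256], size_t window) {
    for (int b = 0; b < 256; ++b) {
        uint64_t crc = crc_byte(0, (uint8_t)b);
        for (size_t i = 0; i < window; ++i)
            crc = crc_byte(crc, 0);
        out_table[b] = crc;
    }
}

uint64_t roll(uint64_t crc, uint8_t outgoing, uint8_t incoming,
              const uint64_t out_table[256]) {
    return crc_byte(crc, incoming) ^ out_table[outgoing];
}
```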

For the fingerprint hash, the likely candidates are still Blake3 and KangarooTwelve. Both are new, but based on existing algorithms. Newness is bad for security, but we don't care about security as much as uniqueness and speed. And both seem to be fast and unique.

Google has more hits for "blake3", "blake3 library", and "b3sum" than "KangarooTwelve...". There seem to be implementations of both for most languages. Either will probably serve us. I am partial to Blake3 because I can understand its paper better than I can understand KangarooTwelve's paper.

Thoughts on which fingerprint hash?

For simplicity, I have changed the specification to use 16-bytes of the fingerprint hash everywhere, except for the InputSetID, which is 8-bytes. That will increase per-block overhead by 25%, which is a lot, so I may change my mind back.

I'm thinking of dropping the rolling hash for the "tails" of chunks. It is hard to imagine a client using it.

Yutaka-Sawada commented 2 years ago

For a 64-bit rolling-hash, I suggest we use CRC-64-XZ

It's called CRC-64-ECMA (ECMA 182) on the Wikipedia page about CRCs. It's possible to implement a rolling hash with any polynomial; only the calculation speed differs.

So, I prefer CRC-64-ISO (ISO 3309) because it is faster than CRC-64-ECMA. While the complex CRC-64-ECMA polynomial requires a table look-up, the simple CRC-64-ISO polynomial doesn't when hashing. (But both polynomials need a table for sliding-window search anyway.) Though CRC-64-ISO is considered weak when inputs differ by only a few bits, this weak point is ignorable in our usage. We already have another (strong, complex, but slow) fingerprint hash, so what we need from the rolling hash is speed. If we select a relatively slower polynomial for the CRC, the benefit will decrease.

Thoughts on which fingerprint hash?

I don't know the algorithm details. Faster will be better for users. Most users don't care about the quality of the fingerprint hash.

I'm thinking of dropping the rolling hash for the "tails" of chunks. It is hard to imagine a client using it.

It's OK for me. Normally a file has many blocks and one tail in a chunk. When a file is larger than the block size, the tail can be found from the relative position of the preceding blocks.

But there is a bad case for small files. If a file is smaller than the block size, the file consists of only one tail chunk. In this case, a PAR client won't be able to find the small file inside packed data, such as an uncompressed archive file. This problem is a headache when verifying with PAR2 files, too.

Then I came up with a simple solution. There is a "hash of the first 16kB of the file" item in the File Packet Body Contents. The hash is currently a 16-byte fingerprint hash. If you change it to a rolling hash, we will be able to find small files by the rolling hash of their first 16kB.

This is also good for file size, because a rolling hash is smaller than a fingerprint hash. If the first 16kB of two files gives the same rolling-hash value, we can then confirm the data with the chunk's fingerprint hash. So, there is no collision problem.

malaire commented 2 years ago

mdnahas:

There are not many 64-bit CRCs, but this appears to be the most common and best specified. I didn't see rolling hash implementations of it, but it cannot be too hard to modify to make it a rolling hash.

I've recently made a simple implementation in Rust which includes both CRC-32C and CRC-64/XZ rolling hash as they are nearly identical in implementation: rolling-dual-crc

Yutaka-Sawada:

It's called CRC-64-ECMA (ECMA 182) on the Wikipedia page about CRCs.

Wrong - CRC-64-ECMA and CRC-64-XZ use the same polynomial but otherwise are different algorithms and not the same. See e.g. the Catalogue of parametrised CRC algorithms for their differences.

That Wikipedia page also says before the table:

The table below lists only the polynomials of the various algorithms in use. Variations of a particular protocol can impose pre-inversion, post-inversion and reversed bit ordering as described above.

CRC-64-ECMA and CRC-64-XZ differ in all of those (pre-inversion, post-inversion, bit ordering).

animetosho commented 2 years ago

@mdnahas Sorry if I'm sounding repetitive - I'm trying to point out things which may be of interest, whilst avoiding being specific so as not to be prescriptive.
If you've already made up your mind, then there's really little for me to add.

But since you asked, here's some of my thoughts to answer your questions.

In the goals for Par3, I said that we wanted to do permissions, but we didn't have to do them perfectly

I actually felt that you're trying too hard to be accurate - I think a less accurate model simplifies things. Have a look at other archive formats and you'll see that their handling of permissions/attributes often doesn't aim for maximum accuracy.

Is this the counter-proposal you are advocating

Windows Packet:

Unix Packet:

Add to File Packet:

If a non-Windows machine accesses a FAT/FAT32/exFAT drive, would your counter-proposal expect the client to store "Windows-API Permissions" packets, even though they're using a non-Windows API?

They may choose to do so, or choose not to.

Should a UNIX client also write the Linux/POSIX-API Permissions packets?

As above - they can choose to do so if desired.
Clients are encouraged to pick one or the other (or pick neither) unless they really believe that it makes sense to include both Windows and Unix info.

If a non-Windows machine accesses an NTFS drive, would your counter-proposal expect the client to store "Windows-API Permissions" packets, even though they're using a non-Windows API?

Same behaviour as FAT based partitions - they may choose to do so or not.

Should a UNIX client also write the Linux/POSIX-API Permissions packets?

As above.

If a Par3 file is allowed to store both Windows-API Permission packets and Linux/POSIX-API Permissions packets for the same file/directory, how does the client decide which to use?

The client is free to decide how to interpret this. The recommended approach would be to prioritize the native packet and disregard the other, but if the client believes there is benefit to doing otherwise, they may choose to do so.

If a Windows client encounters a Par3 file with only Linux/POSIX-API Permissions packets, what would you recommend the client do?

That's up to the client to decide. I'd recommend ignoring the Unix packet, but the client could, if it chooses, try to translate permissions to NTFS ACLs (to be compatible with MSYS/cygwin, for example).

If a UNIX client encounters a Par3 file with only Windows-API Permissions packets, what would you recommend the client do?

Similar to above.

When using a FAT file system, Windows manufactures metadata, like "FILE_ATTRIBUTE_NOT_CONTENT_INDEXED", etc.. That data is not stored on the drive. If the encoding Par3 client reads the manufactured metadata and stores it in the Par3 file, do we expect any problems when the decoding Par3 client uses it and calls SetFileAttributesA()?

The decoding client will need to be aware of what flags it can or cannot set (or even whether or not it wishes to interpret attributes at all). The client should not pass around flags without scrutinizing them.

Similarly, for FAT and NTFS file systems, Linux manufactures metadata for its usual system calls. For example, execute permission bits. That data is not stored on the drive. If the encoding Par3 client reads that manufactured metadata and stores it in the Par3 file, do we expect any problems when the decoding Par3 client uses it and calls utimensat()?

As above - how this is handled is up to the client's discretion. It is recommended that the client treat any violation (or inability to set properties) as a low priority problem - i.e. the client should not error out if the properties don't make sense or cannot be set.

If we want to add support to Par3 for other NTFS metadata like owner/access-control-lists, alternative filenames, additional data streams, etc., should the new packet type completely replace the Windows-API Permissions packet or do we send the new packets in addition to the Windows-API Permissions packet?

I'm not sure how you wish to do versioning, but I'd imagine you'd just make a new packet which includes this extra info. The existing Windows Packet remains as is for backwards compatibility, and is included alongside the new packet.

If another OS becomes popular (Fuchsia, etc.), do we add a new packet type?

If the OS follows either the Windows or Unix model reasonably well, and either is deemed sufficient, no new packet type is necessary. If not, clients can always choose to ignore attributes entirely on the platform.
If supporting these attributes is desired, then a new packet type would need to be developed.

If a new file-storage system becomes popular (e.g., key-object stores, local versions of Amazon S3, etc.) with different permissions, do we still use the Windows-API and Linux/POSIX API?

The client is free to pick whatever it feels most accurately models the metadata.
For the time fields (since they are mandatory), it can also use whatever it feels is the most representative (for example, if the source does not provide a last access time, it can choose to use the current time).

If storing such metadata is desirable, a new packet type, based on the API, would need to be created.

Actually, this is a key problem with filesystem based models - it doesn't cater well for cases when there is no underlying filesystem (as far as the user can see).


The Windows packet only storing a single number seems like unnecessary packet overhead. I'm mostly trying to stick close to your design, but if that is an issue, the attributes field could perhaps be merged into the files themselves (which would also mean that you couldn't have both Unix and Windows attributes on the same file, which I don't think is much of an issue).

about the details of implementing the current draft specification --- like how does a Windows client determine if it's on a FAT or NTFS filesystem

That does require the client to have the ability to access that function though (higher level languages will often not permit direct access to lower level APIs). It's also unknown how well this works against locations that aren't directly mounted (e.g. network shares or some other obscure file mounting system).

mdnahas commented 2 years ago

@malaire I'd say that table on Wikipedia could use a better layout. I read it the same way as @Yutaka-Sawada, and didn't notice the comment about ECMA-182 vs. XZ.

@malaire You seem to know something about rolling hashes. Do you have an opinion on a good 64-bit rolling hash? (Or possibly, a pair of good but different 32-bit rolling hashes?) I agree with @Yutaka-Sawada that CRC-64-ISO (ISO 3309)'s problems are not going to affect our output and it looks fast (few bits in the polynomial) and it is an ISO standard.

@Yutaka-Sawada Changing the 16kB hash to a rolling hash is a very interesting idea! It not only saves space, it also reinforces the idea that the 16kB hash is not unique! Bravo!

Let's assume we do that. Do we still keep the rolling hash for the "tails" of files? The rolling hash doesn't work well, because the tails are different lengths.

I've been reading about content-defined chunking lately, and those algorithms all work with a really small window, like 64 bytes. What if the rolling hash for tails isn't over the full length of the tail, but over a small fixed-size window that identifies the start of the tail? Then, if a client is trying to find tails, it can do a rolling hash with the small window and, when it finds the start of a tail, it can use the fingerprint to identify the tail.

So, the chunk description for a tail would be:

That adds a lot of complication for "tails" and I don't think most clients will search for tails. But it does make it more likely that an aggressive client can actually find them.

Does that make sense? Is it a good thing to do?

mdnahas commented 2 years ago

@animetosho Thanks --- I hadn't seen the chattr program before. It looks like those flags are read and set via this API call, ioctl. (More on this lower down.)

@animetosho I asked "If a non-Windows machine accesses a exFAT drive, would you expect the client to store "Windows-API Permissions" packets?" and you said "They may choose to do so, or choose not to." But that's not an expectation. You later say "Clients are encouraged to pick one or the other". But that's not very specific.

I think it only makes sense if a Par3 file for an exFAT drive stores the Windows-API permissions. Those are the permissions on the exFAT file system. If Par3 is going to preserve the input files and their permissions, it has to preserve the permissions that exist and those are the Windows-API ones. For a UNIX machine accessing the exFAT file system, I don't think it makes sense to store the UNIX-API permissions.

And if a UNIX machine is storing the Windows-API permissions, that means the file system matters more than the API.

I do think this discussion has been useful, because the API is important and looking into the APIs has opened up questions.

I think I'm against storing Linux's ioctl inode flags. First, it is Linux-specific. Second, it seems to be a step beyond "good enough". I actually wasn't sure we should include "xattr", since it is not part of GNU's C Filesystem library. But "xattr" seems to be supported in basically the same format by many UNIX OSes and some common applications were using it. So, I decided to include it. And I doubt many common applications are using ioctl_iflags. (And, after all, clients can always stick that data in their own custom packet.)

I'm okay with dropping the "...UtcOffset" fields in the FAT/exFAT packet. They are hard to access via the API, even on Windows, and I don't think many (if any!) applications are relying on them.

As for what to put in an NTFS Permissions packet, I'm not a Windows expert. Someone else will have to answer that. When I looked at NTFS metadata, I had a hard time finding an accurate description of it and how to access it. I would have thought "owner" was at least easy to determine, but I spent too much time getting into details. Perhaps we should look at the arguments to the CreateFileA system call.

If there isn't any additional NTFS metadata that we want to store in this first version of Parchive 3, I'm fine with combining the FAT/exFAT and NTFS Permissions packets into a single format. For FAT/exFAT files, it looks like the higher bits of NTFS's attributes can be set to zero without any bad effect. It seems like a "good enough" solution for this version of the specification. If we add a more detailed NTFS packet in the future, we may also issue a new FAT/exFAT packet that stores less data.

Yutaka-Sawada commented 2 years ago

Wrong, CRC-64-ECMA and CRC-64-XZ use same polynomial but otherwise are different algorithms and not same.

Thanks, Mr. Markus Laire, for the correction. I looked at each polynomial only and didn't consider the other differences. To use a CRC as a rolling hash, I prefer to take the polynomial of CRC-64-ISO without the "initial value" or "final xor value". That will be simple and fast.

Let's assume we do that. Do we still keep the rolling hash for the "tails" of files? Does that make sense? Is it a good thing to do?

Personally, as a developer of a PAR2 client, I think rolling hashes for the "tails" will be worthless once small files can be found by their 16kB hash (rolling hash). While a rolling hash of a "tail" is useful to locate its position, that is also possible by relative position from the other blocks. My PAR2 client finds the last small slice this way. If all the other preceding blocks are lost, it cannot locate the tail slice by relative position. But in that disastrous case, it will be easy to recover the whole file anyway.

For example, suppose a file consists of 100 source blocks. When 99 slices are known to be damaged and only the last slice happens to be complete, is it worth trying to find that tail slice? Just recovering all 100 blocks would be easy, though it requires enough recovery blocks.

malaire commented 2 years ago

@mdnahas

You seem to know something about rolling hashes. Do you have an opinion on a good 64-bit rolling hash?

I know a bit about CRC math, but not anywhere near enough to know which are good/bad. I just chose CRC-64/XZ because it has the same parameters as CRC-32C (except for size and polynomial) so they were easy to implement together, and it is used by xz, giving it some credibility.

@Yutaka-Sawada

To use a CRC as a rolling hash, I prefer to take the polynomial of CRC-64-ISO without the "initial value" or "final xor value".

OK, but that won't be true CRC-64-ISO, so the documentation should clearly mention that.

animetosho commented 2 years ago

But that's not an expectation. You later say "Clients are encouraged to pick one or the other". But that's not very specific.

It's not meant to be prescriptive. Supplied metadata are hints - there's all sorts of reasons clients may choose to or not to include them or use them, from users not wanting to store them, to clients not being able to deal with them. Only the client is able to determine what's the most appropriate course of action, based on all the variables, not the specification.

I would have thought "owner" was at least easy to determine

Here's example code to do that; the key function for the specification would be LookupAccountSid (basically you need a domain, name, and owner type).
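For illustration, a rough (untested) sketch of the two calls together; link against Advapi32:

```cpp
#include <windows.h>
#include <aclapi.h>
#include <cwchar>

// Sketch: get a file's owner SID via GetNamedSecurityInfo, then resolve it to
// a domain\name pair with LookupAccountSid.
int print_owner(const wchar_t *path) {
    PSID owner = nullptr;
    PSECURITY_DESCRIPTOR sd = nullptr;
    if (GetNamedSecurityInfoW(path, SE_FILE_OBJECT, OWNER_SECURITY_INFORMATION,
                              &owner, nullptr, nullptr, nullptr, &sd) != ERROR_SUCCESS)
        return -1;

    wchar_t name[256], domain[256];
    DWORD nameLen = 256, domainLen = 256;
    SID_NAME_USE use;
    if (LookupAccountSidW(nullptr, owner, name, &nameLen,
                          domain, &domainLen, &use))
        wprintf(L"owner: %ls\\%ls (type %d)\n", domain, name, (int)use);

    LocalFree(sd);   // the owner SID points into this descriptor
    return 0;
}
```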

mdnahas commented 2 years ago

I agree with @Yutaka-Sawada that the first and last steps of the CRC are stupid -- why do they add those?! But, if we're going to choose a standard, we'll follow the standard. Because client authors want to be able to use libraries. It's also how we debug, because we can check the output of our CRC code against known values. So, I'll make the rolling hash CRC-64-ISO.

Thanks, @malaire for your input!

mdnahas commented 2 years ago

It's not meant to be prescriptive. Supplied metadata are hints - there's all sorts of reasons clients may choose to or not to include them or use them, from users not wanting to store them, to clients not being able to deal with them. Only the client is able to determine what's the most appropriate course of action, based on all the variables, not the specification.

I wouldn't call the metadata "hints". They're data. They're optional, but it's not a "hint" that the file was read-only or executable on the encoding client's machine. To me, the word "hint" means something might or might not be true. E.g., the number of recovery blocks is a "hint" because it might or might not be true. The receiver of a hint may be able to speculatively execute code to speed things up, but needs to verify the true value before committing its computation.

I agree that the client should have freedom. The client should be able to adapt to its users and uses, within the domains of the specification. That is why I tried to use the softer phrase "would you expect...". Perhaps that didn't come across as soft. Perhaps I should have asked what you would do with your client.

I would have thought "owner" was at least easy to determine

Here's example code to do that, the key function for the specification would be LookupAccountSid (basically you need a domain, name and owner type).

@animetosho Do you think it is valuable to include the Windows file owner's name? CreateFileA takes a "SecurityAttributes" argument which, I think, is a pointer to the SECURITY_DESCRIPTOR data structure. That holds an owner ID and group ID and two access control lists (ACLs). If we're going to include the owner's name, we should probably store the other parts too, right? I'm okay with just storing Windows' "attributes" and the timestamps. Will users care about more than that? Will they care about owner, group, and the ACLs?

animetosho commented 2 years ago

Perhaps I should have asked what you would do with your client.

ParPar is built on Node.js. Node.js provides no way to get info on the underlying filesystem (even the amount of free space), so detecting the filesystem isn't possible. The API only provides Unix permissions, even when running under Windows, so the best an application could do is to only include these Unix permissions when not running under Windows.
This means that I'd expect such clients to include the made-up Unix permissions for FAT mounted drives. This doesn't seem unusual to me - I'd expect various other tools like tar to do exactly the same.

Many such programming environments often treat Windows as a second-class citizen, so support for Windows specific functionality is often missing. Filesystem detection is also often iffy, particularly with what you can do with virtual filesystems. Even if it works well, there's always the chance that the client doesn't recognise all filesystems - e.g. a client built without knowing about ReFS may not know to treat it as a Windows filesystem.

In such programming environments though, the restrictions can be worked around via native extensions, third party libraries, or executing system utilities. All of these have their pros/cons, but it's understandable if a developer wishes to use none of these. (unless you're operating in a restricted environment, e.g. WebAssembly)

Do you think it is valuable to include the Windows file owner's name?

Many archive formats don't include them, so I don't think users generally expect them to be well supported anyway.

If we're going to include the owner's name, we should probably store the other parts too, right?

It's weird to include just the owner without ACLs, so I'd agree.

mdnahas commented 2 years ago

ParPar is built on Node.js. Node.js provides no way to get info on the underlying filesystem...

Well, that's a pretty restrictive API. I'm not even sure that storing permissions makes sense there.

animetosho commented 2 years ago

Well, that's a pretty restrictive API.

Hardly unusual though - take a look at APIs from other platform-independent languages/environments (e.g. Python) and you'll see they also commonly lack support for retrieving filesystem info.

mdnahas commented 2 years ago

Java has very limited file permissions (UNIX-ish but less than UNIX). Python has pywin32 / win32security. JavaScript was meant to run in a browser. I'm shocked it has files at all. ;)

But the platform-independent environments always have issues like this. That's the trade-off in using one. Each has its own quirks and I'm not sure we can make a spec that plays nice with all of them. I'd rather have a spec that is fully expressive at the low level, so that people can use the file-system features if they want to. The file permissions are optional, after all.

mdnahas commented 2 years ago

New draft of the specification.

Big changes:

I reordered a few things.

The rolling hash is still CRC-64-XZ. This is because it has a clear, available specification. The specification for CRC-64-ISO costs money. And it has been "withdrawn". (I'm guessing that means superseded by a new specification.) The closest things to a specification for it were this page and this page. If you can find a good spec for CRC-64-ISO (or another hash), I'll change it. If you have simple code to define the CRC, I would probably accept that.

The 16kB hash is now the rolling hash, instead of the fingerprint hash.

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes. Most filesystems only support 255 byte names.

I changed the length of a path to 2 bytes. In Linux, paths are limited to 4kB and, in Windows, to 260 bytes.

I changed the length of the owner name and group names to 1 byte. The maximum allowed length on Linux is 32 bytes.

I am very happy with the specification. I feel like most of the pieces are in place. After your comments, I am likely to take "UNFINISHED" off the front of the specification and proclaim this the "BEFORE-REFERENCE-IMPLEMENTATION" version and put the specification on the website. Obviously bugs will be found and worked out during the first implementation of it. But I think we're at the point that we go to code.

BTW, I'm going to be starting a new business next week. I won't have time to write the reference implementation. I will be available to review code and write some unit tests.

Does anyone want to take responsibility for writing the reference implementation?

Par3_spec.txt

animetosho commented 2 years ago

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes. Most filesystems only support 255 byte names.

Note that 255 bytes under one encoding may end up longer than 255 bytes when converted to UTF-8. Particularly likely in encodings which use 2 bytes/character that end up as 3 bytes/character in UTF-8 (such as some East Asian character sets).

in Windows, to 260 bytes.

Windows actually supports 32767 UCS-2 characters (65534 bytes) in paths via UNC naming.

I think some UCS-2 characters can map to >2 byte encodings in UTF-8, but I doubt exceeding 64KB in a path name is likely.

BTW, I'm going to be starting a new business next week

Congrats on getting this far, and hope the business works out well!

Yutaka-Sawada commented 2 years ago

I changed filename lengths to 1 byte. I'm not sure why it was 2 bytes.

As Mr. Anime Tosho wrote already, UTF-8 encoding may exceed the limit. I tested filename length on my PC (Windows 10). I can set a maximum of 239 Japanese Unicode characters for a filename with Windows Explorer. Each character is encoded as 3 bytes in UTF-8. So, the UTF-8 encoded filename length becomes 239 * 3 = 717 bytes. Thus, the filename length field requires 2 bytes.

Fingerprint hash is Blake3.

I tried to compare the speed of hash algorithms on my PC. While BLAKE3's official implementation works well, I could not compile KangarooTwelve's sample code with Microsoft Visual Studio 2019. Does someone know where to get KangarooTwelve's optimized source code for MSVC? Anyway, BLAKE3 is much faster than MD5 when SSE2/SSE4.1/AVX2/AVX512 are available. BLAKE3 is 5 times faster than MD5 on my PC with AVX2.

About the rolling hash: CRC-64-XZ is fast enough with CLMUL (the carry-less multiplication instruction set). Because most current Intel/AMD CPUs have this feature, there should be no problem. I put an example of the speed difference on my PC below. Note that speed varies with the CPU's extensions and SIMD support. This result is just a sample from one recent CPU. (It may differ on older or newer CPUs.)

Tested with Intel Core i5-10400 2.9 GHz (single thread)

CRC-32 (table lookup) : 636 MB/s
CRC-32 (with CLMUL) : 5571 MB/s

CRC-32-C (SSE4.2, 32-bit) : 4739 MB/s
CRC-32-C (SSE4.2, 64-bit) : 9132 MB/s

CRC-64-XZ (table lookup) : 656 MB/s
CRC-64-XZ (with CLMUL) : 6097 MB/s

CRC-64-ISO (table lookup) : 653 MB/s
CRC-64-ISO (64-bit) : 3875 MB/s
CRC-64-ISO (with CLMUL) : 6097 MB/s

MD5 : 692 MB/s
BLAKE3 (with AVX2) : 3875 MB/s

malaire commented 2 years ago

Even old CPUs can get good speed. I have a 9-year-old CPU and get these numbers:

Tested with Intel Core i5-3570K 3.4 GHz (single thread)

CRC-32-C (8 kiB slicing-by-8 table lookup): 2400 MB/s
CRC-32-C (hardware): 15000 MB/s

CRC-64-XZ (16 kiB slicing-by-8 table lookup): 1700 MB/s
CRC-64-XZ (hardware): 3200 MB/s (*)

*) Library I'm using claims up to 28200 MB/s but that probably requires newer CPU.

Of course, all of these are for the non-rolling variant. I don't know if the rolling hash can be sped up (it's 240 MB/s with table lookup).

malaire commented 2 years ago

By the way, will Par3 support only slow matrix-based O(n^2) Reed-Solomon, or is faster O(n log n) Reed-Solomon also supported?