Feature Request: Support the optional "Input File Slice" packet format

EmperorArthur commented 3 years ago

The ability to store the actual input file in Par2 would be extremely useful. As such, I propose that par2cmdline implement this optional packet type.

Original message

Hello,

I know @mdnahas is considering what to do with the future of Par2. I would like to throw my hat into the ring, and suggest two changes.

The first is relatively minor, but would have a major impact as far as using par2 as a recovery/transfer medium. Instead of creating multiple files, or dealing with chunk size mismatch (see #159), just add the option to include the original file in the archive. Call each chunk an "ORIGINALBLOCKPACKET". With recovery set at or over 100%, the system can already do this, but this formalizes Par2 as a basic packetized archival program.

The second leverages existing technology to solve @mdnahas's issue of what metadata to store, and how to store it. A "TARHEADERPACKET" with a single byte enum for the header type* followed by a tar file header would work perfectly, and would be extremely easy to implement.

With both of these changes, Par2 (or whatever it is called) would become a fully fledged archival program with all the power of Tar. At worst, it would have some slight overhead and would be the best archive format for anything that needs to be split. After all, right now, even if every single name of every single Par2 file is mangled, a simple cat *.par2 > good.par2 will create something that works perfectly. Extending that to archival needs means that the dangers of file name mangling when working with split archives go away.
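To illustrate why concatenation is harmless, here is a minimal sketch of a packet scanner. It assumes the packet header layout as I read it in the Par2 spec: an 8-byte magic, an 8-byte little-endian length covering the whole packet, a 16-byte MD5 taken from the Recovery Set ID through the end of the body, a 16-byte Recovery Set ID, and a 16-byte type. The `make_packet` helper and the "DEMO-TYPE" names are hypothetical and exist only to build test input; this is not how par2cmdline itself is structured.

```python
import hashlib
import struct

MAGIC = b"PAR2\x00PKT"   # 8-byte magic that starts every packet
HEADER_LEN = 64          # magic(8) + length(8) + hash(16) + set id(16) + type(16)


def make_packet(set_id: bytes, ptype: bytes, body: bytes) -> bytes:
    """Build a packet with a valid header (hypothetical helper for the demo)."""
    assert len(set_id) == 16 and len(ptype) == 16 and len(body) % 4 == 0
    hashed = set_id + ptype + body            # the region covered by the MD5
    length = struct.pack("<Q", HEADER_LEN + len(body))
    return MAGIC + length + hashlib.md5(hashed).digest() + hashed


def scan_packets(blob: bytes):
    """Yield (type, body) for every intact packet found anywhere in blob."""
    pos = blob.find(MAGIC)
    while pos != -1:
        header = blob[pos:pos + HEADER_LEN]
        if len(header) == HEADER_LEN:
            (length,) = struct.unpack_from("<Q", header, 8)
            end = pos + length
            if HEADER_LEN <= length and end <= len(blob):
                hashed = blob[pos + 32:end]
                if hashlib.md5(hashed).digest() == header[16:32]:
                    yield header[48:64], blob[pos + HEADER_LEN:end]
                    pos = blob.find(MAGIC, end)   # skip past the good packet
                    continue
        # Not a valid packet here: resume the search one byte further on.
        pos = blob.find(MAGIC, pos + 1)


set_id = bytes(16)
first = make_packet(set_id, b"DEMO-TYPE-A".ljust(16, b"\0"), b"abcd")
second = make_packet(set_id, b"DEMO-TYPE-B".ljust(16, b"\0"), b"efgh1234")
# File order and stray bytes between packets do not matter to the scanner.
joined = b"junk" + second + b"\xff\xff" + first
bodies = [body for _, body in scan_packets(joined)]
```

Because every packet carries its own length and checksum, the scanner never needs a file name or an index, which is the property the cat example relies on.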

I would love to hear others' thoughts on the issue. For example, this may provide some of the functionality @mkruer was thinking about.

* As python calls them:

USTAR_FORMAT
GNU_FORMAT
PAX_FORMAT
Any future format that may exist

EmperorArthur commented 3 years ago

Further examining things, a "TARINFOPACKET" per file would also be needed, and would supersede the "name" portion of a "FILEDESCRIPTIONPACKET".

For those that know Python, the docs provide a good reference to how Tar files work. https://docs.python.org/3/library/tarfile.html#tarinfo-objects
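As a rough sketch of what the body of the proposed "TARHEADERPACKET" could look like, the following uses Python's tarfile module to emit and re-parse a header. The packet layout (one format byte followed by the raw header) is only the proposal from this thread, the function names are hypothetical, and the sketch is limited to headers that fit in a single 512-byte block. GNU and PAX headers can span several blocks for long names, which this does not handle.

```python
import tarfile

BLOCK = 512  # size of one tar header block


def pack_tar_header(info: tarfile.TarInfo,
                    fmt: int = tarfile.USTAR_FORMAT) -> bytes:
    """Hypothetical packet body: 1-byte format enum, then the raw tar header."""
    header = info.tobuf(format=fmt)
    if len(header) != BLOCK:
        raise ValueError("sketch only handles single-block headers")
    return bytes([fmt]) + header


def unpack_tar_header(body: bytes) -> tuple[int, tarfile.TarInfo]:
    """Split the format byte from the header and parse the header."""
    fmt, header = body[0], body[1:]
    if len(header) != BLOCK:
        raise ValueError("sketch only handles single-block headers")
    info = tarfile.TarInfo.frombuf(header, tarfile.ENCODING, "surrogateescape")
    return fmt, info


info = tarfile.TarInfo("docs/readme.txt")
info.size = 1234
body = pack_tar_header(info)
fmt, parsed = unpack_tar_header(body)
```

Here `fmt` comes back as `tarfile.USTAR_FORMAT` and `parsed` carries the same name and size as `info`.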

mdnahas commented 3 years ago

Your "ORIGINALBLOCKPACKET" is already part of the spec. The appendix has a list of official packets that clients can choose to implement. It is called the "Input File Slice" packet.

I have considered using TAR (and ZIP and other archive formats) to store file/directory attributes and to merge multiple files into one file. An existing format could be a good choice, since the code for them is already working and they handle all sorts of cases.

But TAR is old. Its specification is very crufty. (And there are multiple specifications!) If you want to cry, look at TAR's checksums. There are none for the archive file. There isn't even one for each file! The block header has a checksum and it's computed in the weirdest way possible. Having looked at the spec, I am appalled that we're still using that tool!
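For readers who have not seen it, this is roughly how the header checksum works: it is the plain sum of the 512 header bytes, with the 8-byte checksum field itself counted as spaces, stored as an octal ASCII number. The sketch below recomputes it and compares against what Python's tarfile writes. The function names are mine, and the field offsets assume the ustar layout.

```python
import tarfile

CHKSUM = slice(148, 156)  # 8-byte checksum field inside the 512-byte header


def tar_header_checksum(header: bytes) -> int:
    """Sum of all header bytes, counting the checksum field as 8 spaces."""
    assert len(header) == 512
    return sum(header[:148]) + 8 * ord(" ") + sum(header[156:])


def stored_checksum(header: bytes) -> int:
    """The field holds octal ASCII digits padded with NUL and space."""
    return int(header[CHKSUM].strip(b"\0 ").decode("ascii"), 8)


header = tarfile.TarInfo("input1.txt").tobuf(format=tarfile.USTAR_FORMAT)
matches = tar_header_checksum(header) == stored_checksum(header)
```

Note that the sum covers only the header block. Nothing here looks at the file's data, so damaged contents pass unnoticed.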

I've looked for an alternative. I've probably read 10 archive file specifications and another 6 file system specs. I'm not happy with any of them.

I've actually started working on Par3 again last week. I've got a rough design, but the hard part is making the standard Par2 use case work well.

The rough design has two layers: merge and redundancy. The "merge layer" merges multiple files into a single file and the redundancy layer provides redundancy for the single file. So, in the standard Par2 use case, two input files ("input1.txt" and "input2.txt") would be merged into a single file and then redundancy is applied to the single file. Next, the sender sends "input1.txt" and "input2.txt" and a par3 file that contains the redundant data. If we assume the files are damaged in transit, how does the PAR client know to look in "input1.txt" and "input2.txt" for the data blocks? That information is in the upper layer, but it isn't available because that data is damaged.

Notice that this problem exists for any "merge layer", even if we use TAR or another existing archive format.

EmperorArthur commented 3 years ago

Thanks for the info. I had made some assumptions about the spec, and had missed the optional parts by jumping straight to implementation.

Metadata / merge layer recovery

Realistically, the big questions when it comes to the "merge layer" are:

How much data, relative to the rest of the archive does it take?
What type of data corruption are you concerned about?

Anecdotally, I find that three types of corruption/signal noise occur: intermittent, seemingly random, and mostly noise with an occasional good signal/data. The trick with intermittent noise is to determine an acceptable frequency and percentage. A classic example of low-percentage noise with debilitating frequency would be a RAM stick with a bad line. Sure, 255 out of every 256 bits are good, but that one bad bit, always in the same place, would destroy many recovery schemes.
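The bad-line example can be put into numbers with a small sketch. It assumes a simplified model in which exactly one bit in every `period_bits` is bad, always at the same offset, and it counts how many fixed-size blocks contain at least one bad bit. The function name and the block sizes are illustrative choices, not anything from Par2.

```python
def damage_report(total_bits: int, period_bits: int,
                  block_bits: int) -> tuple[float, float]:
    """Return (fraction of bits bad, fraction of blocks with a bad bit)."""
    # One bad bit per period, always the last bit of the period.
    bad_positions = range(period_bits - 1, total_bits, period_bits)
    bad_blocks = {pos // block_bits for pos in bad_positions}
    n_blocks = -(-total_bits // block_bits)  # ceiling division
    return len(bad_positions) / total_bits, len(bad_blocks) / n_blocks


one_mib = 8 * 1024 * 1024                    # 1 MiB expressed in bits
print(damage_report(one_mib, 256, 8 * 4096)) # 4 KiB blocks -> (0.00390625, 1.0)
print(damage_report(one_mib, 256, 128))      # 16-byte blocks -> (0.00390625, 0.5)
```

Under 0.4% of the bits are bad, yet every 4 KiB block is damaged, so a scheme that can only replace whole blocks would need 100% redundancy to recover.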

The industry standard practice seems to be using Forward Error Correction (FEC)* to deal with small errors as compared to the Par2 approach of recovery blocks. Of course, redundancy of metadata is still important, especially if there is a chunk just completely corrupted. However, you may consider just using an extremely tolerant ECC for the metadata and continue to use the complete redundant approach.

Depending on what metadata is not recoverable, even Par2 can (in theory) continue. The only packets it needs are "Input File Slice Checksum" and "Recovery Slice." I should just be able to point the code at all of the files manually (or all the files in a directory) and let it go. Each slice of the input file only consumes 20 bytes. That means you can have a significantly large number of slices before things start becoming an issue, even with redundancy.
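A sketch of both ideas follows: building the 20-byte per-slice entries, then locating slices in an arbitrary blob of bytes without any file names. It assumes, from my reading of the spec, that each entry is a 16-byte MD5 followed by a 4-byte little-endian CRC32 and that a short final slice is zero-padded before hashing. The function names are hypothetical, and the search tries every byte offset, which is simple but far slower than a rolling checksum.

```python
import hashlib
import struct
import zlib

ENTRY_LEN = 20  # 16-byte MD5 + 4-byte CRC32 per slice


def slice_entries(data: bytes, slice_size: int) -> list[bytes]:
    """One 20-byte entry per slice; the last slice is zero-padded."""
    entries = []
    for start in range(0, len(data), slice_size):
        chunk = data[start:start + slice_size].ljust(slice_size, b"\0")
        crc = struct.pack("<I", zlib.crc32(chunk))
        entries.append(hashlib.md5(chunk).digest() + crc)
    return entries


def find_slices(blob: bytes, entries: list[bytes],
                slice_size: int) -> dict[int, int]:
    """Map slice index -> offset in blob, checking every byte offset."""
    by_crc: dict[bytes, list[int]] = {}
    for index, entry in enumerate(entries):
        by_crc.setdefault(entry[16:], []).append(index)
    found: dict[int, int] = {}
    for offset in range(len(blob) - slice_size + 1):
        window = blob[offset:offset + slice_size]
        crc = struct.pack("<I", zlib.crc32(window))
        for index in by_crc.get(crc, ()):
            # The cheap CRC32 matched; confirm with the MD5 before accepting.
            if index not in found and \
                    hashlib.md5(window).digest() == entries[index][:16]:
                found[index] = offset
    return found


data = b"".join(i.to_bytes(4, "little") for i in range(256))  # 1024 bytes
entries = slice_entries(data, 256)
# Slices 2 and 0 embedded out of order in an unrelated blob.
blob = b"\x01\x02\x03" + data[512:768] + b"zz" + data[0:256]
located = find_slices(blob, entries, 256)
```

As for size, a 1 GiB input with 1 MiB slices has 1024 slices, so the checksum table is 1024 × 20 = 20480 bytes before any redundancy.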

* Yes, I know I am not saying anything new, but others may be reading as well.

Metadata format

Personally, I don't think you'll ever find a format that you will like for storing metadata. They all have issues. Part of the problem is also that different OSs have different fundamental and widely used formats. ACLs on Linux act similarly to Windows permissions, but there are issues with that analogy, and even within Windows, trying to preserve anything other than basic permissions while archiving a file is not easy.

Older Tar formats either ignored the issue, or went with the "key, value" approach to solving the problem. Which, unfortunately, is the best I can offer as well. Personally, if I were to go "key, value" I would use msgpack or JSON, but those didn't exist when any of the Tar formats were standardized.

EmperorArthur commented 3 years ago

One other note: I had forgotten about the joys of getfacl and ".facl" files, which are the standard for extended permissions on Linux.

mdnahas commented 3 years ago

I believe Access Control Lists (ACL) are just one example of extended attributes ("xattr"). "xattr" are key,value pairs.

If we intend to replace TAR and other archivers, I think the best approach is to have a subset of common file system attributes (e.g., read-only flag, create time, last modify time, etc.) and also store all of the original attributes. If someone extracts the archive on the same system, we can use the original attributes. If it is on a different system, we'll use the common subset.
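A minimal sketch of that two-tier idea, under several assumptions of mine: the "common" set is cut down to size, modify time, and a read-only flag (create time is left out because Python has no portable way to read it), the "original" set is the raw stat fields plus any extended attributes, and "same system" is crudely approximated by comparing `os.name`. The record layout and function names are hypothetical.

```python
import os
import stat


def collect_attributes(path: str) -> dict:
    """Record a portable subset plus whatever the local system offers."""
    st = os.stat(path)
    common = {
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
        "read_only": not (st.st_mode & stat.S_IWUSR),
    }
    native = {"st_mode": st.st_mode, "st_uid": st.st_uid,
              "st_gid": st.st_gid, "xattr": {}}
    if hasattr(os, "listxattr"):       # these calls exist only on Linux
        try:
            for key in os.listxattr(path):
                native["xattr"][key] = os.getxattr(path, key)
        except OSError:
            pass                       # e.g. filesystem without xattr support
    return {"platform": os.name, "common": common, "native": native}


def attributes_to_apply(record: dict) -> dict:
    """Prefer the original attributes on the recording platform."""
    if record["platform"] == os.name:
        return {**record["common"], **record["native"]}
    return record["common"]
```

A record made and extracted on the same platform yields both sets, while a record from elsewhere falls back to the three common fields.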

But even having that approach is not enough. E.g., some attributes may store a userid (a number) when you want the username (a string). Also, determining if a file is being extracted on "the same system" is complicated because some Linux filesystems support different features or different limits from other ones. But I think we can find something acceptable.

As for the "find a data block not in the Par3 file", yes, one solution is to have the user pass a directory or list of files to the PAR client. In general, I don't like changing a user's expectations about the interface, but that might be acceptable. Still, a user might have a problem if they want to provide redundancy to some files, but not all. E.g., fill up unused disk space with redundant data for only some files on your machine.

Another possible solution is to provide a "hint" in the packet on the redundancy layer. That is, a list of file names to check. But that could be a long list of file names. And it would only be a hint, which means we need a fallback strategy if it doesn't work.

I'm not sure there is a perfect solution to this problem. I think I've thought it through enough. I need to make a call, write up my ideas, and then listen to see if anyone else has an insight.
