Working on major Par2 changes. Name?

Parchive / par2cmdline

Official repo for par2cmdline and libpar2

http://parchive.sourceforge.net

GNU General Public License v2.0

726 stars 75 forks source link

Working on major Par2 changes. Name? #130

Open mdnahas opened 5 years ago

mdnahas commented 5 years ago

Hi everyone,

I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:

Reed-Solomon encoding with James S. Plank's correction
Tornado Codes by Luby

I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?

I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.

The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use a version of Linear Congruential Generator.

The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?

When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.

Mike

animetosho commented 5 years ago

Interesting!

You seem to be interested in retaining most of PAR2 - I presume, only changing the erasure coding used? What's the aim of your change? Just to improve encoding/decoding performance?

Have you looked at MultiPar's efforts and ideas with a PAR2 successor? https://www.livebusinesschat.com/smf/index.php?board=417.0
It seems like they're looking at a more substantial change to the format, rather than just an encoding change, but I wonder, wouldn't it be more worthwhile trying to do this instead of making a only-slightly-backwards-compatible PAR2 successor? Is there that much value in being able to verify data with an old PAR2 client?

As for your question, I feel a little uncomfortable with keeping the same extension. PAR2 is fairly old and well established, and as you've found, at least one client (possibly more) wouldn't handle such a change particularly well.

Thanks for your work nonetheless, I'm interested to see how it turns out.

mdnahas commented 5 years ago

I have two goals. The first is to fix the bug in the Reed-Solomon code that has existed since Par1. I tried to fix this in Par2 by using only certain base values. That lessened the problem, but didn't fix it.

My other goal is to start using low-density parity check (LDPC), which are no longer covered by patents in the USA. LDPC is much faster than Reed-Solomon, with a slightly increased chance of non-recovery. There are many forms of LDPC; Tornado Codes are simple ones. If I do the spec right, repair will work for many LDPC, not just Tornado codes. Then, people who implement the spec can play with multiple different encoding schemes.

One of my restrictions is that I want to do this change in about a week's worth of work. I have some free time, but not much. Having looked at the code, I really want to spend time cleaning up the code and making it work like a library, but don't have enough time to find all the bugs I would create by doing that. I think I can fix Reed-Solomon and add support for Tornado Codes in 40 hours.

I've tried to understand the author of MultiPar's efforts. But I haven't see a spec for their "Par3". I don't think they are devoted to open source code. I designed the Par2 file spec to be forward-compatible - it was designed to add features. I'm sure the Par2 spec could be improved by a redesign, but I don't see a huge win for it right now. Certainly not without a large number of people devoted to implementing it and checking the code.

animetosho commented 5 years ago

Thanks for the response and explanation!

So it seems like this is more of just an experiment for encoding schemes rather than some sort of end-user oriented, production-ready PAR2 successor?

Totally understand the lack of time. It's just that I don't think a PAR2 successor should really be rushed, particularly if the aim is to continue to support it for the next ~10 years or more. Parchive is often used for archival purposes (hence the name), so it's a format you generally don't want to be changing often, which is, unfortunately, often at odds with an experimental scheme.

Not trying to discourage change or contributions in any way, but I thought it should be pointed out nonetheless.
PAR2 does have quite a number of issues, some as you have noted, so addressing some them is certainly a welcome move.

But I haven't see a spec for their "Par3". I don't think they are devoted to open source code.

"PAR3" was never really a finalised spec, only a proposal at most. The MultiPar author was experimenting with some new ideas, and decided to call it "PAR3" and included it in MultiPar. Unfortunately, people actually started using it for non-experimental purposes, so he had to pull it.
If you're interested, here's the last version of the "PAR3" spec, last updated in 2010. I think the author did aim to have an open standard, but it didn't go anywhere. I think the new approach seems to be with more open discussion around it, hence the link I previously gave. At this stage, there hasn't been anything beyond ideas/discussion however.
In short, PAR3 was never more than some experimental implementation that didn't really go anywhere and had to be pulled. In some ways, it seems similar-ish to what you're proposing at the moment - e.g. if it's aim is more experimental, and if it's labelled as some PAR2 successor, end users may end up inadvertently relying on it, when they shouldn't be.

Regardless, I mention this as they seem to be the only people still looking into a PAR2 successor (as well as one of the few actively developed PAR2 implementations), so thought you/they may be interested in such a proposal that you're making. I'm not too sure if you'll get much of a response here, but if you put up your proposed specification, I can share a link to it over there to get some of their thoughts.

Again, I personally do appreciate the effort you've gone (and will go) to, and am interested to see what you come up with!

mdnahas commented 5 years ago

Thanks! I tried to keep tabs on the Par3 discussions back then, but I never saw a specification.

So their old proposed Par3 changed every packet. The changes include:

uses Reed Solomon FFT codes
drops MD5
adds two different 64-bit CRCs (to correct small errors ... but not to replace MD5 (?))
use 1 and 2 byte integers (some integer fields are variable length)
packet types are numbers, not 16-char strings
files are identified by integers, not MD5-16k hashes
pack the ends of files (parts less than a full slice) together

I'm not a fan of the last 4 changes - it increases code without specific benefits. In fact, using packet numbers rather than strings means it is difficult to add custom extensions and to debug things.

Does anyone have an analysis of how much time par2cmdline (or another client) spends doing each task? E.g., how much time in CRCs, MD5s, inverting the matrix, etc.. If MD5 is taking up too much time, maybe we should replace it. Otherwise, it seems foolish to drop backwards compatibility for non-recovery packets.

Does anyone have the "noe" program or "libnoe" library for computing Reed-Solomon with FFT? Or perhaps another implementation? I'm not tied to Plank's corrected version. Fixing that problem is certainly something I'm in favor of.

As far as "an experiment" vs "end-user oriented, production-ready PAR2 successor", I do not want to stall all development for a successor just because people use PAR2. We need to experiment. We should work on fixing Reed-Solomon and we should work on adding support for LDPC. We do need to be careful communicating to users what is untrusted. We also don't want to break "par2cmdline" for Par2, which is why I'm (1) not changing much, (2) adding unit tests and (3) asked another developer to review my code before it is merged.

animetosho commented 5 years ago

using packet numbers rather than strings means it is difficult to add custom extensions and to debug things

Going out on a limb here (I haven't read the proposal), I would imagine that the typical approach here is to allocate ranges for custom extensions. I also imagine that ease of debugging was not goal of the proposal either.

it seems foolish to drop backwards compatibility for non-recovery packets

That's an interesting opinion. Personally, I don't see much benefit in retaining backwards compatibility just for non-recovery packets, seeing as the whole point of PAR2 is the recovery packets. I suppose there's some use with using it to perform hash verification, but I don't feel that this is the primary aim of PAR2, and imagine that other methods (e.g. just storing a hash) are more frequently used for cases where only verification is necessary. The ability to verify is also limited, because most PAR2 tools also check the recoverability if bad blocks are found, which obviously won't work if a new type of recovery block is introduced; I don't recall the PAR2 specification even allowing for new types of recovery blocks, so there's no reason or expectation for clients to handle such a change in an ideal manner either.

You do mention that you have limited time to work on this, and hence want to minimize changes, which is totally understandable, but I see this more as a constraint rather than a benefit for minimizing changes.

it increases code without specific benefits

This doesn't really address your concerns, but here's a list of issues the MultiPar author has with PAR2, written 5 years after the PAR3 proposal, that he'd like to resolve with a PAR2 successor. May give some ideas on the reasoning behind the proposed PAR3 (e.g. using numbers to maybe reduce the size of the PAR archive, further enhanced by compressing filenames?).

Does anyone have an analysis of how much time par2cmdline (or another client) spends doing each task?

I only have experience with implementing a PAR2 encoder, so haven't looked much into decoding (so not much knowledge/experience implementing matrix inversion), and what I'm about to say relates to encoding.

As RS coding is relatively slow, generally the time spent hashing is negligible. However, if you replace RS with something faster, hashing will become a larger part of the overall speed to create/repair PAR2.

As far as I'm aware, all PAR2 clients (other than ParPar) perform an initial hash pass to compute the MD5/CRC32 of each file/block, during which, you're probably limited by I/O rather than CPU. NMVe SSDs, which are gaining mainstream traction, likely tip things the other way, though running MD5 in multiple threads is probably sufficient to keep I/O being the bottleneck. Another possible trick is to use a multi-buffer (SIMD) MD5 implementation, but I think only ParPar is the only PAR2 client which implements that.
MD5 is reasonably fast for a cryptographic hash. For cryptographic purposes though, it's not recommended due to existing collision attacks against it. The use of a cryptographic hash function may be interesting if one can rely on it being secure - for example, you could use it to uniquely identify files and clients might be able to make certain assumptions around it, although I imagine the benefits may be limited.

CRC32 is generally very fast, and modern x86/ARM CPUs have acceleration for it. I think rolling the CRC and checking for matches, for finding misaligned blocks during recovery, can be somewhat slow though. Rolling can also be problematic if you have to re-compute the MD5 for every potential match, which can lead to a kind of DoS attack against a PAR2 client attempting recovery.

Here's a discussion around hashing in PAR2 and successors.

I/O can be a limitation when processing large amounts of data which cannot fit in memory, due to the need for certain seek patterns. This may become less of an issue with the dying use of rotating media for storage though.

Note that par2cmdline isn't particularly a performance oriented implementation of PAR2. For most of its history, it's been a single threaded, non-SIMD, no-ASM implementation. There have been various forks which have added these performance features (e.g. par2-tbb, phpar2). It's only more recently that par2cmdline-mt was merged in, adding multi-threading to par2cmdline (but still no SIMD). MultiPar is more performance oriented, so obviously the author is more concerned about such issues with a proposed successor.

For some rough performance figures on the fastest implementations I know of, off the top of my head. On one thread of a 4.3GHz Skylake-X CPU (supports 512-bit SIMD via AVX512):

MD5: 800MB/s (OpenSSL ASM implementation)
MD5 multi-buffer: 3GB/s (I think? can't remember exactly now) using 128-bit AVX512VL. A full 512-bit AVX512 implementation should be a bit under 4x that, so potentially >10GB/s
CRC32: >25GB/s using the carry-less multiply PCLMULQDQ instruction (memory read bandwidth per core is around 17GB/s, so this will be limited by it for buffers which don't fit CPU cache). The more common "Sarwate" algorithm won't achieve that speed, but I'd guess is still at least 1-2GB/s.
GF16 multiplication: 45GB/s using AVX512's 512-bit VPSHUFB instruction. I seem to be L2 cache bound at that speed, so there's potential further gains by processing multiple blocks at a time, rather than the simple "multiply" function I have in ParPar. Despite this speed though, RS coding obviously requires multiple multiplies, depending on how many recovery blocks are generated. Very roughly speaking, for 1000 recovery blocks, you're probably looking at 45MB/s per core, which is likely going to be your primary bottleneck if not reading from a harddisk
NMVe SSD: around 3GB/s sequential read. 4K random QD32 is probably around 1GB/s
harddisk: typically 100-200MB/s sequential read, 0.2-0.5MB/s 4K random

I do not want to stall all development for a successor just because people use PAR2. We need to experiment

Oh definitely! It's just that that wasn't clear from your first post, considering that this repository is more a user-facing tool rather than for experimentation, and the questions asked around the name (which sounds more like an issue if you're proposing an actual standard rather than experimentation) initially gave me the impression otherwise. Hence my asking and confirming.
I'd personally prefer to keep such changes out of the main par2cmdline branch until something more final, otherwise someone is bound to misuse it. Interested parties can grab the experimental fork to test changes :)

Thanks again for the reply!

mdnahas commented 5 years ago

I don't worry about intentional MD5 hash collisions or denial of service. We use MD5 for uniqueness, but I don't think our users rely on that. (Or, if they do, they probably know what MD5 is and its security issues!) I do worry about buffer overflow and writing system files.

If we want people to implement Par3, it's easiest if they can just reuse the code from Par2. Why recode the wheel?

I agree that par2cmdline should not be tweeked for performance. It should be a clear roadmap for how to implement the standard. I actually don't think it does a good job of that - par2repairer.cpp is 2750 lines. I guess it needs to be multi-threaded (or use fork?) because it is compute intensive and everyone has multi-cores now-a-days.

Is GF16 multiply 45GBps or 45MBps? If it's 45MBps, then it and I/O are our bottlenecks. Changing to LDPC might improve things.

I'm not sure what I'll work on now. Maybe I'll just spend the week cleaning up par2cmdline. I hate modifying a working program without enough tests, because I may break an untested feature. (There's a lot of commandline options to test!) But a cleaner version would make it easier for people to copy the code and then improve performance of pieces.

If I skip additions and just improve par2cmdline's code, there's things like:

create a par2 library and have par2cmdline process commandline args and then call the library. (This is hard because the commandline class currently does things like look up the size of files!!)
automatically handle the case when the bug in Plank's Reed-Solomon appears. (I didn't see code to handle this. Is it there?)
print message if a packet is detected but type is not handled
make CRC and MD5 have the same interface (they are both hashes, but the class interfaces are very different). I'd like to encapsulate the many ~0 that are currently in the code without any explanation.
get rid of class variables for par2repairer and par2creator. Most are currently acting like global variables in a 2700 line program, rather than class vars. Because of this, it is hard to see where data is created and consumed.

As I said, these are huge changes and I'm loath to make them, because there's a good chance that I'll introduce bugs to a long-run tool.

animetosho commented 5 years ago

Is GF16 multiply 45GBps or 45MBps?

Multiplying a buffer by a constant can be done at 45GB/s using 512-bit SIMD. In other words, a single multiplication is 45GB/s. The implementation is based off Plank's "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions" paper (you can find his implementation here). My (much less readable) implementation extends his 128-bit SIMD code to use 512-bit SIMD (available on the latest Intel server processors), and adds a few small performance tweaks on top, effectively achieving around 4x the speed of Plank's implementation on CPUs with AVX512.
I've also investigated an alternative GF16 multiply kernel which relies on JIT'ing XOR code for each multiply. My current implementation seems to generally outperform Plank's algorithm, at the expense of a significantly more complex implementation (perhaps ridiculously complex in fact). It currently doesn't scale as well to 512-bit SIMD though (around 40GB/s peak), as it is mostly stalled by L2 cache, though I believe it can be further tweaked, and has the potential to be much faster.

Here's a benchmark I performed with some GF16 multiply techniques. The "lh_lookup" value refers to the technique of splitting the 16-bit multiply into low/high 8-bit multiplies in a 512-entry lookup table (basically the same method used in par2cmdline) - this is the fastest, portable, pure-C technique I know of, but as it doesn't use SIMD, falls behind the other techniques on performance. "shuffle" refers to my implementation of Plank's method above, at 128/256/512 bit SIMD widths. "xor" is the technique of using JIT to generate XOR sequences linked to above, at 128/256/512-bit SIMD widths. "memcpy" is just the standard glibc implementation of memcpy - mostly useful as a reference to give an indication of cache/memory bandwidth.

gfspeed

(looks like I accidentally chopped off the X-axis label, which is the buffer size used; you'll find a similar graph in Plank's paper, but his implementation only uses 128-bit SIMD)

However, in reality, it's unlikely that users create PAR2 with only a single recovery block. Which means that more multiplications need to be performed, proportional to the number of blocks to be generated. The example I gave was with 1000 recovery blocks, which means 1000x region multiplies for each input block, hence giving the (very rough) 1000x overall speed decrease (=45MB/s per core).

Note that this speed was achieved on a very recent, high power Intel CPU with AVX512 running at a high clock rate. Most users do not have access to such powerful hardware, so the figures I'm giving are somewhat at the upper-end of the scale of what's possible on a CPU today (a more typical x86 processor today would probably get half that speed). At the other end of the scale, a low power ARM CPU (with NEON (SIMD) support), for example, may only achieve around 300MB/s using the above algorithm for a single multiply. CPUs without SIMD units may perform even worse.
(it's also worth pointing out that that computers with powerful CPUs are likely paired with fast drives, in which case, I/O is unlikely going to be a bottleneck in practice)

So despite the various tricks you can pull for a highly optimised/tuned implementation of GF16 multiplication, they come at a complexity cost, and there's still fundamental limitations, hence a faster erasure code is definitely welcome. Particularly, something that can grow at O(n log n) complexity, instead of O(n*n) would be useful, especially if we want to scale beyond 64k recovery blocks.

Just to make sure you're aware, I don't work on or maintain the par2cmdline codebase, so can't really comment much on your suggestions (hence I won't be able to do much regarding the comment you made in your latest pull request). I'm mostly just an interested 3rd party who works on my own PAR2 implementation (which has somewhat different aims to existing PAR2 implementations (e.g. one thing I wanted was to be able to process data from streams rather than be tied to on-disk files)). There isn't much development going on with par2cmdline these days though, so I would suspect that any contribution you're willing to make is valued. Your suggestions do sound like good changes nonetheless, and I appreciate you putting your time and effort into this.

there's a good chance that I'll introduce bugs to a long-run tool

All changes have such risk, but without changes there is no progress. You sound like you're fairly cautious already, so I wouldn't worry too much about it. All software has bugs - if one crops up, someone just needs to fix it.

Yutaka Sawada (MultiPar author) made a comment regarding your initial proposal. I've copy/pasted it below:

Thank you for notice. It's interesting, and I will help as possible as I can. MultiPar will support it, too.

Because I don't know about Tornado Codes, I cannot say it's good or not. But, "1.Reed-Solomon encoding with James S. Plank's correction" would be bad, because it requires inversion of matrix. When there are many blocks, inversion will become almost impossible. (from calculation speed and required memory size.) I prefer to use Cauchy Matric for quick calculation, which Mr. persicum suggested ago.

As I cannot explain the mathematics, I put some paper about the usage.

Filename: oninversionofcer00sche.pdf "On the Inversion of Certain Matrices" by Samuel Schechter, February 1, 1959 AEC Computin and Applied Mathematics Center Institute of Mathematical Sciences New York University

Filename: 0301097.pdf "The Efficient Implementation of Reed-Solomon High Rate Erasure Resilient Codes" by Jin Li Microsoft Research, Communication and Collaboration

Another issue of PAR2 is order of input files. Because filename affects the order of input files in current PAR2, it's impossible to change filename later, even when contents of files aren't changed. Now, a user needs to recreate whole PAR2 files after renaming a source file. Some users requested a feature of renaming filenames over MultiPar. Though I include a tool to rename filename in MultiPar package, it's out of PAR2 spec. If new PAR spec (maybe PAR3) solves this problem, it will be convenient.

Rhialto commented 5 years ago

On Tue 30 Apr 2019 at 07:35:08 -0700, Michael Nahas wrote:

create a par2 library and have par2cmdline process commandline args and then call the library.

That would be quite useful!

Currently there is a tool that I sometimes use (nget, an nntp downloader with automatic use of par2 files) which includes a copy of par2cmdline, but modified a bit. The tool is also abandoned, but I cloned it to gitlab: https://gitlab.com/Rhialto/nget .

It would be great if I could modify it to use a proper library (eliminating the copy and the annoying modifications).

mdnahas commented 5 years ago

Thanks @animetosho for the GBps / MBps explanation. memcpy is a good reference for speed.

And thanks for forwarding the comment on "matrix inversion" from Yutaka Sawada. As part of a job, I've actually tried to invert large matrices and I found that doing more than a 400x400 was impossibly slow. I'm not sure we should use the term "Reed-Solomon" if we're not actually doing Reed-Solomon. I've been thinking of renaming the class to "reedsolomon_broken".

I find it interesting that people are changing filenames after creating Par2 files. That isn't something I anticipated when writing the spec! Currently, the spec says that clients shouldn't try to verify how the "recovery set ID" is calculated. We could amend it to say that they shouldn't try to verify how "file ids" are calculated. We did an informal amendment to a version "2.1" which allowed clients to use UTF-8 in filenames (instead of pure ASCII).

@animetosho The streaming idea is interesting. Are you using the optional packets that allow you to include slices of the input file?

animetosho commented 5 years ago

Yeah, there are all sorts of things people out there want to be able to do. Anothing thing is being able to add files or update a PAR2 archive. Currently this requires recomputing the whole thing due to the strict file ordering determining the coefficients used for computation.
Not being limited to 32k input blocks/files is also a nice thing to have.

When you get down to it, there's a lot of things that Parchive could be doing but currently can't in PAR2, which is why I personally think a major overhaul of the format is warranted.

Are you using the optional packets that allow you to include slices of the input file?

I am not including parts of the source file in the PAR2 archive. Streaming is more in the sense of being able to pipe data in from memory, as opposed to using PAR2 as some on-the-fly FEC for lossy network transmission or the like (though you could do that too).

For example @Rhialto gives the example of a usenet downloader. Currently these write all downloaded files to disk, including PAR2 files, then perform verification/repair. If those could be performed somewhat on-the-fly, whilst the download is occurring, you can save time over doing separate processes, whilst also reducing disk I/O (and even avoid writing PAR2 files to disk entirely). Unfortunately, existing PAR2 implementations are too closely tied to the notion of on-disk files to be able to do this.
I also had a request to be able to have stdin directly piped in for processing, which is interesting because you don't know the size of it in advance, in addition to it being a stream.

mdnahas commented 5 years ago

If you can collect the use cases that Par2 doesn't work well in, I can take a look at how the design might change.

I think we can handle renaming files by just not requiring clients to confirm the File ID.

If we want to handle single-byte changes in files, like the proposed Par3 spec you forwarded, that does require big changes.

Do people really want >32,000 blocks? Each block adds overhead (CRC32 and MD5 checksum). I mean, it's very doable, but I'd want to shift to something like LDPC, which has less computation per block.

Streaming is possible with minor changes. You need some initial information to initialize the state in the reader. That would be (1) input file descriptions, checksums, etc. and (2) the number of recovery buffers to allocate and their coefficients. Then, as data was downloaded, you need to identify what input slice it was, write it to the correct place in the file, and subtract its effect from each of the N recovery buffers. Lastly, when all the data had been downloaded, you need to identify the K slices that need recover. If K < N, download the recovery data for K slices, add it to the appropriate recovery buffer, solve the matrix, and write out the recovered data.

If you assume the initial information is the no-recovery-slice Par2 file, then the only thing we are missing is the number of recovery slices and their coefficients. An adventurous client could guess those from the names of Par2 files. Or we could put that data in the Par2 files. That is, a new packet that is a hint to the repair client about the number of buffers to allocate.

Streaming would use a lot of CPU and storage unnecessarily, but it would be cool to try. I like that the downloading program would know exactly where to put the data blocks and the client wouldn't need the logic to rewrite the data files.

animetosho commented 5 years ago

I'm not sure about collecting use cases, but some ideas are listed here.
I think all text fields should be defined as UTF-8 (edit: looks like you've already done this), which makes the "unicode" (presumably fixed-length UCS-2, though encoders would likely use variable length UTF-16 regardless) packets redundant. Actually, I'm always already using UTF-8 for the "ASCII packets" anyway - despite what the standard says, many clients just use whatever the default system codepage is for these. If you have non-ASCII characters, the filename will be barfed whatever you do, but UTF-8 is generally more "kind" in these situations, and if the system codepage is UTF-8, it could just work anyway.

Do people really want >32,000 blocks? Each block adds overhead (CRC32 and MD5 checksum).

The overhead is relatively small for larger blocks, e.g. 1MB blocks. However, even at 1MB blocks, 32k means you can't have more than 32GB in a Parchive without increasing block size. Though not common, it's not too unusual to have >32GB files today. And if we're thinking about a format which is to last the next 10-20 years, it can feel a little limiting.
I suppose how large blocks should be does become a bit of an open question, but I'd like to think that the block size should increase at a slower rate relative to the amount of data typically covered by Parchive.

Streaming is possible with minor changes.

As you alluded to, it's already possible (most of the time) without any change, though the decoder may need to be a little clever about it.

Knowing how many recovery blocks are available may actually not be necessary, and in fact, may be largely irrelevant (at least for the downloader use case).
To expand on that, in most cases, K (damaged blocks) is significantly smaller than N (available recovery blocks), so it's somewhat of a waste of resources to compute on N recovery blocks during download. Now in a pure streaming situation, you don't have much choice, but for the download situation, you can always go back and compute any additional recovery you need, if there was more damaged blocks than you initially guessed.
This greatly reduces CPU/RAM usage (vs computing all recovery), and seems to be a practical optimisation for the use case. Particularly since downloading is often network bandwidth limited, followed by I/O, whilst CPU is often fairly idle; it makes some sense to expend some idle CPU resources to potentially speed up the process. For example, a downloader may choose to always compute 32 recovery blocks during download, regardless of availability; as the download is network bound, this only increases CPU usage a little, but doesn't otherwise impact download speed. When the download is complete, if there are fewer than 32 damaged blocks, they can be repaired without re-reading the downloaded files off disk. If there are more than 32 bad blocks, you'll need to re-read the files to compute all the needed blocks, but in this case, the previously computed blocks can actually be re-used to reduce the amount of computation needed.

The remaining issue is the recovery coefficients. These could be guessed from the filenames, as you pointed out, though you could also just guess the first few coefficients typically used as there's generally little reason that an encoder wouldn't use those.
If the assumptions don't hold here (e.g. no index file present, or unexpected coefficients), the downloader can just fall back to existing less efficient techniques. This is for the downloader case though, where pure streaming isn't a requirement, rather, it's only an optimisation.

Encoding in a streaming fashion is a little easier than verify/repair, since the create tool has full control over the process. The primary limitation here is being able to fit all recovery blocks in memory.

(note that the ParPar project wasn't aimed at changing PAR2, rather, at working within its bounds)

Yutaka Sawada added another note:

Though this isn't serious problem, checksum in PAR2 (CRC-32 and MD5) would better be changed in the future version. For example, BLAKE2 is faster and safer than MD5. CRC-64 is faster and has less collision than CRC-32, but it requires additional 4-bytes. Because BLAKE2 has no disadvantage, there is no reason to avoid it. Using MD5 for a new specification may be obsolate. Because CRC-64 is double size than CRC-32, it gives slightly larger recovery files. For current max 32768 blocks (or files), CRC-32 is enough. If new spec supports more blocks, CRC-64 will be preferable. (It depends on how many blocks.)

I put some paper of BLAKE2, CRC-64, and MD5's collision.

Filename: blake2.pdf "BLAKE2: simpler, smaller, fast as MD5" by Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Heam, Christian Winnerlein

Filename: nguyen_Fast_CRC_2012.02.20.pdf "Fast CRCs" by Gam D. Nguyen Naval Research Laboratory

Filename: md5_crc32_hack.zip "How to Break MD5 and Other Hash Functions" by Xiaoyun Wang and Hongbo Yu (I include sample data and PAR files to test.)

mdnahas commented 5 years ago

My opinion was that UTF-8 support was discussed and approved by Peter B. Clements and a bunch of us and approved. I may add that to the "spec" section of the new webpage as a 2.0.1 spec.

The list of use cases include: 1) streaming = generate recovery data while input is in transmission; use the recovery data to fix missing slices after transmission finishes. 2) easier single-pass file recovery (which is very similar to streaming) 3) incremental operation = ability to add source files not recalculate the entire recovery file 4) changing names of source files 5) support for more blocks And, of course, faster encoding and decoding. And a fix to Plank's Reed-Solomon encoding.

I'll have to think about "repair Par2 files" as a use case. I understand the desire for it. If you're transmitting files, you want to restore them to the original state before re-transmitting them. If you've done a backup, you'd like to restore every damaged disk image to the original state before re-burning it. I've usually assumed clients have the freedom to determine how many times to repeat the un-recoverable packets (and their placement in the files). This can probably be accomplished by a "suggested file layout" without a strict requirement.

I don't know about fixing small errors, like adding/dropping/changing a byte. I'm just not sure how to do that - especially adds and drops. It feels like a separate file format should be used for that.

I've already discussed how to stream with Par2. It takes some guesses, but I think it would work in practice. The only change I could even suggest was a packet that hinted about the exponents of recovery packets and, even then, it would only be a hint.

The big stumbling block is the changing filenames and incremental changes. The easiest fix here is just to open up more freedom in the Par2 standard. Do not require (always) for the FileIDs to be listed in sorted order in the main packet. We can consider if we also want to allow FileIDs or the RecoverySetID to be something other than its current definition, but changing the main packet order is necessary and sufficient. Those changes can done by a "Par version 2.0.2" amendment. And we could include the suggested file layout with it too.

There are reasons to go to a "Par3": more blocks, fixing Reed-Solomon, and replacing MD5 with a faster/better-suited hash. I think adding LDPC is the easy way to get to more blocks. And it would keep the Reed-Solomon matrix small, if we used Plank's correction which requires sloooow Gaussian Elimination. (I don't understand the FFT version yet.)

But simple changes to Par2 might get us a few more use cases for people.

animetosho commented 5 years ago

Yeah, those small changes to PAR2 could definitely open things up a little, and there's a good chance that existing PAR2 clients would already be fully compatible with them anyway.

I found a page where someone else was investigating a PAR2 successor, which may have some ideas+thoughts (haven't really read through it).

Some ideas I just thought up for a "PAR3" (may not be a good idea, just mentioning them):

filename compression: this probably will only be useful for those dealing with lots of small files, and probably not worth the complexity. But maybe if we have some directory structure, we can get a fair portion of the benefits of size reduction without much additional complexity (I recall that the full path needs to be specified for each file in PAR2, so by eliminating common path elements might be useful)
define how empty (0-byte) files are handled? Some clients place them in the recovery section, others place them in the non-recovery section. For the sake of consistency, it may be worth making a decision on that in the spec?
extensibility: I presume that this will be included anyway, but including algorithm indicators (like what erasure scheme is used, or what hash algorithm is used) can help extensibility. Another area may be allowing for optional fields, such as additional file properties or the like, should a use case ever come up, which can aid with new features and backwards compatibility
removing files from a Parchive? I imagine this to be similar to adding files, but this does introduce the notion of "holes" (missing input slices) in the input, which I don't think the current PAR2 would handle well

User maosredbrook gives the interesting idea of a RAR-esque usage of storing input files in Parchive with optional encryption. PAR2 already supports embedding input slices, though I'm not sure how useful that feature is. Another option is that, since decoders have to scan for packet headers, one can embed (concatenate onto the end) a PAR2 archive into another file. This can be used to, say, tack on recovery information to an archive which doesn't support repair (e.g. 7z), and the file still remains usable (as long as the application doesn't make assumptions on what's at the end of the file). "Extracting" the file would just be like stripping off the PAR2 packets off the end.
This embedding also allows an application to use some existing encryption technique without Parchive needing to support it (I don't feel that encryption should be part of Parchive).
If it's not there already, do you think it'd be worthwhile mentioning something about embedding in the Parchive spec?

Yutaka Sawada writes:

By the way, PAR2 might try to solve the max 32768 blocks (or files) problem. They are optional "Packed Main packet" and "Packed Recovery Slice packet". If new spec merges such idea in standard packet, there will be no limit of blocks (or files). For example, "Main packets" includes "Subslice size" always, as if it was a packed one. It needs to consider how are slice and subslice relation and usage.

Small improvement is better than nothing. I could not solve a dilemma; fast speed vs recovery performance. (For example, LDPC is very fast, but requires more recovery data.) If someone knows a problem of current PAR2, or comes up with a good idea, he may post his thought on the par2cmdline page. I expect that Michael Nahas will realize our hope.

My understanding of the packed slices feature is that it doesn't increase the number of available slices in any way, rather, the only thing it does is allow input files to be more tightly packed, since they don't have to be padded to a full slice (only to a subslice). If my understanding is correct, one could always just set the subslice to 4 bytes and gain maximum efficiency with packing files, with little downside (other than some additional complexity in decoders?). Adding subslice hashes could make the feature more useful as smaller erasures can be identified, and actually gives a penalty for using tiny (i.e. 4 byte) subslices. It's not quite the same as enabling more slices overall however (which can be more useful in cases when you're encoding input of unknown length).

mdnahas commented 5 years ago

The comment about removing files (and leaving some coefficients missing) is a good note. I hadn't thought of that.

0-length files are within the spec. Clients can put them either in the recovery set or the non-recovery set. Either way of doing them is fine. The non-recovery set was only added for backwards compatibility, so I'd prefer including them in the recovery set. The only bad way of handling them is the way that par2cmdline does - which is not at all. They get ignored!

As for compression, encryption, permissions, many files, etc., I'm an old UNIX junkie, so I think there are many other programs that handle that better. gzip, gpg, tar, etc.. I'd rather have someone run many tools that do each thing well, rather than build it all into Par.

I really wished some client had implemented the "input slices" optional feature. Many large files are already compressed. Many posters on Usenet were using RAR to split up a file and not for compression/permissions/etc.. I was really hoping a Par client would eliminate the use of RAR. But, as before, I wouldn't include encryption. There are other programs that do that better. ... if you we do a Par3, what do you think of making that a required packet type?

animetosho commented 5 years ago

Many posters on Usenet were using RAR to split up a file and not for compression/permissions/etc

Which in my opinion, is completely pointless. No-one has actually been able to give me a sound explanation of why files need to be split at all, as yEnc already splits up files into articles, and another layer of splitting seems completely unnecessary.

Splitting can have uses if putting files on fixed size media (CD/DVDs) and the like, but I don't see any requirement for using RAR to do that, particularly if you're not compressing.

I think the Usenet case is more due to legacy and users not updating their knowledge or processes, rather than anything which still makes sense today, on a more technical level.
RAR, however, is used for obfuscation/encryption purposes, in which case, embedding files in PAR2 may not address the needs of uploaders.

if you we do a Par3, what do you think of making that a required packet type?

I suppose the packet is useful if someone needs it, but I'm struggling to see an actual use case where it's particularly beneficial.

Embedding source files turns Parchive into an archive format, at which point, you're competing with TAR/ZIP and the like for functionality, without many of its features (timestamps, permissions etc).
I think it would be more fruitful to promote embedding Parchive into other formats instead, in which case, Parchive just adds redundancy to any arbitrary file. For example, 7z archives don't support recovery, but if one appends a few PAR2 packets onto the end, they could protect against corruption in the file. If there's no corruption, the 7z archive is usable as is.
This is the benefit of embedding Parchive into other files, rather than embedding files into Parchive - the original files are usable without the need for extraction (assuming that the application isn't too strict about what's at the end of a file).

Various other archive formats do support embedding as well (e.g. for self-extracting executables), but because PAR2's input file slice packet requires the file to be stored non-contiguously (broken into slices), it won't be recognised without an application extracting the files from the PAR2 itself.
One could always just make up a new packet type and embed the whole file contiguously to enable this sort of embedding though.

mdnahas commented 5 years ago

For anyone tracking this thread, I just pushed a HUGE change. It creates "libpar2", a library that anyone can include in their code. The commandline program, "par2", just calls the library.

Pull request is here. With a description of all the changed. It fixes a number of bugs (including one that didn't match the spec!) and minor features.

I only have two computers to test on and neither are a Windows machine. I'd love if everyone could download the code and make sure it compiles on your machine.

animetosho commented 5 years ago

Nice work!

mdnahas commented 5 years ago

I'm done all changes to this pull request. It's ready to merge, if someone can run it with Microsoft Visual C++ and run the tests on Windows.

Pull Request

animetosho commented 5 years ago

Compiling seems to fail on MSVC 2017 with linker errors. Adding src/libpar2.cpp /.h to the project (par2cmdline.vcxproj) fixes this and it compiles fine.
I'm not sure how to run tests on Windows, as they seem to be shell scripts. I tried running a chain of tests/test1 && tests/test2 && ... via an MSYS bash shell, and it seems to run them all, though I'm not sure if the tests ever return a non-zero return value on failure.

mdnahas commented 5 years ago

@animetosho Thank you for running it!!

@animetosho Did the last test run? The && operator will only run the program on the right if the program on the left exits with code 0. Maybe you could do: tests/test1 && tests/test2 && ... && tests/testN && echo SUCCESS and check if it prints out "SUCCESS".

I'll see what I can do to the unit tests to make it clearer which ran and success/failure.

Can you attach your par2cmdline.vcxproj, so I can add it to the fork? I'm assuming it is just adding 2 lines in the obvious location, but that's just a guess. If you can't attach the file, commit it locally, run "git format-patch -1 HEAD" and then post the contents of the patch it generates.

animetosho commented 5 years ago

All the tests did run through like that, it's just that I don't know whether a failing test still returns 0 or not.

I've attached the project file. It just adds the two files, and doesn't attempt to, e.g. build a separate library and CLI executable. par2cmdline.vcxproj.gz

Compile log:

1>------ Build started: Project: par2cmdline, Configuration: Release Win32 ------
1>C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\Platforms\Win32\PlatformToolsets\v141_xp\Toolset.targets(39,5): warning MSB8051: Support for targeting Windows XP is deprecated and will not be present in future releases of Visual Studio. Please see https://go.microsoft.com/fwlink/?linkid=2023588 for more information.
1>commandline.cpp
1>z:\par2cmdline-library_dev\src\commandline.cpp(173): warning C4244: 'argument': conversion from 'u64' to 'u32', possible loss of data
1>z:\par2cmdline-library_dev\src\commandline.cpp(1268): warning C4244: 'argument': conversion from 'u64' to 'u32', possible loss of data
1>crc.cpp
1>creatorpacket.cpp
1>criticalpacket.cpp
1>datablock.cpp
1>descriptionpacket.cpp
1>diskfile.cpp
1>filechecksummer.cpp
1>galois.cpp
1>libpar2.cpp
1>mainpacket.cpp
1>md5.cpp
1>par1fileformat.cpp
1>par1repairer.cpp
1>par1repairersourcefile.cpp
1>z:\par2cmdline-library_dev\src\par1repairersourcefile.cpp(50): warning C4244: 'argument': conversion from 'u16' to 'char', possible loss of data
1>par2cmdline.cpp
1>par2creator.cpp
1>z:\par2cmdline-library_dev\src\par2creator.cpp(331): warning C4244: 'argument': conversion from 'u64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2creator.cpp(345): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2creator.cpp(355): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>par2creatorsourcefile.cpp
1>par2fileformat.cpp
1>par2repairer.cpp
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1220): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1327): warning C4244: 'argument': conversion from 'u64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(1333): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2649): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2650): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>z:\par2cmdline-library_dev\src\par2repairer.cpp(2658): warning C4244: 'argument': conversion from 'i64' to 'const unsigned int', possible loss of data
1>Generating Code...
1>Compiling...
1>par2repairersourcefile.cpp
1>recoverypacket.cpp
1>reedsolomon.cpp
1>verificationhashtable.cpp
1>verificationpacket.cpp
1>Generating Code...
1>par2cmdline.vcxproj -> Z:\par2cmdline-library_dev\Win32\Release\par2.exe
1>Done building project "par2cmdline.vcxproj".
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

mdnahas commented 5 years ago

I merged your Microsoft project file. I fixed the code for the warnings - one was a bug I had introduced.

The && operator in bash checks the exit code. It will only run the code on the right if the program on the left returns 0. (Which is what each test should do if it passes.) So, "echo SUCCESS" is only run if every program to its left had an exit code of 0. (If you want to run two command sequentially and always have the second run (independent of the exit code of the left one), I think you use the semi-colon ; operator.

mdnahas commented 5 years ago

I started looking into Tornado Codes. I think there is an error in the paper which affects the code's performance. That is, the codes still work, but a typo makes the codes look much better than they are in practice.

The error is on page 7 of the Paper It's in the 4th paragraph of Section 7, which starts "The left degree sequence..."

You start by choosing d. The higher the value for d, the more space-efficient the algorithm, but d+1 must be less than n, the number of message symbols (a.k.a. input slices). In the paragraph, he defines H(d) and says it approximates ln(d). Then he gives the formula for lambda where lambda_i is the portion of message symbols (input slices) that are associated with i check symbols (redundancy slices). So, lambda_2 might be 20%, which means 20% of input slices can be recovered by exactly 2 redundancy slices. The formula for lambda_i is 1/(H(d)(i-1)). All that looks correct.

The problem is the next sentence, "The average left degree a_l equals H(d)(d+1)/d." I think that is wrong. The formula to compute the average is: sum from 2 to d+1 of i lambda_i.
= 21/(H(d)(2-1) + 31/H(d)(3-1) + ... + (d+1)1/(H(d)(d+1-1) = (2/1)/H(d) + (3/2)/H(d) + ... + ((d+1)/d)/H(d) = (1+1 + 1+1/2 + 1+1/3 + ... + 1+1/d)/H(d) = (1+1+1+...+1 + 1+1/2+1/3+...+1/d)/H(d) = (d + H(d))/H(d) substituted the definition of H(d) = 1 + d/H(d) which is definitely not equal to what Luby has, H(d)(d+1)/d.

When d=100, H(d) is about 5.18. Luby's value is 5.24. My calculation is 20.28. For larger values of d, the separation gets wider. For d=10000, Luby's value is 9.79 and mine is 1022.

Why is this important? Because the average left degree has to equal the average right degree and the average right degree is basically "how many input slices are used to create each recovery slice". If we want Par3 (or Par4) to be disk efficient, we want a large value for d. But if we want it to be fast, we also want each recovery slice to be computed from only a few input slices. And if my formula is correct and H(d) is approximately ln(d), then that value goes up with d/ln(d), which is a big value.

I've included my code in this post. At the start is a big comment describing each variable name in the paper. That may help if you try to read the paper. The actual code to compute H, lambda, and the average is about 60 lines.

tornadocode_standalone.cpp.txt

I'll spend another day looking at Tornado Codes, but we might look into other LDPC codes.

mdnahas commented 5 years ago

I was wrong. The paper used a strange notation. lambda was not the portion of nodes with out-degree i, but the portion of EDGES that are connected to nodes with out-degree i. So I was multiplying by i when I should have been dividing.

For 1000000 input slices, the average number of input slices used to compute each recovery slice is 14. Which is the nice low value that we want!

tornadocode_standalone.cpp.txt

mdnahas commented 5 years ago

I've been looking into Tornado Codes and it doesn't look good. I'm generating ones with 100 to 32000 inputs and 20% to 5% of recovery data and seeing how often they can recover the original file.

For 100 input blocks and 20% recovery data (20 parity blocks), in 1% of cases we cannot repair 3 or fewer errors. Some graphs encountered errors where they could not repair 1 block. On average, we can repair 13 errors, which is far fewer than 20.

For larger numbers, I looked at 10,000 input blocks with 5% recovery data (500 parity blocks). In 1% of cases, we cannot repair 84 errors. (The worst I saw was 42 errors that it could not repair!) On average, we can repair 415 errors.

I've seen a paper that tries to fix some of these problems with "small" graphs, by making sure every input has trivial coverage and that parity blocks don't overlap. But I also found a paper that says Tornado Codes have a high overhead until you get to above 100,000 blocks. James S. Plank paper on LDPC with small graphs

I had already been thinking that Par3 (or whatever the next version is called) should support all linear codes. That is, any matrix of inputs to recovery blocks using GF(16 bit). The nice thing about that approach is that it would support Tornado Codes, Reed-Solomon Codes, and some others. I'm now thinking that we may need to find our own style of linear code. Tornado Codes don't do well with "small" numbers of input blocks because they use XOR. We might be able to get some better performance using GF(16 bit) operations. I haven't seen any paper doing that. So, this may involve some research.

I've attached my test program, if you want to see what I'm doing. "a.out 10000 5 100" creates a graph with 10000 inputs, 5% recovery data, and simulates recovering damage 100 times.
tornadocode_standalone.cpp.txt

mdnahas commented 5 years ago

This might be the paper we want. It says that an n-by-n random matrix where each row has log(n) non-zero elements usual has rank n minus a constant. And if we use GF(q), the probability that the rank is less than n-k is some constant times 1/q^k.
Paper

In fact, we're probably using this theorem right now. Because of James S. Plank's mistake, we aren't using a Vandermonde matrix but some other matrix. That is, a random-ish matrix with at least log(n) non-zero values per row.

My thinking is that we can build recovery out of random matrices that are n-by-n whose rows have log(n) non-zero elements, which have rank n-k, and add a few rows that have n elements, derived from the Vandermonde matrix. Then we'll only have wasted k parity blocks and gotten n/log(n) times faster.

This doesn't avoid the fact that inverting a large matrix is expensive. Tornado Codes do nicely because their recovery mechanism only looks at one recovery block at a time. I'll see what I can do. Perhaps we can build a hierarchy of small random matrices.

animetosho commented 5 years ago

For those following along, I noticed that Yutaka posted another comment a while ago:

I came up with a simple method to construct non-random generator matrix. It's based on Reed-Solomon Codes, but acts as non-binary LDPC. I feel that it may be good for our Parchive usage. But, I'm not sure that it's good or not. I put the example of my idea on HTML format here for future note.

Split_Reed-Solomon.htm

mdnahas commented 5 years ago

I'm still looking at Parchive Version 3. I hit on a crazy off-the-wall idea.

I decided to look at other file formats. Wikipedia has a list.

Many are proprietary or single-platform, but a few are not. An interesting one is Java Archive files with the ".jar" extension. That format, if you don't know, is just a ZIP file. It just has a different extension and the only difference is that it contains specific files (e.g., "META-INF/MANIFEST.MF"). I had the crazy idea that it might be useful if Parchive Version 3 also fit another file format. A likely candidate is "tar". Tar file format

The tar file format consists of blocks of 512 bytes. The original tar file had 3 kinds of blocks. Header blocks contain the file name, file size, file type, permissions, owner, etc.. Data blocks hold the content of files. The last kind of blocks are end-of-file indicators and are just filled with zeros.

The tar file format was upgraded in 1988. The new USTAR file format added a magic value ("ustar\000"). It also changed the file type ("normal", "hard link", or "soft link") to support 256 types of header blocks and reserved some for custom extensions. We could put Parchive data either into files with special names or into header blocks marked with a custom extension.

I'm definitely saying this is a crazy idea. But I found out that "tar" doesn't contain checksums for its files, only a checksum for the data in the header block! You would think that a file archiver would do that, but it doesn't. "tar" is out of date. Reusing tar's format provides an easy way to include file permissions and file data into the Parchive format. Another benefit of reusing the tar format (especially if we did it using files with special names) is that we can debug using the tar program and anyone who doesn't have access to install a Parchive 3 client could still get access to files using the common utility, "tar".

There are many arguments against this idea. I've always thought of Parchive as a unix-pipeable program. I always thought of users doing something like "tar files* | compress | par > backup.tar.Z.par". Reusing the tar format doesn't really make sense in that sequence, because par really should only be focused on one thing, redundancy, and other programs like "tar" should be focused on archiving (grouping files, storing permissions, etc.).

The other argument is that we don't plan on implementing all of "tar"'s file format right away. It adds a lot of headaches for client writers. We might mess up some detail of "tar" and end up with a wacky file format that isn't actually compatible. We should specialize in redundancy, which we're good at, and leave "tar" people to fix their own problems.

I've pretty much argued myself out of not using the tar file format. It is far easier for Parchive version 3 to just reuse the message format that I designed for version 2. But the tar format is an intriguing option and I thought I should put the idea out there and see if it sparked any discussion.

zvezdochiot commented 5 years ago

@mdnahas say> I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.

https://github.com/madsen/vbindiff

diff <(xxd origin.par) <(xxd new.par)
cmp -b -l origin.par new.par

animetosho commented 5 years ago

Doesn't really sound crazy - in fact, a key reason for using RAR is that it's an archive format which has in-built error correction, and is convenient to use.

A key problem with forcing an archive format is that not everyone wants to embed the original files with the recovery data.
As for TAR, it caters for the Unix crowd, but isn't as nice on Windows. The other issue is that TAR is often compressed afterwards, but that would mean the recovery data would need to be compressed as well, which is very likely undesirable (can't get to recovery data if compressed data gets corrupt).

I always thought of users doing something like "tar files* | compress | par > backup.tar.Z.par".

There are downsides to this notion of "do only one thing". A common issue with .tar.gz, for example, is that you often need to decompress the whole thing even if you only want to extract a single file. On the other hand, combining the archive and compression into a single format, such as ZIP or 7Zip, may allow one to get a file listing or extract select files without needing to decompress the whole archive, because it has the flexibility of applying compression selectively.

Parchive may not suffer quite the same issue, as, if you have data corruption, it's probably more acceptable for recovery to take some time. But it may be argued that deeply embedding into a format, rather than being a standalone thing can have its benefits.

My opinion is still encouraging the notion of an embedable format. That is, Parchive doesn't specify any archive format, but supports the notion of files being embedded into archives. To use your tar/compress example, perhaps:

tar | gzip > output.tar.gz
par --append output.tar.gz
# or maybe
tar | gzip | par --append > output.tar.gz

(I'm not sure how well GZip handles having arbitrary data appended, since it does have a footer, but I suspect many implementations handle it fine)
For formats which don't handle appended data well, but do support some way of embedding arbitrary data, embedding could be enabled via explicit means in the format, e.g.

some_archiver create files.archive
# write PAR data to some 'comment' or unused packet type
par --embed-as-some_archiver-packet files.archive

This handles a number of issues:

PAR still only has to care about recovery
doesn't force an archive format which may be suboptimal for one's OS
supports the notion of archive format, plus standalone
simple to implement

PAR2 probably already supports this, as I suspect clients have to scan for recovery packet headers anyway, so it probably adds no extra complexity. The only thing I'd change is perhaps explicitly add a note to the specification that embedding is a potential use-case for Parchive. par2cmdline could also support something like an --append flag, and maybe even support creating TAR archives with embedded recovery data (meaning that a particular archive format isn't really officially supported, but rather exists as a convenience function in a client).

Another idea may be to have some sort of zlib like library which makes it easy to embed recovery data into archive applications (or protocols or other applications), and hope archivers adopt recovery support. I feel that this may be wishful thinking though, both in terms of adoption, and technical difficulties of trying to allow this to work in a nice, simple and efficient streaming manner.

https://github.com/madsen/vbindiff

Packets in PAR2 are strictly defined, but the ordering, or how much duplication is done, isn't. This means that you'd want a specific tool built for PAR2 comparisons, rather than a generic binary compare tool.

mdnahas commented 5 years ago

For verifying par2cmdline output, I used the Unix command-line utility "cmp". Since I was comparing output from the same program (before and after my changes), it worked fine.

I'm going to define some things:

an "archiver" converts file owner, permissions, and other metadata into a data file.
a "checksummer" adds error detection
a "grouper" puts many data files into one file
a "compressor" shrinks a file
an "encrypter" provides security
our "ecc-er" adds redundant data. (ECC = Error Correcting Code)
a "splitter" break many data files into multiple equal-sized files

Many programs do multiple things. So, tar is doing "archive" and "group". Par2 currently does "checksum", "group" and "ecc" Some programs do only one thing. For example, gzip and bzip2 only do "compress".

It's clear that "archive" needs to be done last during decode and, therefore, first during encode. "archive" is also OS-specific. So "tar"'s archives store Unix-style permissions and owners, while you need something else for Windows.

The step before "archive" on the decode side has to be "checksum". This is so any bug in any other step is detected. It can't be after "archive", because then errors to the metadata aren't detected. (It might be good to have 2 checksums and checksum the metadata before the "archive" decode and another for data after "archive" decode.) If it is a single checksum, on the encode side, "checksum" is the second step.

It's not clear which order "compress" and "group" come in. Different files do better under different compression algorithms (or no compression). But, that's also the case for different parts of a file. (E.g., the code part of a binary could use a different algorithm than then data part.) If you want to do random access (e.g., extract a single file), it's probably best to "compress" individual files and, on the encode side, "group" after "compress".

When you "split" files, it's probably best to do that step last. It's usually tied directly to a storage or transmission limitation.

The "ecc" is probably best to do before "split". As close to when the file is stored or transmitted.

Some formats do try to combine all of these steps. RAR is proprietary. ZIP is open and does everything, except "ecc". (Wikipedia seems to disagree if 7z does or doesn't archive.)

Since "archive" is the first step and "eec" is near the end, it probably doesn't make sense to combine them. We do some form of "split", but that could be done better.

ZIP's format does allow data to be stuck in the middle. ZIP files usually have the data files at the front and the directory structure at the back. It shouldn't be too hard to insert recovery data in front of the directory structure.

P.S. After writing this post, I'm damn tempted to redo all the steps above using a simple JSON + binary format for checksums, archiving, splitting, grouping, ...

zvezdochiot commented 5 years ago

@animetosho say> Packets in PAR2 are strictly defined, but the ordering, or how much duplication is done, isn't. This means that you'd want a specific tool built for PAR2 comparisons, rather than a generic binary compare tool.

Maybe:

diff <(xxd -p -c 1 origin.par) <(xxd -p -c 1 new.par)

mdnahas commented 5 years ago

I'm close to writing the specification for a new version of PAR. I'd love everyone's thoughts on this design.

The goals would be: 1) support very large data sets (> 1TB) 2) support new Low Density Parity Check codes, like Tornado Codes 3) support Reed-Solomon correctly 4) support adding and removing files 5) support filename changes I have thought a little about streaming and single-pass encoding. I'd love to hear from client authors about where the problems are and what changes might enable this.

The "big idea" for Par3 would be to support any linear code. Linear codes are where the recovery blocks are the sum of the input blocks times factors, using Galois Field arithmetic. So: recoveryblock_i = a_i,0 input_0 + a_i,1 input_1 + a_i,2 * input_2 + ... Where "recoveryblock_i" is the ith recovery block, "input_j" is the jth input block and "a_i,j" is the Galois Field factor used to multiply input_j when generating recoveryblock_i.

Linear codes include Reed-Solomon code, Tornado Codes, and others. A particular code is determined by the factors. But all can be decoded using the same algorithm, Gaussian elimination. By supporting any linear code, we allow the clients a lot of freedom in how to encode the Par3 file.

One code I like is a random sparse matrix. Each input's factor would be 0 most of the time, but have log(n) places where it is non-zero. This would be fast to encode and almost as good as Reed-Solomon. (Par2's current algorithm is not Reed-Solomon, but a random-ish dense martix.)

I would like to use Tornado Codes, but they have some problems. See: "On the Practical Use of LDPC Erasure Codes for Distributed Storage Applications" by James S. Plank and Michael G. Thomason "Tornado Codes for Archival Storage" by Matthew Woitaszek "Matlab Implementation of a Tornado Forward Error Correction Code" by Alexandra Noriega Of particular interest is Figure 5 of Plank and Thomason. It shows that Tornado Codes work well when the number of input blocks is over 100,000, but not when the number of input blocks is 1,000.

I've attached my code for generating random Tornado Code graphs and evaluating them. Tornado Codes's factors are either 1 or 0. That is, the recover data is generated by XORing a handful of input blocks together. My C++ code also evaluates the case if, instead of being restricted to 1 or 0, the factors are random value from a Galois Field. It makes the encode/decode time longer, but more missing slices can be recovered. It might be useful when the number of input blocks is less than 100,000.

tornadocode_standalone.cpp.txt

For Par3, I'd like to keep as much the same as Par2 as possible. We already have working, tested code for Par2. So, I'd like to keep Par2's packet format. Also keep using MD5 and CRC32. I think we can keep GF16. (Although we could revert to GF8 if we wanted.) We can probably keep the same File Description packet and the Input File Slice Checksum packet.

To support every linear code, the new Recovery Slice packet will need to contain the recovery slice data, the list of input blocks used to compute it, and the GF factor associated with each input block. Depending on how we list the input blocks and the factors, it could consume a lot of space. For example, if we did a Reed-Solomon implementation with 10,000 input blocks, each Recovery Slice packet would have to include the 10,000 factors used to generate the recovery data. For Tornado Codes, which averages something like 12 input blocks for each output block, it would be much smaller.

We need to figure out how to identify the input blocks. We could identify each factor with a FileID and offset within the file. That is, there would be no Main Packet and each Recovery Slice Packet would refer directly to each file using the FileID. That would take up a lot of space: 16 bytes for the FileID, 8 bytes for the offset, and 2 bytes for the factor. 26-bytes times 10,000 input blocks would be 260,000 bytes of overhead for each recovery slice. That's a lot.

That can be shrunk, for example by creating a new Main Packet, but it can only be shrunk so much. If we want to support every linear code, we need (at the very least) a factor associated with each input slice. That's 2 bytes, so we would have 20,000 bytes of overhead for each recovery slice. That's significant, but livable.

I expect the default usage will be the sparse random matrix with 256 recovery blocks. If the input file contains 10,000 slices, that would mean 10,000*log(256)/256 factors per recovery block or 313 factors per recovery block. If we kept the large 26-bytes per factor, we'd have 8138 bytes per recovery block. The minimum of 2-bytes per factor would be 616 bytes.

For Par3 to support adding and removing files would be simple. The removed FileIDs would just not show up in the list of FileIDs used by Recovery Slice packets. If a file was modified, it would need to have a new FileID. It could still share input file slices with the older version of itself (assuming the data was aligned on block boundaries).

Supporting filename changes would be harder. We could say that the FileID would remain the same --- there would just be a new File Descriptor packet. We'd need a tie-breaker if there are two File Descriptor packets with the same FileID and different names.

I'd love everyone's feed back:

What do you think of the goals? Should we try to support streaming/single-pass? Are there any applications that I missed?

What do you think of supporting every linear code? Is the flexibility worth the overhead?

What do you think about keeping Par2's packets? MD5? CRC?

Is 26-bytes per factor too much overhead? What amount of overhead do you think users will accept? (Remember, users can always increase blocksize to lower overhead but reduce the granularity of errors we recover from.)

How do you think this would work on DVD collection that has 1TB of data in 200 files? How do you think this would work on a photo collection that has 300GB of data in 100,000 files? How do you think it will work for someone who created a Par3 file for his backup last month and runs it again this month?

trapexit commented 5 years ago

If there weren't any file size or count limits then one could use it similar to SnapRaid. I've been looking for a solution to managing bitrot in large media collections and started playing with building a tool to help with the workflow around creating, updating, maintaining par2 files for such purposes but a number of limits of the par2cmdline tool get in the way. The filename limits as well as the ability to select where the output is placed.

mdnahas commented 5 years ago

I agree par2cmdline's args has some limitations. I saw that when I cleaned up the code. I decided not to fix that, but at least added unit tests so it could be fixed.

Par3 would not be a full archive solution. It needs at least some other programs, like "tar" or "zip" to handle the other steps. (Above, I listed the steps as "archiver", "checksummer", "grouper", "compressor", "encrypter", "ecc-er" and "splitter".)

BTW, I forgot to say that, if we support every linear code, then Par3 can be used as a splitter. If you want to put the data from a single input block, you just create a Recovery Slice packet where all input blocks have a factor of 0, except one which has a factor of 1. Then the Recover Slice packet's data is the same as the input block.

animetosho commented 5 years ago

Supporting any code is an interesting idea, although I'm trying to think of how useful it'd be to end users, more specifically, how it'd be exposed to them. Perhaps some option to choose between speed and recoverability?

For Par3, I'd like to keep as much the same as Par2 as possible

From the description, this sounds like PAR2 + support for arbitrary matricies, and the ability to edit the archive a bit. Is this correct?

Also keep using MD5 and CRC32

I'm not sure what the exact purpose of MD5 was, but in 2019, I'd have thought something like BLAKE2 would be more attractive, faster (at least for single buffer) and more secure. Or perhaps even the SHA* family.
Probably would select CRC32C over CRC32 due to better detection properties. I don't know if CRC64 is worth it.

I imagine changing these shouldn't be hard, as there's already various implementations for these out there, and the MD5/CRC32 code is probably 3rd party in various clients anyway.

I think we can keep GF16. (Although we could revert to GF8 if we wanted.)

If you're going to introduce such a flag, how about covering GF32, maybe even GF24 or the like (e.g. a 2-bit "word size" value)?

That would take up a lot of space: 16 bytes for the FileID, 8 bytes for the offset, and 2 bytes for the factor. 26-bytes

I assume 28 bytes, if you're going to keep the PAR2 requirement of 4 byte alignment?
26 bytes per input per recovery block does sound unacceptable to me, if we're talking many input blocks. For your example of 10k input blocks, that's 260KB per recovery block, and assuming 10% redundancy, 260MB overhead, just for this metadata.

That's 2 bytes, so we would have 20,000 bytes of overhead for each recovery slice

Perhaps some compressed representation could work. For dense matricies, you could have some flag which indicates all input blocks are used, meaning no (or little) overhead per recovery slice. Maybe the flag could instead be an include/exclude toggle (i.e. 0=include these inputs, 1=include all inputs except these), which would limit your example scenario to 10KB in the worst case.
I imagine for typical cases, if not using a dense matrix, there won't be many inputs for a recovery slice, so this keeps the overhead low.

The removed FileIDs would just not show up in the list of FileIDs used by Recovery Slice packets.

Would this require rewriting much of the PAR file on disk, or would the format support having gaps or "invalid File IDs"?

We could say that the FileID would remain the same --- there would just be a new File Descriptor packet

If File IDs are no longer tied to input slice coefficients, the only purpose they seem to serve is to link the FileDescriptor and InputChecksum packets.
From the PAR2 specification, the only other references are the Unicode filename packet (should hopefully be redundant in PAR3) and embedding input slices into the PAR file itself or having external recovery data (I haven't seen a single PAR2 client implement these, and can't really see the use case for such, hence I'd have no hesitation in dropping the feature).

Considering that the FileDescriptor and InputChecksum packets exhibit a 1:1 relationship, how important are File IDs really? I mean, you could feasible combine the two packets into one and eliminate the need to link the two together, meaning that File IDs aren't even needed (assuming you don't want to retain support for embedding input slices and the like). I'm not too sure about the non-recovery set, and whether they need a InputChecksum packet (PAR2 spec doesn't seem to say anything about that) but I'm sure it could be flagged somehow.

Of course, input slices would still need to be identified, and you'd need to link them back to an input file (or even >1 files?), but a small integer may be sufficient for that purpose.

I have thought a little about streaming and single-pass encoding

For create, I think clients have a lot of freedom, so it's not too difficult. The only thing that occurred to me is needing to know the file size up front, as streams often don't provide that info. Because, in PAR2, it's part of the File ID, and File IDs have to be ordered to determine the input coefficient, you can't really do any calculations without knowing the size (exception: if there's only 1 file).
Since this proposal allows coefficients to be arbitrarily chosen by the encoder, it effectively breaks the dependency, so makes it a non-issue.

I haven't thought too much for decoders, but it doesn't seem good. Since the input coefficients are stored with the recovery slices, the decoder would effectively need to know all such coefficients used, from all the recovery slices it'll need, before it can even start doing any computation. It can't simply just grab some metadata packet from an index file (to know the slice size and checksums to detect corruption) - it would need to somehow grab and buffer a particular amount of recovery data.
Maybe if the coefficients are stored in a separate packet, perhaps in the index file, it might be easier?

In any case, I'm not sure if anyone's interested in building a streaming decoder, though it might be nice for the format to support it, as I can see a benefit in doing so.

How do you think this would work on DVD collection that has 1TB of data in 200 files?

Assuming a maximum of 65536 input blocks, each would have to be at least 16MB, assuming the collection is never updated or added to. If we're going to support adding files, you'd probably want to use much larger block sizes to accommodate modifications, which may be slightly unwieldy in some situations, but probably not too big of a problem for this use case.

How do you think this would work on a photo collection that has 300GB of data in 100,000 files?

If we're limited to 65536 input blocks, a single block would have to cover multiple files for this to work. PAR2 only supports this via "packed slices" which no client, that I've seen, suports.
Does PAR3 change this, or perhaps relax the need to align files to slices? I presume that if you want to support updating, you'd want files to be aligned to slices to some extent (though you could be more liberal about it, e.g. align every, say, 1GB or so).

Thanks for keeping up with this by the way!

mdnahas commented 5 years ago

Supporting any code is an interesting idea, although I'm trying to think of how useful it'd be to end users, more specifically, how it'd be exposed to them. Perhaps some option to choose between speed and recoverability?

Yes, an option would be available. Mostly, putting this flexibility in the standard prevents us from having to issue PAR4.

For Par3, I'd like to keep as much the same as Par2 as possible

From the description, this sounds like PAR2 + support for arbitrary matricies, and the ability to edit the archive a bit. Is this correct?

Yes.

Also keep using MD5 and CRC32

Yes.

I'm not sure what the exact purpose of MD5 was, but in 2019, I'd have thought something like BLAKE2 would be more attractive, faster (at least for single buffer) and more secure. Or perhaps even the SHA* family. Probably would select CRC32C over CRC32 due to better detection properties. I don't know if CRC64 is worth it.

If we were starting from scratch, I'd choose something other than MD5 and CRC32. But, since we already have that code working, I don't see huge wins for changing it.

I imagine changing these shouldn't be hard, as there's already various implementations for these out there, and the MD5/CRC32 code is probably 3rd party in various clients anyway.

I think any change is work and can introduce bugs. I don't see a big win for changing them.

I think we can keep GF16. (Although we could revert to GF8 if we wanted.)

If you're going to introduce such a flag, how about covering GF32, maybe even GF24 or the like (e.g. a 2-bit "word size" value)?

I'm not introducing a flag. I'm saying GF16 will be it. Unless you can tell me there is a big win with GF8.

This paper says GF8 and GF16 are about the same speed. (Although GF8 may be easier to implement.) And I think GF16 is better because it increases the probability of recovery.

That would take up a lot of space: 16 bytes for the FileID, 8 bytes for the offset, and 2 bytes for the factor. 26-bytes

I assume 28 bytes, if you're going to keep the PAR2 requirement of 4 byte alignment? 26 bytes per input per recovery block does sound unacceptable to me, if we're talking many input blocks. For your example of 10k input blocks, that's 260KB per recovery block, and assuming 10% redundancy, 260MB overhead, just for this metadata.

That's 2 bytes, so we would have 20,000 bytes of overhead for each recovery slice

Perhaps some compressed representation could work. For dense matricies, you could have some flag which indicates all input blocks are used, meaning no (or little) overhead per recovery slice. Maybe the flag could instead be an include/exclude toggle (i.e. 0=include these inputs, 1=include all inputs except these), which would limit your example scenario to 10KB in the worst case. I imagine for typical cases, if not using a dense matrix, there won't be many inputs for a recovery slice, so this keeps the overhead low.

Recovery requires inverting a matrix and the basic algorithm for that has O(N^3) runtime. In practice, I had trouble inverting square matrices with more than 400 rows and 400 columns. So, I expect, for both sparse and dense matrices, to be limited to <512 recovery blocks.

Yes, I hope for typical cases, there will be a sparse matrix with lower overhead. I expect each input block to be used log(NumOfRecoveryBlocks) times. So, at most log(512) = 9 times.

For dense matrices, yes, the overhead will be significant. But if they want full recovery, they'll be limited by GF16 to 2^16 input blocks. In all other cases, they'll have to increase the blocksize. So, the maximum the overhead will be is 2^1651228 bytes = 460kB. It's large but small for most disk-sized usages.

The removed FileIDs would just not show up in the list of FileIDs used by Recovery Slice packets.

Would this require rewriting much of the PAR file on disk, or would the format support having gaps or "invalid File IDs"?

Say, the original recovery set is files A and B. The user generates First.par3. Then, the user deletes file B and adds file C. And they generate Second.par3.

If the user tries to recover with just Second.par3, they can recover A and C. If they run with both Second.par3 and First.par3, they have additional information about file A and may be able to recover more of it. (They also have information about file B, which they may or may not care about.)

We could say that the FileID would remain the same --- there would just be a new File Descriptor packet

If File IDs are no longer tied to input slice coefficients, ... ... Of course, input slices would still need to be identified, and you'd need to link them back to an input file (or even >1 files?), but a small integer may be sufficient for that purpose.

This gave me a different idea. We could identify the inputs blocks used by each recovery block using the MD5 hash of the input block. Those are in the Input File Checksum Packet. They're shorter than a FileID + offset. And they don't change if you change filenames.

I have thought a little about streaming and single-pass encoding

For create, I think clients have a lot of freedom, so it's not too difficult. The only thing that occurred to me is needing to know the file size up front, as streams often don't provide that info. Because, in PAR2, it's part of the File ID, and File IDs have to be ordered to determine the input coefficient, you can't really do any calculations without knowing the size (exception: if there's only 1 file). Since this proposal allows coefficients to be arbitrarily chosen by the encoder, it effectively breaks the dependency, so makes it a non-issue.

I haven't thought too much for decoders, but it doesn't seem good. Since the input coefficients are stored with the recovery slices, the decoder would effectively need to know all such coefficients used, from all the recovery slices it'll need, before it can even start doing any computation. It can't simply just grab some metadata packet from an index file (to know the slice size and checksums to detect corruption) - it would need to somehow grab and buffer a particular amount of recovery data. Maybe if the coefficients are stored in a separate packet, perhaps in the index file, it might be easier?

In any case, I'm not sure if anyone's interested in building a streaming decoder, though it might be nice for the format to support it, as I can see a benefit in doing so.

Yes, putting the coefficients in the index file would work. It would mean storing them separately from the recovery data. It's a little more work, but I like the idea of streaming / single-pass.

How do you think this would work on DVD collection that has 1TB of data in 200 files?

Assuming a maximum of 65536 input blocks, each would have to be at least 16MB, assuming the collection is never updated or added to. If we're going to support adding files, you'd probably want to use much larger block sizes to accommodate modifications, which may be slightly unwieldy in some situations, but probably not too big of a problem for this use case.

How do you think this would work on a photo collection that has 300GB of data in 100,000 files?

If we're limited to 65536 input blocks, a single block would have to cover multiple files for this to work. PAR2 only supports this via "packed slices" which no client, that I've seen, suports. Does PAR3 change this, or perhaps relax the need to align files to slices? I presume that if you want to support updating, you'd want files to be aligned to slices to some extent (though you could be more liberal about it, e.g. align every, say, 1GB or so).

You don't need alignment. This design supports more than 2^16 input blocks, but it only has 2^16 unique coefficients for input blocks. So, if you want more than 2^16 Input blocks, you must reuse coefficients. This may mean the resulting matrix is not invertible. So, if you want to do dense matrix actual-Reed-Solomon, then you are limited to 2^16 input blocks and, as a result, 2^16 files. Hmmm...

Thanks for keeping up with this by the way! You're welcome. Sorry about the delay in reply!

animetosho commented 5 years ago

I think any change is work and can introduce bugs.

That's true, but if you stick to that, nothing ever changes and you can never move forward. I'm suggesting the change because the risk is very low, and it's a very simple specification change. Compare this to the change needed to support custom matrices, which is quite complex and has a relatively high chance of bugs from implementations.
Bug fixing is mostly a one time 'cost' anyway, by only the developers, vs paying the price of not implementing improvements, affecting all users over the next 10-20 years.

Anyway, you seem to be fairly invested in the idea of minimal change, so we may have to agree to disagree there.

I'm not introducing a flag. I'm saying GF16 will be it. Unless you can tell me there is a big win with GF8.

Ah, I see. No, I don't think GF8 has much worth for PAR2.
Keeping it to just one GF implementation keeps things nice and simple.

This paper says GF8 and GF16 are about the same speed

Actually, it says GF16 is slower, even if slightly:

w = 4 and w = 8 perform identically. w = 16 performs slightly slower, and w = 32 slower still

Also, there are many more tricks you can do with GF8, which Plank's implementation doesn't cover, so probably doesn't show up well in their benchmarks. A more apt comparison may be with Intel's ISA-L which includes a highly optimized GF8 implementation.

Some newer Intel CPUs can even perform GF8 modular multiplication in a single instruction, though you can do a fairly good job for GF16 using the affine instruction.

Having said that, I don't think the small number of blocks allowed by GF8 is worth it. If anything, with increasing data sizes, I think GF larger than 16-bits has potential utility.
CPUs will continue to get more powerful over time, storage media will grow larger, as will the amount of data people tend to use. Unfortunately, GF width doesn't automatically scale along with these so well...

So, I expect, for both sparse and dense matrices, to be limited to <512 recovery blocks.

But computing more recovery blocks doesn't hurt recovery performance. You only need to compute enough recovery for the amount of damaged blocks you have, and it's trivial for decoders to ignore recovery blocks it doesn't need.
The only downsides to having a large number of recovery blocks are the encode performance and space usage.

The other thing to consider is that many users aren't aware of all the intricacies of block size and performance. For PAR2, they'll know to select a block size large enough to keep input blocks <= 32768, because their tool will complain if they don't, but otherwise, they're likely not thinking about number of recovery blocks other than needing 10% recovery. So, for example, if they have a fair amount of data, they'll probably up the block size enough to stop their encoder from erroring (i.e. around 32k input blocks), request 10% recovery, which will give 3k recovery blocks right there.
I reckon if you look for PAR2 files in the wild, which are covering a decent sized set of data, you'll probably find many with >512 recovery blocks.

Say, the original recovery set is files A and B. The user generates First.par3. Then, the user deletes file B and adds file C. And they generate Second.par3.

I don't quite understand how this marries up with your earlier statement about FileIDs not showing up.
Anyway, if I'm understanding you correctly, this doesn't sound like supporting deletes at all, as the scenario can already be done with PAR2.

We could identify the inputs blocks used by each recovery block using the MD5 hash of the input block

Do you see a problem with having two blocks with different data but the same MD5 hash (since MD5 isn't a collision resistant hash)? PAR2 does have this problem somewhat, though most clients seem to assume that data is generally sequential.

They're shorter than a FileID + offset. And they don't change if you change filenames.

Is there any reason to not just use a 2-byte index instead? It's even shorter.

You don't need alignment. [for files]

This could make supporting updates difficult, if files are packed densely together. E.g. replacing a file with a larger file might either leave holes (act as a delete + add) or require some complex fragmentation management scheme, I'd think.

This design supports more than 2^16 input blocks

Ah, so kind of like having separate Parchives (or recovery sets), but packed into one set?

Thanks for replying!

mdnahas commented 5 years ago

Ah, I see. No, I don't think GF8 has much worth for PAR2. ... Okay. It's not perfect, but we'll keep GF16.

So, I expect, for both sparse and dense matrices, to be limited to <512 recovery blocks.

But computing more recovery blocks doesn't hurt recovery performance. You only need to compute enough recovery for the amount of damaged blocks you have, and it's trivial for decoders to ignore recovery blocks it doesn't need. The only downsides to having a large number of recovery blocks are the encode performance and space usage.

The other thing to consider is that many users aren't aware of all the intricacies of block size and performance. For PAR2, they'll know to select a block size large enough to keep input blocks <= 32768, because their tool will complain if they don't, but otherwise, they're likely not thinking about number of recovery blocks other than needing 10% recovery. So, for example, if they have a fair amount of data, they'll probably up the block size enough to stop their encoder from erroring (i.e. around 32k input blocks), request 10% recovery, which will give 3k recovery blocks right there. I reckon if you look for PAR2 files in the wild, which are covering a decent sized set of data, you'll probably find many with >512 recovery blocks.

But, if a problem happens, can they recover the damage? Let's say they have 1024 recover block and they have 1024 damaged blocks. The program has to invert a 1024x1024 matrix. And the standard algorithm, Gaussian Elimination, runs in O(N^3) time, so we're talking 1024^3 operations.

I think we can invert a 256x256 matrix in less than a minute. 512x512 will probably take 10 minutes. 1024x1024 will take over an hour. That's just a guess - we can test it if you like. But I think that most implementations should use 256 recovery blocks by default and only more if the user explicitly says so.

Say, the original recovery set is files A and B. The user generates First.par3. Then, the user deletes file B and adds file C. And they generate Second.par3.

I don't quite understand how this marries up with your earlier statement about FileIDs not showing up. Anyway, if I'm understanding you correctly, this doesn't sound like supporting deletes at all, as the scenario can already be done with PAR2.

I was mostly focusing on incremental backups. Someone has files from time T, generates a Par3 file, then modifies them a bit, and then generates another Par3 file. I thought the issue was "Can a client use both Par3 files to help recover?"

Are you asking "Can a Par3 generate the second Par3 file quickly given the first Par3 file?" If that's the case, it is easier with Par3, because the matrix coefficients are explicitly written in the file, not implicitly calculated like in Par2. The client can subtract the effect of deleted input blocks from the recovery blocks and remove those coefficients from the matrix. Then, for new files, add new coefficients and apply the effect of new input blocks to the recovery blocks. Easy peasy.

We could identify the inputs blocks used by each recovery block using the MD5 hash of the input block

Do you see a problem with having two blocks with different data but the same MD5 hash (since MD5 isn't a collision resistant hash)? PAR2 does have this problem somewhat, though most clients seem to assume that data is generally sequential.

MD5 can have collisions, but PAR has always assumed it does not. The chances of a non-malicious collision is very very small. PAR has been focused on data recovery from random damage, not data integrity from a malicious attacker. If the user is worried about malicious actors, the user will have to use another tool in addition to PAR, like a file encrypter (PGP, GPG, etc.).

They're shorter than a FileID + offset. And they don't change if you change filenames.

Is there any reason to not just use a 2-byte index instead? It's even shorter.

We need to identify input blocks either by their MD5 hash or by file+offset.

The MD5 hash is longer, but it allowed input blocks to be reused by files. E.g., if a file is duplicated, but has a different name, or if files have overlapping blocks (e.g., the same start of the file). It directly identifies data, rather than indirectly. It actually makes handling filename changes easier.

You don't need alignment. [for files]

This could make supporting updates difficult, if files are packed densely together. E.g. replacing a file with a larger file might either leave holes (act as a delete + add) or require some complex fragmentation management scheme, I'd think.

Sorry, I made a mistake. I meant to say, we don't need to pack files. All files would be aligned --- the start of a file is always the start of a block. The same as in Par2. Each input block contains only the part of a single file (unless the file contents are identical).

This design supports more than 2^16 input blocks

Ah, so kind of like having separate Parchives (or recovery sets), but packed into one set?

The usual case is more an optimistic hope-we-don't-get-collisions approach. That is, I hope we don't have input blocks missing with the same coefficient in the few recovery blocks we have. We'll be able to recover from most problems, because missing inputs are likely to have different coefficients in some recovery blocks. That is, the matrix is likely to be invertible. Of course, moving to a larger Galois Field would reduce the chance of this problem.

The idea of "multiple recovery sets in a single file" is also possible. Tornado Codes basically do this. They assume some recovery block is only missing 1 input and they recover that single input. Then, they repeat, with the hope that some other recovery block is now missing just 1 input. When we write the spec, we will probably have to address how to recover these kinds of cascading codes.

Thanks for replying! And thank you for your comments!

animetosho commented 5 years ago

But, if a problem happens, can they recover the damage? Let's say they have 1024 recover block and they have 1024 damaged blocks

If I have important data which has become corrupted, waiting 1 hour for it to recover sounds quite reasonable to me. In fact, leaving the computer churning for several hours or even days on the task may be worthwhile, depending on the importance of it. I don't know whether others feel the same way.

1024x1024 will take over an hour. That's just a guess - we can test it if you like

My back-of-the-envelope calculations tell me it shouldn't take anywhere near that long. But no need to speculate, as it's easy to test.

I created a PAR2 from a 4.38GB (DVD-sized) file with 1000 input blocks and 100% recovery. Then I got the latest Multipar to recover from this without the original (i.e. 1000 damaged blocks). Inversion looks like it completed in under a second on this machine, and the entire recovery process (including reading/writing files, computing recovery etc) took just over 2 minutes.
Note that I did not enable GPU acceleration - this is just the CPU only.

But 1000 blocks isn't much. I decided to test the upper limit of PAR2 - 32768 recovery blocks.

As an aside, creating this PAR2 requires computing roughly 4.38GB*32768 = 140.2TB worth of data. Using ParPar v0.3.0, default settings, this took 26.5 minutes:

Multiply method used: Shuffle (512 bit), 16 threads
Generating 4486.38 MiB recovery data (32768 slices) from 4486.37 MiB of data
Calculating: 100.00%
PAR2 created. Time taken: 1590.923 second(s)

I got Multipar to recover with this file. It looks like Multipar only uses 2 threads during matrix inversion, which really doesn't take advantage of this CPU's 8 cores (is there some limitation on parallelism for Gaussian Elimination?), but anyway, letting it run for a while, it estimated the total time to invert to be around 76 minutes:

Multipar screenshot

I'm using an 8 core / 16 thread CPU for this test, which should be comparable to what one could find new mid-high end consumer desktop bought in 2019.
It may be possible to improve performance over what Multipar can offer today, but I don't know enough about the algorithm to judge (use more threads? GPU acceleration? faster algorithms?).

Are you asking "Can a Par3 generate the second Par3 file quickly given the first Par3 file?"

Yes. I suspect most users would think of 'updates' as something like that. I'd imagine that generating additional files and keeping old Parchives (in their mind: "not relevant") to not be the natural thing to do.
For example, in a traditional ZIP archive, updates generally replace stuff inside the archive. Most don't really consider some "delta ZIP" archive created afterwards to really be an update to the original, although one could certainly take that approach for cases like incremental backups (but deletes still feel awkward).

The client can subtract the effect of deleted input blocks from the recovery blocks and remove those coefficients from the matrix. Then, for new files, add new coefficients and apply the effect of new input blocks to the recovery blocks.

Okay, so this sounds like the update process may need to rewrite a fair bit of the PAR file, since it'd have to touch every block that the input was included in, and recompute the recovery data from the original file to perform the subtraction. I presume this could also leave 'holes' in the Parchive, but if files need to be block aligned (as you mention later), this probably isn't an issue.

I'm not sure how practical requiring the original files, to perform updates, is. To me, the typical use case would be that someone updates a file (without retaining the original/pre-updated copy), and then wishes to incorporate this update into the PAR - which doesn't seem possible because you can't compute the effect that the original file had to perform the necessary subtraction.

The chances of a non-malicious collision is very very small. PAR has been focused on data recovery from random damage, not data integrity from a malicious attacker

I don't see a malicious actor being an issue, rather, inadvertent user error. It doesn't seem inconceivable for a user to have examples of MD5 collision files on their disk (e.g. if they're doing research on the topic, or maybe there was something malicious but it wasn't targeting PAR specifically, and caused "MD5 duplicate" to exist on disk). If a PAR was created off this data, then you could end up with such MD5 collision issues, even though the PAR file wasn't maliciously tampered with.

I don't see this being a common case, but then, we know that PAR2 can sometimes fail due to faulty math, and that doesn't tend to show up often either.

We need to identify input blocks either by their MD5 hash or by file+offset.

My idea was that you only needed the MD5 for matching with external data, whereas you could just reference them internally via a 16-bit index.
However, if the aim is to support more than 2^16 input blocks, as you mention later, this wouldn't work, so referencing via MD5 hash does sound reasonable.

That is, I hope we don't have input blocks missing with the same coefficient in the few recovery blocks we have. We'll be able to recover from most problems, because missing inputs are likely to have different coefficients in some recovery blocks. That is, the matrix is likely to be invertible.

I think I've understood what you're trying to get at here now; I presume this also means that you could have >65k recovery blocks as well?

mdnahas commented 5 years ago

My back-of-the-envelope calculations tell me it shouldn't take anywhere near that long. But no need to speculate, as it's easy to test.

Thank you for testing it! My back-of-the-envelope number also said it should be faster. I'm now wondering about my memory ... maybe it was a 2000x2000 matrix that took 4 minutes on my old system? In any case, thanks for looking into it!

If I have important data which has become corrupted, waiting 1 hour for it to recover sounds quite reasonable to me. In fact, leaving the computer churning for several hours or even days on the task may be worthwhile, depending on the importance of it. I don't know whether others feel the same way.

I think it depends on the application. For really valuable data, people will wait a long time.

Are you asking "Can a Par3 generate the second Par3 file quickly given the first Par3 file?"

Yes. I suspect most users would think of 'updates' as something like that. > I'm not sure how practical requiring the original files, to perform updates, is. To me, the typical use case would be that someone updates a file (without retaining the original/pre-updated copy), and then wishes to incorporate this update into the PAR - which doesn't seem possible because you can't compute the effect that the original file had to perform the necessary subtraction.

Yeah, so that usage case doesn't seem to work. I don't think the other use case (2 PAR files where the recovery sets intersect but aren't equal) is common, but it might be helpful for some people.

The chances of a non-malicious collision is very very small. PAR has been focused on data recovery from random damage, not data integrity from a malicious attacker

I don't see a malicious actor being an issue, rather, inadvertent user error. It doesn't seem inconceivable for a user to have examples of MD5 collision files on their disk (e.g. if they're doing research on the topic, or maybe there was something malicious but it wasn't targeting PAR specifically, and caused "MD5 duplicate" to exist on disk). If a PAR was created off this data, then you could end up with such MD5 collision issues, even though the PAR file wasn't maliciously tampered with.

I understand it might happen. Most people who keep those around know they can screw with a tool. One option is to have the client detect the problem.

We need to identify input blocks either by their MD5 hash or by file+offset.

My idea was that you only needed the MD5 for matching with external data, whereas you could just reference them internally via a 16-bit index. However, if the aim is to support more than 2^16 input blocks, as you mention later, this wouldn't work, so referencing via MD5 hash does sound reasonable.

That is, I hope we don't have input blocks missing with the same coefficient in the few recovery blocks we have. We'll be able to recover from most problems, because missing inputs are likely to have different coefficients in some recovery blocks. That is, the matrix is likely to be invertible.

I think I've understood what you're trying to get at here now; I presume this also means that you could have >65k recovery blocks as well?

Yep. We could actually support Tornado codes, which have a lot of recovery blocks.

animetosho commented 5 years ago

One option is to have the client detect the problem.

Ah, that's a very good point!

mdnahas commented 4 years ago

I've started thinking concretely about the next version of Parchive file format. I'd like to write the spec in the next week or so.

I have two major goals. The first is has to do with archiving programs ("archivers"). These programs, like tar, zip, etc., do a lot of things: preserve file attributes, join multiple files into one file, compression, encryption, etc.. I think the primary goal of Par3 should be to do the last two steps well: create redundant data and split the data into multiple pieces (for transmission or storage). Par3 should have minimal support for the other features of archiving; if users want those other features, the other archivers do it well.

The other major goal with Par3 is to support any linear systematic code. Client writers can support Vandermonde Reed-Solomon, Cauchy Reed-Solomon, LDPC, sparse matrix, or any thing else. This should also fix the Reed-Solomon issue we've had in Par1 and Par2.

I have a lot of minor goals for Par3. Support more files. Support empty directories. Support UTF-8 filenames. Support simple name changes. Support multiple Par3 files with overlapping recovery sets. Support single-pass creation/recovery in most cases. (Obviously, when a file is damaged, it may require 2 passes, but I think most can be done in a single pass.) Support Par3 recovery data inside a ZIP file (or other file format). Keep file overhead predictable and low.

Because I'd like to do single pass creation/recovery, the file will usually contain these packets in this order (although some will be duplicated, obviously).

creator packet, saying what program created it
file "hints" packet, saying filenames and sizes
matrix packet (see below)
input file slice packets or input file slice checksum packet
recovery slice packet
file description packet, containing checksums for files
main packet

One big change is having 2 kinds of file packets. This is necessary to support single-pass creation and recovery. The first file packet is a "hint" packet that says what the file name is and how large the file is. This way, the client decoding the Par3 file knows how much data will be arriving and can create a file with the right name to put the data into. However, the client doesn't get the checksum for the file until it receives the file description packet at the end.

The other big change is the addition of a matrix packet. That packet will specify the slice size, the amount of input and recovery data, and matrix coefficients used to create the recovery data. The packet format that will allow complete specification of the matrix. Since a whole matrix is large, there will probably be a special cases for common matrices. Those special cases will support Cauchy Reed-Solomon matrices, sparse random matrices, and sparse non-random matrices. (Cauchy matrices are easier to generate than Vandermonde and there are fast algorithms for inverting them. Sparse random matrices, where each input slice is in log(N) recovery slices, are fast to compute and almost as good as Reed-Solomon. Sparse non-random matrices support LDPC codes.)

How does that sound?

Do people have comments on:

Using 4-byte alignment or raising it to 8, 16 or higher?
Using CRC32 as the "rolling" hash to detect file blocks.
Using MD5 as the file/block hash. Is there a faster + better one?
Is the 16k hash still a good way to identify misnamed files?
What to use as the FileID?
What to use as the Recovery Set ID?
Keep 16-bit Galois Field, or support other sizes?
Since matrices are changable, should we support changes to the Galois Field generator?

Is there anything else that we need to fix or is important for the next version of Par?

animetosho commented 4 years ago

Splitting the file info and hashes is an interesting idea. Would there be split slice checksum packets too?
I'm guessing the idea is that a streaming encoder could write the file checksums once it's processed all input, but slice checksums would have a similar issue. A decoder would want the slice checksums fairly early for verification. I think the only solution would be to allow the slice checksums packet to be split up and interleaved.

Note that this only matters if the matrix used allows recovery to be generated without having it all fully buffered in memory. Otherwise, a "streaming" encoder could easily write the checksum packet at the beginning, since it has to buffer the bulk of the file anyway.

Although there's a suggested order, I presume ordering of packets is up to the encoding client, like PAR2?

I've already given a bunch of my thoughts for a bunch of the points, so won't repeat myself on those.

Using 4-byte alignment or raising it to 8, 16 or higher?

I'm actually not sure what the point of the alignment requirement was. To help with using struct based parsing, maybe??
If the client wants to support recovering from deleted/inserted bytes (in the Parchive file itself), it can't rely on the alignment being correct anyway.

Is the 16k hash still a good way to identify misnamed files?

Personally it feels a little fragile (e.g. won't work if the first block is invalid), but it's only to handle the unusual case of misnamed files, so probably doesn't matter much.

Since matrices are changable, should we support changes to the Galois Field generator?

Is there any benefit to allowing this to be changed? A defined generator can help keep implementations simple, though may not matter so much for GF16 since you can't hard code tables as easily, unlike GF8.

The only thing I could think of was when I was considering a CLMUL implementation. The current polynomial doesn't reduce well via CLMUL (requires 4 multiplies), so a more easily reduced default could help with such an implementation, but I ultimately don't see it being useful even then. If you really cared, you could just pick a different default - I can't see much reason to make it customizable.

mdnahas commented 4 years ago

I think it is clearer if I explain how a client generates a Par3 file in a single pass and how it recovers in a single pass.

The client generating the Par3 file will allocate a buffer in RAM for each recovery slice. Then the client reads the input slices. For each input slice and each recovery slice, it multiplies the input slice by a coefficient from the matrix and adds it to the recovery slice's buffer. In parallel, the client computes the checksum for the input files. After reading all the input files, the buffers are written out into the recovery slice packets. Then, the checksums are written out into the file description packets.

Now, let's look at the client doing recovery in a single pass. This client also allocates a buffer in RAM for each (expected) recovery slice. The client reads the input files and finds valid slices. The client writes the data to the appropriate file and, for each valid input slice and each (expected) recovery slice, the client multiplies the data by a coefficient from the matrix and subtracts it from the recovery slice's buffer. After the input slices, the client searches for valid recovery slice packets. The value of each valid recovery slice is added to each recovery slice buffer. If an expected recovery slice does not arrive, its buffer is destroyed. Then, if there are input slices missing, the matrix is inverted and the data in the buffers is used for recovery. The recovered data is written out. If recovery isn't necessary, the checksum can be calculated in parallel and, at the end, compared to the value in the file description packet.

Now, I think I can answer your questions. The decoder doesn't need recovery slice checksums at the start, because the decoder assumes the recovery slices will arrive before we need to invert the matrix. Because addition for the Galios Field is commutative, we can subtract the effect of each input slice before adding the data stored in the recovery slice packet.

Notice that at any point, we only need in memory: the K recovery slice buffers, 1 input slice and 1 row (column?) of the matrix.

This approach cannot be used if we need more recovery data than we have RAM. It may also be faster for the decoder not to compute all the recovery blocks where only a few are needed to recover the data. So, decoding in 2 passes may be faster than decoding in 1.

Although there's a suggested order, I presume ordering of packets is up to the encoding client, like PAR2?

Yes. But if you want the receiver to do recovery in a single pass, they receiver has to get the packets in a certain order. If they don't, the receiver will have to fall back to a two-pass algorithm.

byte alignment

Some processors and memory systems are faster if data is aligned on 2^N-byte boundaries. E.g., if a multi-byte value crosses a cache line boundary, a read or write will cause the cache system to do 2 operations instead of 1. If you look at how loops are unrolled, a common technique is to unroll 4 or 8 copies of the loop so that the CPU and memory system works more efficiently. While Par clients need to handle a loss of a single byte, it is probably best if every packet is 8-byte aligned.

Galois Field Generator

Some of the research by James S. Plank has to do with fast Galois Field arithmetic. He's also done research on fast Cauchy matrix operations that depend on the Galois Field used. We've also talked about using 8-bit or 32-bit Galois Fields. So, a change in the future might be done.

I was thinking of including fields for the Galois Field size and generating polynomial. Initially, the only valid values would be 16 and 0x0001100B. But in the future, we might add others.

More thoughts

I think CRC32 is good to keep
MD5 may be replaced, if we can find something fast and good. Does anyone know anything about BLAKE3 or KangarooTwelve? They are reportedly 10 times faster and more secure. I'm not tied to a cryptographically secure hash, but I couldn't find non-cryptographic hashes that I liked.
Do people worry about a Par file inside another Par file?

animetosho commented 4 years ago

Thanks for the explanation.

If the encoder needs to buffer every recovery slice in memory, then it shouldn't have any issue with writing the checksums at the beginning of the file, or wherever it likes, so splitting the metadata and checksums seems to be unnecessary in that case.
In other words - to use your example - after reading all the input files, the encoder could write the file checksums, then write the buffers as recovery slice packets - I don't see any particular reason why it would have to write the checksums after the recovery slices.

I can see a benefit if the encoder doesn't need to buffer every single recovery slice in memory, i.e. the matrix is chosen in a way that some recovery packets can be generated without all the input, in which case the encoder could use less memory by writing out processed recovery slices before it knows the full checksum.

The client reads the input files and finds valid slices

Wouldn't the client need to have the input slice checksum to know if a slice is valid or not?

The decoder doesn't need recovery slice checksums at the start

To clarify, when I mentioned slice checksums I was referring to the input slice checksums. Recovery slice checksums are included inside the recovery packets themselves (or is that changing?).

Some processors and memory systems are faster if data is aligned on 2^N-byte boundaries. E.g., if a multi-byte value crosses a cache line boundary, a read or write will cause the cache system to do 2 operations instead of 1

The in-memory representation doesn't have to have any bearing over the on-disk representation. Even if it somehow does (mmap'd access perhaps?), you have a few things to consider:

misalignment handling is generally quite efficient on modern processors
parsing PAR2 (or writing it out) generally isn't a bottleneck or performance concern
efficient alignment can be very platform specific and hard to generalize. For example, an AVX512 algorithm prefers 512-bit (or 64-byte) alignment, so no 4/8/16 byte alignment requirement is going to be sufficient, unless you're willing to have lots of padding throughout the file
memory representation may need to be different to on-disk representation, e.g. big vs little endian, if you need strings in UTF-16 etc

Given that the decoder will need to be coded to deal with misalignment, and that the encoder can pick arbitrary alignment, if it sees any benefit in doing so, I don't see much reason in the spec forcing alignment requirements.
On the flipside, I don't really see much harm in enforcing alignment either.

Some of the research by James S. Plank has to do with fast Galois Field arithmetic. He's also done research on fast Cauchy matrix operations that depend on the Galois Field used.

I believe the fast algorithm described in that paper works with any generator. I don't know anything about the fast Cauchy matrix operations, so I suppose there's an example where it could be useful.

Does anyone know anything about BLAKE3 or KangarooTwelve? They are reportedly 10 times faster and more secure.

The speed is gained from SIMD, which MD5 can't really use. This makes them much faster for a single buffer setup.
For a multi-buffer MD5 SIMD approach, MD5 is likely faster. The other hash functions probably don't gain anything from multi-buffer as they already use SIMD for single-buffer.

Multi-buffer hashing is harder to deal with, so I would definitely prefer a faster single-buffer hash even if it means that multi-buffer becomes unavailable. ParPar is the only PAR2 client I know of which employs a multi-buffer MD5 hash.

Do people worry about a Par file inside another Par file?

Like PAR protecting another PAR? I'm not sure of the use case...

mdnahas commented 4 years ago

Thanks for the explanation.

If the encoder needs to buffer every recovery slice in memory, then it shouldn't have any issue with writing the checksums at the beginning of the file, or wherever it likes, so splitting the metadata and checksums seems to be unnecessary in that case.

True. The input file checksums can go before or after the recovery data. I was thinking of the case of streaming, where the Par3 file contains the input file slices. In this case, the client sending the stream can't send the checksum until it gets to the end of the data. So, the input file checksums have to go after the input file slices. They could go before or after the recovery data.

I'd recommend putting them at the end of the file. If a client wants to search for them, they're easiest to find there. ZIP and other formats put the directory summary at the end for that reason.

The decoder doesn't need recovery slice checksums at the start

To clarify, when I mentioned slice checksums I was referring to the input slice checksums. Recovery slice checksums are included inside the recovery packets themselves (or is that changing?).

Sorry, my sentence doesn't make sense. I was trying to say that we don't need the recovery slice data before we start processing input slice data. The usual way of processing is loading the recovery data and then subtract the effect of the input slices. But, because addition is commutative, this can be reversed. So, in single-pass recovery, you start with a zeroed buffer, subtract the effect of the input slices and then add the recovery data. Does that make sense?

The in-memory representation doesn't have to have any bearing over the on-disk representation. Even if it somehow does (mmap'd access perhaps?), you have a few things to consider:
* misalignment handling is generally quite efficient on modern processors
... Given that the decoder will need to be coded to deal with misalignment, and that the encoder can pick arbitrary alignment, if it sees any benefit in doing so, I don't see much reason in the spec forcing alignment requirements. On the flipside, I don't really see much harm in enforcing alignment either.

As for misalignment on modern processors, Intel's optimization guide mentions alignment 800+ times.

Yes, I was thinking of memory-mapped files. You have many good points, but I think it costs us little and may make it easier for some.

Some of the research by James S. Plank has to do with fast Galois Field arithmetic. He's also done research on fast Cauchy matrix operations that depend on the Galois Field used.

I believe the fast algorithm described in that paper works with any generator. I don't know anything about the fast Cauchy matrix operations, so I suppose there's an example where it could be useful.

True. I also looked at his Cauchy matrix paper and it doesn't seem to depend on the GF generator.

Still, I like including it, in case we find a faster GF algorithm.

Does anyone know anything about BLAKE3 or KangarooTwelve? They are reportedly 10 times faster and more secure.

The speed is gained from SIMD, which MD5 can't really use. This makes them much faster for a single buffer setup. For a multi-buffer MD5 SIMD approach, MD5 is likely faster. The other hash functions probably don't gain anything from multi-buffer as they already use SIMD for single-buffer.

Multi-buffer hashing is harder to deal with, so I would definitely prefer a faster single-buffer hash even if it means that multi-buffer becomes unavailable. ParPar is the only PAR2 client I know of which employs a multi-buffer MD5 hash.

Ok. I looked a bit at Blake3 and KangarooTwelve and it seems they only get the faster speed on really big pieces of data. They'd be much faster for checksumming files, but maybe not faster on smaller Par packets.

There are faster non-cryptographic hashes that we could use for packets. But I'd rather keep fewer hash codes, unless it is a performance bottleneck.

Do people worry about a Par file inside another Par file? Like PAR protecting another PAR? I'm not sure of the use case...

I don't think people want to do it on purpose. I can't think of a good use case. I'm more worried that someone does it by accident and a client trying to decode sees Par packets inside Par packets.

animetosho commented 4 years ago

Ah, for embedding input slices into the Parchive itself, yes, you'd need to write the checksums after the input data.

So, in single-pass recovery, you start with a zeroed buffer, subtract the effect of the input slices and then add the recovery data. Does that make sense?

Yes, but what I was referring to was actually how the decoder knows whether the input (not recovery) it's been given is valid. It needs to know that it's valid before it can compute the recovery effect whilst it's streaming.

To do this, it needs the checksums of the input slices. If the input slices are embedded in the Parchive itself, it's not really an issue since each packet has a checksum. If the input data is external, there needs to be a packet listing the input checksums.

My original thought was that if some recovery could be computed without all input slices, you'd need the input slice checksums fairly early. But it doesn't look like this is an aim, so I guess it doesn't really matter.

As for misalignment on modern processors, Intel's optimization guide mentions alignment 800+ times.

Intel processors prior to Nehalem (released in 2008), plus first gen Atom, have fairly bad misalignment handling, so I wouldn't be surprised if there's a lot written about them. On current Intel processors, it's almost free - I think the only penalty is when an access crosses a 64-byte cacheline, in which case, it just acts as if you did two accesses.
Having said that, aligned is generally better than unaligned, even if the difference is minimal. One thing to keep in mind with large alignments though, is cache size and bandwidth - lots of additional padding can be a waste of cache space and bandwidth.
(though at the end of the day, none of this really matters much)

I think it costs us little and may make it easier for some.

If you're looking to make alignment issues easier to deal with, one thing I'd consider is sticking to power-of-2 sizings where it makes sense. For example, the recovery slice's header is 68 bytes (64 bytes for PAR2 packet + 4 byte exponent), but I need the recovery data to be 64 byte aligned for efficient 512-bit SIMD. I deal with this by allocating 128 bytes for the header (64 byte aligned), and offset the header by 60 bytes so that the data ends up sitting on a 64 byte boundary. This isn't hard to do, but if you're looking to simplify things, trying to fit the recovery header in a power-of-2 size (e.g. 64 bytes) is something to consider.

I looked a bit at Blake3 and KangarooTwelve and it seems they only get the faster speed on really big pieces of data.

This is true for anything performance related really. The biggest differences will be on large inputs. Small inputs will be dominated by overhead, so improvements will seem smaller.

but maybe not faster on smaller Par packets.

Unless you're concerned about <0.5KB slices sizes, they most definitely will be faster.
If you are concerned about such small sizes, well, there'll never be enough data with the 32768 slice limit for it to matter much.

I'm more worried that someone does it by accident and a client trying to decode sees Par packets inside Par packets.

That's an interesting point. I suppose this was the point of the recovery set ID - to be able to differentiate between multiple potential nested PAR2 sets.

If a decoder sees multiple sets, I guess there's a few things it could do to resolve the issue:

assume that the last packet reflects the real Parchive?
try to determine the outermost set of packets by looking at the ranges they cover (e.g. the outermost Parchive should have ranges which enclose inner Parchives). Decoder may have to be coded to deal with potential conflicts from specifically crafted Parchives
ask the user?

I think CRC32 is good to keep

Probably should've mentioned what I said earlier:

Probably would select CRC32C over CRC32 due to better detection properties

Also note that x86 has an instruction for computing CRC32C (though it's named CRC32), so there's another benefit of CRC32C over CRC32.

animetosho commented 4 years ago

Since matrices are changable, should we support changes to the Galois Field generator?

Actually, I just remembered another use case: the GF2P8MULB instruction has a defined polynomial (performs GF8 multiplication with polynomial 0x11b).

I don't know whether GF8 multiplication can be used in GF16, but if there's ever any GF16 multiply instruction with a defined polynomial, it could be used.