mdnahas opened this issue 5 years ago
will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?
As a file format, it's designed to support any Galois field and any code matrix. But someone needs to write source code to use FFT-based Reed-Solomon codes for PAR3. Though there are some papers about the technology, I could not understand the math. Actually, I cannot understand even the current recovery codes in the PAR3 spec. (It is hard for me to read mathematical terms in English.) That is why I cannot write a reference implementation of a PAR3 client.
As a file format, it's designed to support any Galois field and any code matrix.
The "any code matrix" part is what I don't understand. I'm able to implement FFT based Reed-Solomon but I can't see any matrix there. (But it is possible it's hidden somewhere in the algorithm as I don't really understand it.)
I could not compile KangarooTwelve's sample code on Microsoft Visual Studio 2019. Does someone know where to get KangarooTwelve's optimized source code for MSVC?
They give some info here. Basically you need to use the make tool under Cygwin/MSYS2 to generate the .vcxproj file. I've attached the output of such below, in case you are unable to do it:
As they mention, assembly optimisations aren't available on MSVC because they're written for GCC. Perhaps it's possible to assemble these into object files, then disassemble into MASM format, but I haven't bothered checking that.
I do note that the assembly optimised version does run noticeably faster than the non-assembly version.
It looks like BLAKE3 internally uses a tree, which means that it can scale across cores without issue. This avoids the single-threaded hashing bottleneck that currently exists in PAR2.
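For reference, a minimal single-threaded sketch using the official BLAKE3 C API (blake3.h from the BLAKE3 repository); because the tree mode is internal to the hash, an implementation that parallelises across cores produces the same 32-byte digest as this call:

#include <stddef.h>
#include <stdint.h>
#include "blake3.h"   /* official BLAKE3 C API */

/* Hash a buffer with BLAKE3.  The tree structure is internal to the
   algorithm, so a multi-threaded implementation produces the same
   digest as this single-threaded call. */
static void hash_buffer(const void *data, size_t len, uint8_t out[BLAKE3_OUT_LEN])
{
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, data, len);
    blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
}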
Hi folks, I've been gratefully watching your work for some time and am eagerly awaiting the PAR3 spec to solve some of the problems we're facing in the space where I work (scientific computing). Our use case is that we have millions of files (mostly video and audio from sensors - e.g. TIF, JPEG, MP4) in a dataset, and the dataset grows slowly over time - for example, our botanical scientists have over 1 million ultra-high-resolution photos of plants (around 500 MB per photo), which are used to identify changes in plant genetics. We have multiple similar use cases, and adopting a normal cloud service for this (e.g. AWS S3) would be totally cost prohibitive. We currently store this data in a messy combination of onsite NAS and tape storage, and we want to try adopting Usenet as a place to store and share these huge datasets. But error correction with PAR2 on a 500 TB file collection, given a 700 KB article size and the 32768-block limit, is obviously very inefficient - each PAR2 block would span approx. 21798 Usenet articles!
So I was excited to read the spec as posted by @mdnahas (thank you very much for your wonderful work!), but it does seem, firstly, that there is an extension of scope here that makes PAR3 harder to implement. Commenting against your design goals from line 40, the items below 'make sense' to me (as a non-expert!) as an improvement of PAR2 as it is used today for Usenet binaries:
support more than 2^16 files
support UTF-8 filenames (Par2 added this after version 2.0.)
support more than 2^16 blocks
replace MD5 hash (It is both slow and less secure.)
support "tail packing", where a block holds data from multiple files
However the below seem to be new features that don't fit in with the Usenet PAR2 workflow. To comment on each from the perspective of my use case (and indeed, anyone who wants to create redundancy on a large number of files in one PAR3 set, like the poster above who wants to back up 100K images):
* support "incremental backups"
- this is obviously not supported today, but would be fantastic for us. We would love to be able to add a file to a PAR set, and have the program produce some extra files that, alongside the original files, would allow for the restoration of this new file. But I don't believe this is mathematically possible without recreating the entire PAR set?
* support deduplication, where the same block appears in multiple files
- shouldn't add much overhead, but not really valuable, as a modern archive format (e.g. 7zip) already does this via its 'compression dictionary'.
* dropped requirement for 4-byte alignment of data
- this would only save 3 bytes per block, best case?
* support empty directories
- not sure why someone would need this when the archive format can handle this?
* support file permissions
- security practices I've seen would only ever set permissions at the parent folder/'container' level, making file based permissions unnecessary? Additionally, as permissions work differently between different OS's, it seems like a messy requirement that doesn't add value
* support hard links and symbolic links
- once again, not sure why someone would need this? I would just ship my archive with a little script that creates these if necessary?
* support files that work both as a Par3 file and another type. For example, putting recovery data inside a ZIP, ISO 9660 disk image, or other file.
- I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?
* support for storing data inside a Par3 file (This was optional in Par2.)
- I don't exactly understand this feature?
* support any Galois field that is a power of 2^8
- The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?
* support any linear code (Reed-Solomon, LDPC, random sparse matrix, etc.)
- Same as above.
I appreciate that there are going to be users of PAR3 who do have use for the above features! But are they the users that the developers wish to address? They shouldn't feel the pressure to address them if they don't want to. I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.
Instead, if one wants to create a new archive format that captures more of the "definitions" (e.g. archiver, checksummer, grouper, etc. as described prior by @mdnahas) it would be extremely welcome work, as it would remove the need for multiple file passes and could include support for things like on-the-fly error correction of video streams - all very welcome, but probably outside the user-perceived scope of PAR3 :)
Please forgive me if I have misunderstood any of the above, and my apologies if this post is unwelcome. Thanks again!
As they mention, assembly optimisations aren't available on MSVC because they're written for GCC.
Yes, this is the problem. They might have forgotten to include the AVX2noAsm and AVX512noAsm version code in the package. The eXtended Keccak Code Package seems to contain them. Thanks, Mr. Anime Tosho, for the advice.
I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.
I thought so, too. Users' voices are welcome. Thanks, Mr. Dimitri Vasdekis. Because I'm a lazy programmer, I felt that a small improvement for current usage might be enough. I'm not sure that I will be able to implement some of the new features in the PAR3 specification. But I will try, when users request a feature in the future.
@dvasdekis writes:
Hi folks, I've been gratefully watching your work for some time and am eagerly awaiting the PAR3 spec to solve some of the problems we're facing in the space where I work (scientific computing). Our use case is...
I'm glad Parchive is being considered. I love hearing about users and new use cases.
the below 'make sense' to me (as a non-expert!) as an improvement of PAR2 as it is used today for Usenet binaries : ...
However the below seem to be new features that don't fit in with the Usenet PAR2 workflow. To comment on each from the perspective of my use case (and indeed, anyone who wants to create redundancy on a large number of files in one PAR3 set, like the poster above who wants to back up 100K images):
* support "incremental backups"
- this is obviously not supported today, but would be fantastic for us. We would love to be able to add a file to a PAR set, and have the program produce some extra files that, alongside the original files, would allow for the restoration of this new file. But I don't believe this is mathematically possible without recreating the entire PAR set?
An example of incremental backup would be: you have 100 files and protect them with File1.par3, and then add 10 more files and protect all of them with File2.par3. A client will be able to read both File1.par3 and File2.par3 and use them to recover the 110 files. Obviously, File1.par3 will not protect any of the 10 new files.
The incremental backup case makes sure that you can use older Par3 files with newer ones. You can get by with a smaller File2.par3 because File1.par3 exists.
I suppose a client could read File1.par3, compute the effect of the 10 new files, and write File3.par3, which would protect all 110 files. But that is not the "incremental backup" case. That, in fact, could be done with Par2 right now.
* support deduplication, where the same block appears in multiple files
- shouldn't add much overhead, but not really valuable, as a modern archive format (e.g. 7zip) already does this via its 'compression dictionary'.
Yes, if you want to compress all your files first. If you want to protect files on a USB drive and keep the files uncompressed, the feature is valuable. So the feature is not useful for the Usenet use case, but that is not the only use case.
* dropped requirement for 4-byte alignment of data
- this would only save 3 bytes per block, best case?
That is a longer discussion. You can find it discussed above in this issue thread.
* support empty directories
- not sure why someone would need this when the archive format can handle this?
You're assuming someone compresses first.
* support file permissions
- security practices I've seen would only ever set permissions at the parent folder/'container' level, making file based permissions unnecessary? Additionally, as permissions work differently between different OS's, it seems like a messy requirement that doesn't add value
Yes, it is messy. It may not add value to your use case, but it does to people protecting a USB drive.
* support hard links and symbolic links
- once again, not sure why someone would need this? I would just ship my archive with a little script that creates these if necessary?
Now, you're assuming the receiver will run a script after receiving the data?? Yeah, I would never do that.
* support files that work both as a Par3 file and another type. For example, putting recovery data inside a ZIP, ISO 9660 disk image, or other file.
- I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?
The use case is when you want to send a file that works both as a Par3 file and a ZIP file. That is, one where, if a user only has ZIP, they can unzip the file; but if they have a Par client, they can fix the damaged file and then unzip it.
* support for storing data inside a Par3 file (This was optional in Par2.)
- I don't exactly understand this feature?
Normally with Par2, you send the original files and a .par2 file. In this usage, you only send a .par2 file, but the data for all the original files is stored inside the .par2 file. It makes .par3 behave like most other archive file formats.
The feature was optional in Par2. I was hoping that clients would use it and that Usenet would stop using RAR to split files. With a compressed video file, RAR compression does almost nothing --- people were only using RAR to split files into pieces. I had hoped that it could be done with Par2. In Par3, I made the feature mandatory.
* support any Galois field that is a power of 2^8
- The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?
First, I don't think supporting any Galois field is very complicated. Second, picking a "sane default" is hard. Lastly, some algorithms (like FFT Reed-Solomon) want to choose a particular Galois field.
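As a rough illustration of what "any Galois field" means in practice, here is a minimal sketch of multiplication in GF(2^16) using 0x1100B, the generator polynomial from the PAR2 specification; under the PAR3 draft the field size and polynomial would simply be parameters declared in the file rather than fixed values (the function name is just for illustration):

#include <stdint.h>

/* "Russian peasant" multiplication in GF(2^16) with the PAR2 generator
   polynomial x^16 + x^12 + x^3 + x + 1 (0x1100B).  A PAR3 client could
   use another field size/polynomial declared in the packet instead. */
static uint16_t gf16_mul(uint16_t a, uint16_t b)
{
    uint32_t x = a, r = 0;
    while (b) {
        if (b & 1)
            r ^= x;           /* add (XOR) the current multiple */
        b >>= 1;
        x <<= 1;              /* multiply by x */
        if (x & 0x10000)
            x ^= 0x1100B;     /* reduce modulo the field polynomial */
    }
    return (uint16_t)r;
}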
* support any linear code (Reed-Solomon, LDPC, random sparse matrix, etc.)
- Same as above.
This offers the trade-off of more protection vs. speed. Again, I think picking a "sane default" is hard. Also, it allows future development of algorithms.
I appreciate that there are going to be users of PAR3 who do have use for the above features! But are they the users that the developers wish to address? They shouldn't feel the pressure to address them if they don't want to. I feel like there is a branching design goal here - is PAR3 just 'better than' PAR2 for the same use cases? I feel like this is the user expectation.
My goal has always been to expand Parchive's use cases. If you read the design for Par1, you'll see it was very, very tied to Usenet and to RAR. Par2 is used in other places besides Usenet. I am actually disappointed by how limited the Par3 design is --- I wanted it to be used for many more things. But, because the data can be stored external to the Par3 file, the names of the files containing that data need to be readily accessible in the format. It actually took me a while to design Par3 because I struggled against that constraint.
Instead, if one wants to create a new archive format that captures more of the "definitions" (e.g. archiver, checksummer, grouper, etc. as described prior by @mdnahas) it would be extremely welcome work, as it would remove the need for multiple file passes and could include support for things like on-the-fly error correction of video streams - all very welcome, but probably outside the user-perceived scope of PAR3 :)
I had hoped to do that. Unfortunately, if I wanted Par3 to still be useful to Usenet, it couldn't be done. It will have to be done by some other designer with another file format. It really should be done --- "tar" is old and crusty.
Please forgive me if I have misunderstood any of the above, and my apologies if this post is unwelcome. Thanks again!
Thanks for caring enough to comment. As I said, it's always interesting to hear from users and hear new use cases.
we want to try adopting Usenet as a place to store and share these huge datasets
I personally can't vouch for this being a great idea, but you're welcome to try I guess. Concerns are mostly around long term viability, but if it's just a backup copy (or just for distribution), it's probably not a big issue.
If you do go down that route, then note that a PAR2 doesn't need to cover the entire data set - you can choose to have multiple PAR2s cover subsets.
But I don't believe this is mathematically possible without recreating the entire PAR set?
The feature largely acts like separate PAR archives, where one declares another to be its parent, so it's as possible as it is to create completely separate PAR sets.
There are downsides to the approach, such as redundancy not being shared across the whole archive, and the structure could get unwieldy if it's extended a lot (since it's essentially a linked list). A client could always choose to sidestep the feature and just update the original PAR3 file instead, which wasn't really possible under PAR2 (PAR2's strict file ordering made it infeasible in most cases).
I would just ship my archive with a little script that creates these if necessary?
Remember that this is just the specification, which defines the maximum extent of what is allowed. You'd ultimately be interacting with a client, which will be responsible for catering for your use case (as long as the specification doesn't restrict its ability to do so).
If you don't want symlinks/hardlinks to be represented, ideally you'd tell your client such (if its default behaviour isn't what you want) and it'd just ignore those attributes when creating the PAR3.
I don't understand what the use case for this would be? We can already zip up PAR3 files, but why would we ever do this?
An example might be a program which creates an archive of files with redundancy (like the "new archive format" you mention later). Currently this requires two steps, and creates multiple files, whereas the change could allow for it to happen in one step, and bundled into a single file.
The user is going to just want a sane default field setting that's fast on x86. No need to make it more complicated?
Again, it'll be up to the client to determine how this is handled/presented to the end user, the specification just enables the possibility.
is PAR3 just 'better than' PAR2 for the same use cases?
It supports all the features that PAR2 has, so other than implementation complexity, it theoretically shouldn't be any worse :)
I was hoping that clients would use it and that Usenet would stop using RAR to split files.
RAR is mostly used because people are stubborn and blindly follow 20 year old tutorials/advice which recommend doing it.
Embedding files within PAR isn't helpful to the Usenet use-case (in fact, it'd be detrimental, because downloaders only download PAR files if they detect damage).
Wow! @animetosho I love your rant/explanation about RAR!!! I'd be happy if we convince people that Par3 means you don't need RAR. ... even if the truth is that they didn't need RAR in the first place!
Thanks @Yutaka-Sawada , I will change the standard back to 2-bytes for filename/dirname length and 4-bytes for path length.
I took another look at CRCs. I hated that the XZ one does the stupid bit twiddling at the start and end. It may make it mathematically meaningful, but it doesn't add anything to its error detection.
It looks like the CRC-64-ECMA is the same as XZ, but without the bit twiddling. ("ECMA-182"). So we could use that... but it is the larger polynomial.
The CRC-64-ISO has the small polynomial, but does the bit twiddling.
It took me a while to discover that the polynomial in the reference code for XZ, 0xC96C5795D7870F42, is actually the bit-reverse of the polynomial given on this page and this page, 0x42f0e1eba9ea3693.
You can see this when we write the XZ code's polynomial in binary, 1100100101101100010101111001010111010111100001110000111101000010, alongside the website's polynomial in binary, 0100001011110000111000011110101110101001111010100011011010010011. It takes some time, but you can check that they are reverses of each other.
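If anyone wants to check the claim mechanically rather than by eye, a small sketch (the function name is ours) that bit-reverses a 64-bit value confirms the two constants are mirror images:

#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 64-bit value. */
static uint64_t bitrev64(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

int main(void)
{
    /* XZ's reflected constant is the website's polynomial, bit-reversed. */
    assert(bitrev64(UINT64_C(0x42f0e1eba9ea3693)) == UINT64_C(0xC96C5795D7870F42));
    return 0;
}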
So, I was able to find the small ISO polynomial on those web pages, 0x000000000000001b, whose bit-reversed value is 0xD800000000000000. (You can also find this constant in this source code.)
Should we go with the ISO polynomial and no bit twiddling?
That would mean the XZ Code with
static const uint64_t poly64 = UINT64_C(0xD800000000000000);
and the lines
uint64_t
crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;    /* initial bit flipping */
    for (size_t i = 0; i < size; ++i)
        crc = crc64_table[buf[i] ^ (crc & 0xFF)]
                ^ (crc >> 8);
    return ~crc;   /* final bit flipping */
}
replaced by
uint64_t
crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    for (size_t i = 0; i < size; ++i)
        crc = crc64_table[buf[i] ^ (crc & 0xFF)]
                ^ (crc >> 8);
    return crc;
}
How does that sound?
@Yutaka-Sawada Do you think it will make the code faster (or easier to write)?
I hated that the XZ one does the stupid bit twiddling at the start and end. It may make it mathematically meaningful, but it doesn't add anything to its error detection.
A practical reason for a non-zero initial value for CRC is to detect zero bytes at the start. With 0 as the initial value, the CRC just ignores any zero bytes at the beginning, i.e. the CRC remains 0 until the first non-zero byte.
But my understanding is that this feature isn't useful when using a fixed block size, as a rolling CRC usually does. (It is useful when you have two blocks of different length which have the same content, except for the number of zero bytes at the beginning.)
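To make the point concrete, here is a minimal bit-at-a-time sketch of CRC-64-ISO with a zero initial value and no final inversion (the function name is ours); with it, a buffer with leading zero bytes gets exactly the same CRC as the buffer without them:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-64-ISO (polynomial 0xD800000000000000) with zero
   initial value and no final inversion, bit-at-a-time for clarity. */
static uint64_t crc64_iso_raw(const uint8_t *buf, size_t size)
{
    uint64_t crc = 0;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ ((crc & 1) ? UINT64_C(0xD800000000000000) : 0);
    }
    return crc;
}

int main(void)
{
    const uint8_t with_zeros[] = { 0, 0, 'a', 'b', 'c' };
    const uint8_t without[]    = { 'a', 'b', 'c' };
    /* Leading zero bytes are invisible when the initial value is zero. */
    assert(crc64_iso_raw(with_zeros, sizeof with_zeros) ==
           crc64_iso_raw(without, sizeof without));
    return 0;
}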
That would mean the XZ Code with ...
In addition to bit-complement at beginning/end, other difference between CRC-64/XZ and CRC-64/ECMA-182 is bit order (i.e. refin
and refout
parameters in CRC catalogue).
Should we go with the ISO polynomial and no bit twiddling?
When CLMUL is available on the CPU, there is no speed difference between polynomials. Only when CLMUL isn't available is the CRC-64-ISO polynomial faster than the CRC-64-XZ one. But PAR3 isn't designed for obsolete CPUs, so there is no big reason to select a polynomial by speed.
By the way, it's possible to calculate CRC-64-ISO without a table lookup. I put my hash testing application (TestHash_2022-01-11.zip) in the "tool" folder on OneDrive. There are some implementations of hash functions; you may read the source code. Sample code for CRC-64-ISO is like below.
uint64_t
crc64(const uint8_t *buf, size_t size, uint64_t crc)
{
    uint64_t A;
    crc = ~crc;                   /* initial bit flipping */
    for (size_t i = 0; i < size; ++i) {
        A = crc ^ buf[i];
        A = A << 56;              /* keep only the combined low byte, moved to the top */
        /* (crc >> 8) combined with the 8 bit-steps of the sparse ISO
           polynomial x^64 + x^4 + x^3 + x + 1, applied at once */
        crc = (crc >> 8) ^ A ^ (A >> 1) ^ (A >> 3) ^ (A >> 4);
    }
    return ~crc;                  /* final bit flipping */
}
About bit flipping, I don't like the useless factor. The initial & final bit flipping affects CRC calculation when sliding a window. You may see the file "crc.h" of par2cmdline. There is a value windowmask for sliding the CRC window.
u32 windowmask = ComputeWindowMask(window);
crc = windowmask ^ CRCSlideChar(windowmask ^ crc, buffer[window], buffer[0], windowtable);
This windowmask is the result of the bit flipping in CRC-32. If you omit the initial & final bit flipping from the CRC-32 value, it won't require windowmask when sliding, like below.
crc = CRCSlideChar(crc, buffer[window], buffer[0], windowtable);
So, slide speed may become a little faster. But the difference would be very small and ignorable. While I removed the effect of bit flipping from CRC-32 before verification in my PAR2 client internally, I'm not sure it is worth the effort.
If you want to slide CRC-64 over an arbitrary window size (block size, 16 KB, or 40 bytes) in PAR3, three windowmasks will be required to search for complete blocks, misnamed files, or tail chunks. Though it adds more complexity, the difference is very small. Technically, there is no problem at all.
Do you think it will make the code faster (or easier to write)?
No, it won't be remarkable. If you think that CRC-64-XZ is good as a definition, that's OK. If you want a simple construction, using the CRC-64-ISO polynomial without bit flipping is the simplest. But I cannot say which it should be. There is no difference in speed nor any noticeable complexity. You should consider factors other than speed. I'm sorry for confusing you with my old post.
BTW, did you see that there is a new version of BitTorrent?
New hash function. New directory structure. Sound familiar? :)
New hash function. New directory structure. Sound familiar? :)
Though likely for different reasons.
Hash function was changed from SHA1 to SHA256 due to security issues in the former. Security is actually more important in BT because you're taking data from untrustworthy peers.
I believe libtorrent's implementation is using the latest draft spec (from 2018), which was probably written before BLAKE3/K12 and similar, so uses the rather slow SHA256, though BT isn't otherwise CPU intensive and would likely be limited by network bandwidth.
The directory structure change is to avoid duplication in long/commonly-used path names. Although likely beneficial to PAR3 (though less so due to the various overheads), I believe your key motivator was to support empty directories.
BTv2 solves a rather annoying issue of BTv1, which didn't have stable per-file hashes, which I personally consider to be the biggest win.
this is probably a good place to recommend that users be warned when they create Par3 files containing names that are incompatible with Windows, Mac, or Linux systems.
About invalid filenames on Windows OS, there seem to be some more. I post some examples of them below.
Reserved filenames with an extension are invalid, too, such as "CON.txt". The filename check is case insensitive: "con" and "con.txt" are invalid, too. Windows Explorer cannot handle them.
A filename with a period at the end is invalid, too, such as "something.". The last period is trimmed by Windows Explorer on renaming, so users cannot make such an invalid filename normally. But Windows Explorer cannot handle them.
A filename with a space at the start or end is invalid, too, such as " something" or "something ". These spaces seem to be trimmed by Windows Explorer automatically, so users may not see such invalid filenames normally.
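For illustration, a rough sketch of the kind of warning check a client could run before adding a name to a PAR3 set (the helper names are ours, and the rules follow Microsoft's file-naming documentation; the list is not exhaustive):

#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters. */
static bool ieq_n(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (toupper((unsigned char)a[i]) != toupper((unsigned char)b[i]))
            return false;
    return true;
}

/* Rough, non-exhaustive check for names Windows cannot handle:
   reserved device names (with or without an extension, case-insensitive),
   and names with a leading space or a trailing space/period. */
static bool is_problematic_on_windows(const char *name)
{
    static const char *reserved[] = {
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
    };
    size_t len = strlen(name);
    if (len == 0 || name[0] == ' ')
        return true;
    if (name[len - 1] == ' ' || name[len - 1] == '.')
        return true;
    size_t stem = strcspn(name, ".");            /* length before any extension */
    for (size_t i = 0; i < sizeof(reserved) / sizeof(reserved[0]); ++i)
        if (stem == strlen(reserved[i]) && ieq_n(name, reserved[i], stem))
            return true;
    return false;
}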
Okay, it sounds like we have a definition of CRC-64-ISO. And, while the bit inversion at start and end is annoying, it does serve a purpose and it is possible to optimize most of the cost away. I'll change the specification to CRC-64-ISO.
Are there any other open issues that need discussion?
If there are no open issues, I will:
We still need someone to code the reference implementation. I'm willing to help, with unit tests and other stuff, but I don't have time to code it.
Are there any other open issues with the specification? Is anyone interested in helping code the reference implementation?
Are there any other open issues with the specification?
I'm not sure whether you already answered this, but Markus Laire asked the question below a while ago:
btw, will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?
I think that a PAR3 client developer can define new Matrix Packets for his special Recovery Codes. Though the name is Matrix Packet, it is really a definition of how to make the recovery data.
Is anyone interested in helping code the reference implementation?
If nobody else writes it, I can try. While it is difficult for me to implement PAR3's new Recovery Codes, I can copy most basic features from my PAR2 client. But, as I use Microsoft's C language and C runtime library for Windows OS, my source code may be incompatible with other environments.
btw, will Par3 support only slow matrix-based O(n^2) Reed-Solomon or is faster O(n log n) Reed-Solomon also supported?
Good point.
I looked at the paper on FFT Reed Solomon by Lin, Chung and Han. It is implemented by the library Leopard on GitHub.
This algorithm uses the normal Galois Fields and should be compatible with the current Par3 specification.
I don't know what code matrix it is using. I've emailed the author of Leopard and one of the authors of the paper, to see if they can define the matrix.
If I hear back, I'll replace the Cauchy matrix with it.
Is anyone interested in helping code the reference implementation?
If nobody else writes it, I can try. While it is difficult for me to implement PAR3's new Recovery Codes, I can copy most basic features from my PAR2 client. But, as I use Microsoft's C language and C runtime library for Windows OS, my source code may be incompatible with other environments.
That's great! I'll be glad to help make the code run with GCC on Linux. As well as help with unit tests.
I'm done converting the specification to Markdown. It uses GitHub's own dialect of Markdown, because default Markdown does not support tables. It is attached.
I've generated the HTML and made the changes for the website, but we'll have to wait for Ike to accept my pull request before you can see them.
So excited! A great milestone! I can't wait to see the reference implementation!
Oh, also, how do you want to be credited in the specification? @Yutaka-Sawada @animetosho @malaire
We have a new repo for Par3!
I'm moving discussion of the specification there. Post all future messages to:
Hi everyone,
I wrote the specification for Par2 a long time ago. I'm working on the code for a new version of Par. It will include:
I've spent a week learning the code. I've written unit tests for some of the existing code. The tests should allow me to modify the code without breaking it. The unit tests should be run as part of "make check" but I don't know how to add them. (I've never learned Automake). Can anyone explain how?
I also plan on writing a diff tool that can compare Par files to make sure the packets are bit-for-bit identical. I'll use this to make sure that my changes haven't affected the program's output for version 2 of the specification.
I plan on adding a "doc" directory, which will contain the old Par2 specification and the new specification.
The Tornado Codes will need a predictable pseudo-random number generator. I expect I will use a version of a Linear Congruential Generator.
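For what it's worth, a minimal sketch of such a generator is below (the multiplier and increment are Knuth's MMIX constants, used only as an example; whatever constants the spec fixes would have to be identical in every client so the generated matrix is reproducible):

#include <stdint.h>

/* Deterministic 64-bit linear congruential generator:
   state' = state * a + c (mod 2^64).  Constants here are Knuth's MMIX
   example values, not a decision for the spec. */
typedef struct { uint64_t state; } lcg64;

static void lcg64_seed(lcg64 *g, uint64_t seed)
{
    g->state = seed;
}

static uint64_t lcg64_next(lcg64 *g)
{
    g->state = g->state * UINT64_C(6364136223846793005)
             + UINT64_C(1442695040888963407);
    return g->state;
}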
The big question I have is: what do we name the next version and do we want to add a new file extension? At this moment, I plan on keeping all of Par2's packets and just adding new recovery packets. This will mean that par2 clients will still be able to verify the file, but will not be able to fix it. Unfortunately, par2cmdline currently silently ignores any packet type it does not recognize. So, existing users won't know why they cannot fix it. I would normally call the new specification Par2.1 or Par3, except the name "Par3" has been used by the developer of MultiPar. Perhaps we should call it "Par4"?
When we decide on a new name, I'll push a new branch and everyone can take a look at the spec/code.
Mike