Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Unintuitive Behavior: Recovery Packets are 68 bytes larger than block size #159

Open EmperorArthur opened 3 years ago

EmperorArthur commented 3 years ago

Hello,

This may not be a "bug," as the code is (probably) working correctly. However, the output does not perform as the documentation says it should.

While a recovery block is created with the size specified, the packet that carries it also includes the 64 byte packet header plus a 4 byte exponent. This means the created packet will not fit into a given amount of space unless those 68 bytes are taken into account.

> When posting on UseNet it is recommended that you use the -s option to set a blocksize that is equal to the Article size that you will use to post the data file. If you wanted to post the test.mpg file using an article size of 300 KB then the command you would type is:
>
> par2 create -s307200 test.mpg.par2 test.mpg

Background

I have recently begun work on (yet another) system of backing up data via QR codes. As part of my research, I examined different ways of effectively storing metadata along with the ability to recover errors. This led me to both "tar" and this project. Tar stores data in 512 byte chunks. However, in attempting to make sure that my recovery data was as resilient as possible and everything was packed properly, I discovered that I had to set -s444 instead of the expected 512!
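
The arithmetic behind the 444 figure can be sketched as follows. The 64 byte header and 4 byte exponent come from the discussion above; the function names are illustrative only and are not part of par2cmdline.

```python
PACKET_HEADER_BYTES = 64  # fixed PAR2 packet header
EXPONENT_BYTES = 4        # exponent field that precedes the recovery data
RECOVERY_OVERHEAD = PACKET_HEADER_BYTES + EXPONENT_BYTES  # 68 bytes


def recovery_packet_size(block_size: int) -> int:
    """Size on disk of one recovery packet for a given -s block size."""
    return block_size + RECOVERY_OVERHEAD


def block_size_for_packet(target_packet_size: int) -> int:
    """Largest -s value whose recovery packet fits in target_packet_size."""
    block_size = target_packet_size - RECOVERY_OVERHEAD
    block_size -= block_size % 4  # PAR2 block sizes must be a multiple of 4
    if block_size <= 0:
        raise ValueError("target is too small to hold a recovery packet")
    return block_size


print(recovery_packet_size(307200))  # the Usenet example from the README
print(block_size_for_packet(512))    # the tar-sized chunk from this issue
```

With a 512 byte target this gives the -s444 value mentioned above, and the README's -s307200 example produces recovery packets of 307268 bytes.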

EmperorArthur commented 3 years ago

Note: This may not be relevant unless someone is at the exact edge of what can be uploaded at once. 68 bytes does not mean much when working in 1k chunks, but at the extremes this does become an issue.

animetosho commented 3 years ago

It sounds like the quote you included matches up with the behaviour. The point of the recovery packet is not to fit in the space specified, it's to cater for damage to the size specified.

For the Usenet application, the purpose is to cater for articles which go missing. If articles are 300KB long (pre-yEnc), then a 300KB block size is optimal, as one missing article only needs one recovery block.
The actual recovery packet within the PAR2 file will, of course, be 300KB + 68 bytes. The fact that this doesn't fit within a Usenet article doesn't matter.

PAR2 isn't really designed for fitting within a given block size. You've found one case, but another would be other packets which must be in a PAR2 file, which don't have size/positioning guarantees.
The specification doesn't specify how packets should be ordered, or duplicated, so it's entirely possible that you get, say, a filename packet right at the beginning of the PAR2 file, meaning that no subsequent recovery packet will align to any target block boundary.
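
A minimal sketch of how a leading packet shifts everything after it. The header layout follows the PAR2 packet format (8 byte magic, 8 byte little-endian length, 16 byte hash, 16 byte set ID, 16 byte type). The 120 byte file description body, the 512 byte target and the `fake_packet` helper are made up for the demonstration, and the hash is not verified here.

```python
import struct

MAGIC = b"PAR2\x00PKT"
HEADER_BYTES = 64


def fake_packet(packet_type: bytes, body: bytes) -> bytes:
    """Build a structurally valid packet with zeroed hash and set ID."""
    length = HEADER_BYTES + len(body)
    return MAGIC + struct.pack("<Q", length) + bytes(32) + packet_type + body


def packet_offsets(data: bytes):
    """Return (offset, length, type) for every packet header found."""
    found, pos = [], 0
    while True:
        pos = data.find(MAGIC, pos)
        if pos < 0 or pos + HEADER_BYTES > len(data):
            return found
        (length,) = struct.unpack_from("<Q", data, pos + 8)
        if length < HEADER_BYTES or pos + length > len(data):
            pos += 1  # not a usable header, keep scanning
            continue
        found.append((pos, length, data[pos + 48:pos + 64]))
        pos += length


# One metadata packet followed by two recovery packets made with -s444.
par2 = (fake_packet(b"PAR 2.0\x00FileDesc", bytes(120))
        + fake_packet(b"PAR 2.0\x00RecvSlic", bytes(4 + 444))
        + fake_packet(b"PAR 2.0\x00RecvSlic", bytes(4 + 444)))

for offset, length, packet_type in packet_offsets(par2):
    print(offset, length, offset % 512, packet_type[8:])
```

Each recovery packet is exactly 512 bytes long, yet neither starts on a 512 byte boundary because of the 184 byte packet in front of them.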

EmperorArthur commented 3 years ago

> The specification doesn't specify how packets should be ordered, or duplicated, so it's entirely possible that you get, say, a filename packet right at the beginning of the PAR2 file, meaning that no subsequent recovery packet will align to any target block boundary.

I was actually about to open that as a feature request. I personally don't need it for my niche application, since I have already written a basic Python par2 parser that aligns everything and uses Bin Packing and padding. However, it would be useful for other things.
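
A rough sketch of the padding idea, not the parser described above: every packet starts on a chunk boundary, with filler in between. Zero bytes as filler and the `align_packets` name are assumptions, and no bin packing is attempted.

```python
from typing import Iterable


def align_packets(packets: Iterable[bytes], chunk_size: int) -> bytes:
    """Concatenate packets so that each one starts on a chunk boundary."""
    out = bytearray()
    for packet in packets:
        out += bytes(-len(out) % chunk_size)  # filler up to the next boundary
        out += packet
    out += bytes(-len(out) % chunk_size)      # fill out the final chunk
    return bytes(out)


# A 184 byte metadata packet followed by two 512 byte recovery packets.
aligned = align_packets([b"m" * 184, b"r" * 512, b"s" * 512], 512)
print(len(aligned))
```

Packing several small packets into one chunk would waste less space, at the cost of a more involved layout step.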

The downside of such a scheme is that, even in the best case, each piece of metadata balloons to at least one full chunk, and I know larger par2 archives can have multiple copies of the metadata. That's not really a problem when working at smaller sizes like I am, but becomes an issue at larger recovery block sizes.


Par2 has, obviously, worked well in its use case for years. However, at least noting the block size issue in the documentation and having some sort of alignment option may help with some of the use cases mentioned in the Readme.

For example, storing parity information on DVDs and Blu-Rays is explicitly mentioned as an application. Having the option of creating a file where data falls neatly into a DVD's 2KiB sectors could aid in data recovery.

Edit:

First, I would like to confirm that par2cmdline will happily read files with padding added between packets. So, good job there!

Second, you can actually think of my application as the same as the DVD example, but with far less data and a ludicrous amount of redundancy. It's just the whole 68 bytes thing is a "gotcha" until you read the spec.

Which thank you, by the way. Compared to most specification documents I have read, the file format and associated cpp are extremely clear and easy to understand.

animetosho commented 3 years ago

I don't really know your specific application, but the example you give sounds a bit off to me.

I see little reason why you'd specifically want the recovery blocks with metadata to be 2KB for your DVD example. Generally, you want the input block size to be 2KB, and hence the recovery block should be greater than 2KB.
If you set the input block size to 1980 bytes, to ensure the recovery block + metadata is 2048 bytes, then a single damaged sector will require multiple recovery blocks to fix, i.e. reduces the efficiency of PAR2.
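
The efficiency loss can be illustrated by counting how many input blocks one damaged sector overlaps. This sketch assumes the protected data starts on a sector boundary and is laid out contiguously. The function name is illustrative.

```python
def blocks_touched(start: int, length: int, block_size: int) -> int:
    """Number of input blocks overlapped by a damaged byte range."""
    first = start // block_size
    last = (start + length - 1) // block_size
    return last - first + 1


SECTOR = 2048
for block_size in (2048, 1980):
    counts = {blocks_touched(k * SECTOR, SECTOR, block_size)
              for k in range(1000)}
    print(block_size, sorted(counts))
```

With a 2048 byte block size every damaged sector costs one recovery block. With 1980 byte blocks it always costs at least two, and some sectors straddle three.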

PAR2 probably isn't great with really small amounts of data. For one, 68 bytes out of 512 is quite some overhead. For another, there's a limit of 32768 input blocks, so if you're using a 444 byte block size, the most data you could feed PAR2 would be 13.875MB.
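
Both figures are easy to check. The 32768 block limit and the 68 byte overhead are taken from the comments above; the helper names are illustrative.

```python
MAX_INPUT_BLOCKS = 32768
RECOVERY_OVERHEAD = 68  # 64 byte header + 4 byte exponent


def max_protected_bytes(block_size: int) -> int:
    """Most input data PAR2 can cover at a given block size."""
    return MAX_INPUT_BLOCKS * block_size


def overhead_fraction(packet_size: int) -> float:
    """Share of a recovery packet that is not recovery data."""
    return RECOVERY_OVERHEAD / packet_size


print(max_protected_bytes(444) / 2**20)  # in units of 1024 * 1024 bytes
print(overhead_fraction(512))
```

The 13.875 figure is in units of 1024 × 1024 bytes, and the overhead at a 512 byte packet size comes to a little over 13%.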

EmperorArthur commented 3 years ago

Ouch. You are correct.

It's just unfortunate that a tag is both needed, and makes alignment difficult.

Personally, I feel that just giving up completely is also not the best approach though. The question is always data amount vs integrity, and I believe alignment can help.

I'll look at doing a quick PR for something in the docs directory if that's okay. That way this isn't lost on anyone else attempting to use par2 like I am.

animetosho commented 3 years ago

I don't know your exact aim, but from what you've indicated, what you want may be possible, though probably not through PAR2.

For disclosure purposes, I don't maintain this repo or code base, so have no say over what will be accepted in terms of PRs.

mdnahas commented 3 years ago

I agree this is an issue. We need to talk about an "internal block size", that Par2 uses for its calculations, and an "external block size", that the packets need to fit into. Our users probably care more about the external block size, than the internal.

I'm still thinking about Parchive version 3 at the moment. We should mention this in the new spec.