Parchive / par2cmdline

Official repo for par2cmdline and libpar2
http://parchive.sourceforge.net
GNU General Public License v2.0

Repair files created are twice the size expected from the redundancy setting #121

Open jonas-eschmann opened 6 years ago

jonas-eschmann commented 6 years ago

Hi, I'm not sure whether this is the right place to ask a question like this; if it isn't, please let me know and I'll delete this post ;)

The question/issue I'm having: if I generate an arbitrary file of exactly 10,000,000 bytes and create the repair files with -r10 for 10%, I get 8 repair files with a total size of 2,424,932 bytes. That's about 24% of the original data as redundancy, right? If I try to limit the output with -rm1, it produces files with exactly the same sizes. So it seems that to make x% redundancy available, the repair files need to be about c * x% of the original file size. Is there a heuristic for guesstimating this factor c? It would be quite handy if I could at least roughly estimate it before running the whole procedure on a file of, say, 70GB. How would one approach such a situation?

Nevertheless, thanks a lot for your effort in making this great application available!

Edit: I created a small node.js script for generating and messing with data files:

var fs = require("fs")

// argv[2] = number of clean bytes, argv[3] = number of trailing bytes
// that each have a 0.01% chance of being "flipped" from "a" to "b"
var cleanBytes = parseInt(process.argv[2])
var messyBytes = parseInt(process.argv[3])

var writeString = ""
for (var i = 0; i < cleanBytes; i++) {
    writeString += "a"
}

var charflips = 0
for (var i = 0; i < messyBytes; i++) {
    if (Math.random() < 0.0001) {
        writeString += "b"
        charflips++
    } else {
        writeString += "a"
    }
}

console.log("Did " + charflips + " charflips")
fs.writeFileSync("out.txt", writeString)

usage: node testPar.js [cleanBytes] [messyBytes]

e.g.

node testPar.js 10000000 0 (generates 10,000,000 clean bytes)
par2 create -r10 out.par2 out.txt
node testPar.js 9000000 1000000 (generates 9,000,000 clean bytes followed by 1,000,000 possibly messy bytes)
par2 verify out.par2

animetosho commented 6 years ago

The 10% refers to the amount of recovery data. There will always be additional metadata and overhead, which means that the final PAR2 will be larger than just the recovery data.

That's about 24% of the original data as redundancy, right?

No, you only have 10% redundancy: the actual recovery data is 200 blocks × 5,000 bytes = 1,000,000 bytes, exactly 10% of the file; the rest of the 2,424,932 bytes is overhead.

By default, par2cmdline uses 2000 input blocks and generates 200 recovery blocks. For a 10MB file, that's 5000 bytes per block (which is rather small). Each recovery block needs a block header, but most of your overhead is likely from the IFSC packet, which contains checksums for each input block (if my memory is correct, MD5 and CRC32, so 20 bytes per input block, or 40KB for 2000 inputs). Furthermore, this is considered a 'critical packet' and must be duplicated multiple times to avoid corruption breaking it.

In short, because you're using very small block sizes, the PAR2 size will be dominated by overhead.

If you wish to calculate the final size of the PAR2, you can read the PAR2 specification. The format is fairly strictly defined, so you can get a rough idea from that. Unfortunately, it defines neither the scheme for duplicating critical packets nor how the output must be split across files, which means the exact size (and number of files) is implementation-specific, and you'll need to look at par2cmdline's code to see exactly what it does.
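To give a feel for the calculation, here's a minimal back-of-the-envelope sketch in node.js. The per-packet numbers (64-byte packet header, 4-byte recovery slice exponent, 20 bytes of MD5+CRC32 checksums per input slice) come from the spec; the critical-packet duplication factor and the size of the remaining small packets are guesses/parameters, since the spec doesn't pin them down:

// Rough PAR2 size estimate for a single input file. Packet layouts
// follow the PAR2 spec; 'copies' (critical packet duplication) and
// 'otherPackets' are assumptions you'd calibrate against real output.
function estimatePar2Size(fileSize, blockSize, recoveryBlocks, copies) {
    var inputBlocks = Math.ceil(fileSize / blockSize)

    // Recovery slice packets: 64-byte header + 4-byte exponent + one
    // block of recovery data each.
    var recoveryData = recoveryBlocks * (64 + 4 + blockSize)

    // IFSC packet: header + 16-byte file ID + (MD5 + CRC32) per slice.
    var ifsc = 64 + 16 + 20 * inputBlocks

    // Main, file description and creator packets are small; treat them
    // as a rough constant here (an assumption, not from the spec).
    var otherPackets = 300

    return recoveryData + copies * (ifsc + otherPackets)
}

// 10MB file, 5000-byte blocks (the default 2000 input blocks),
// 200 recovery blocks from -r10, and a guessed duplication factor:
console.log(estimatePar2Size(10000000, 5000, 200, 8))

Comparing such an estimate against an actual run (like the 2,424,932-byte result above) lets you back out the effective duplication factor for your block counts.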

Alternatively, since it looks like you're using node.js, you may be able to use some of the code from my ParPar client. There's no documentation at the moment, but if you're willing to follow the code, you could create a PAR2Gen instance and then sum up the sizes of all the packets from PAR2Gen.recoveryFiles, roughly as sketched below.
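A rough sketch of the idea (from memory, so treat the module path, constructor arguments and the packet object layout as assumptions to verify against the code):

// Assumed export name/path -- check ParPar's source for the real one.
var PAR2Gen = require('parpar').PAR2Gen

// Assumed constructor shape: file list plus options such as slice size
// or recovery percentage; again, verify against the source.
var gen = new PAR2Gen(['out.txt'], {})

// Sum the sizes of all packets across all planned recovery files.
// The 'packets' array and 'size' property are assumptions as well.
var total = 0
gen.recoveryFiles.forEach(function (file) {
    file.packets.forEach(function (pkt) {
        total += pkt.size
    })
})
console.log('Estimated PAR2 output: ' + total + ' bytes')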

Hope that helps.