animetosho / ParPar

High performance PAR2 create client for NodeJS

Recovery not possible from (large) generated par2 files #26

MeFri closed this issue 3 years ago

MeFri commented 3 years ago

After generating a large par2 file (from 3 large input files) I can't seem to recover data.

Par2 file generation with ParPar:

$ parpar.js -s1000 -r$PARRATIO% -m 8G -F1 -o pardata.par2 data*
Calculating parity information...
Multiply method used: Shuffle2x (AVX2), 12 threads
Generating 630.16 GiB recovery data (100 slices) from 6297.68 GiB of data
Calculating: 100.00%
PAR2 created. Time taken: 39234.39 second(s)

Damage one of the input files:

dd if=/dev/zero of=data02 conv=notrunc bs=1M count=300k

Recovery with par2cmdline:

$ par2cmdline/par2 v pardata.par2
Loading "pardata.par2".
Loaded 8 new packets
Loading "pardata.vol000+100.par2".
Loaded 100 new packets including 100 recovery blocks
Loading "pardata.par2".
No new packets found

There are 3 recoverable files and 0 other files.
The block size used was 6766304348 bytes.
There are a total of 1000 data blocks.
The total size of the data files is 6762085150752 bytes. 

Verifying source files:

Opening: "data01"
Opening: "data00"
Target: "data01" - found.
Target: "data00" - found.
Opening: "data02"
Target: "data02" - damaged. Found 216 of 264 data blocks.

Scanning extra files:

Repair is required.
1 file(s) exist but are damaged.
2 file(s) are ok.
You have 952 out of 1000 data blocks available.
You have 100 recovery blocks available.
Repair is possible.
You have an excess of 52 recovery blocks.
48 recovery blocks will be used to repair.
Command exited with non-zero status 1

Recovery with par2tbb:

$ par2tbb/par2 v pardata.par2 
par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements.
Modifications for concurrent processing, Unicode support, and hierarchial
directory support are Copyright (c) 2007-2009 Vincent Tan.
Concurrent processing utilises Intel Thread Building Blocks 2.0,
Copyright (c) 2007-2008 Intel Corp.
Executing using the 64-bit x86 (AMD64) instruction set.

par2cmdline comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it and/or modify
it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version. See COPYING for details.

Processing verifications and repairs concurrently.
Loading "pardata.par2".
Loaded 8 new packets

There are 3 recoverable files and 0 other files.
The block size used was 6766304348 bytes.
There are a total of 1000 data blocks.
The total size of the data files is 6762085150752 bytes.

Verifying source files:

Could not read 13532608696 bytes from data02 at offset 0
Could not read 13532608696 bytes from data00 at offset 0
Could not read 13532608696 bytes from data01 at offset 0

Recovery with par2j:

$ wine multipar/par2j64.exe v pardata.par2 
fixme:heap:HeapSetInformation 0x3c4000 0 0x22fe10 4
Parchive 2.0 client version 1.3.1.1 by Yutaka Sawada

fixme:file:GetLongPathNameW UNC pathname L"\\\\?\\D:\\pardata.par2"
Base Directory  : "D:\"
Recovery File   : "D:\pardata.par2"
CPU thread      : 12 / 12
CPU cache       : 768 KB per set
CPU extra       : x64 SSSE3 CLMUL AVX2
Memory usage    : Auto (115751 MB available)

PAR File list :
         Size :  Filename
        20876 : "pardata.par2"
 676630566276 : "pardata.vol000+100.par2"

PAR File total size     : 676630587152
PAR File possible count : 2

valid file is not found

Are there limits on file size for par2 files? The same procedure works for smaller (e.g. 5GB) files.

animetosho commented 3 years ago

Thanks for reporting that.

From the information, it looks like your slice size is around 6.3GB. I've had trouble testing slices larger than 4GB: my current test script compares ParPar's output against par2cmdline's, and par2cmdline limits slices to 4GB when creating (par2j limits them to 1GB). These limits apply on create, so I don't know whether they affect repair in any way, but it could be a sign.

The PAR2 format doesn't impose these limits, and ParPar doesn't hard limit slice size, though its handling of large slices currently isn't great. I do plan to improve that, but no PAR2 tool currently handles large slices well.

My recommendation would be to try to keep slices below 4GB if possible. In your example, you've got 6.3TB of input data, spread across 1000 slices, giving the 6.3GB slice size. If you increase the slice count to something like 2000 slices, it should put it under 4GB.
By the way, if you want to test this theory quickly without having to wait hours, you can use smaller dummy files (e.g. a 16GB source file), generate 2 recovery slices and select the slice size you want to test against.
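
For the quick test suggested above, a dummy source file can be produced with a few lines of Python. This is only a sketch: the function name `write_dummy_file`, the chunk size and the choice of random data are my own assumptions, not something ParPar provides or requires.

```python
import os


def write_dummy_file(path, size_bytes, chunk_bytes=64 * 1024 * 1024):
    """Write size_bytes of random data to path, one chunk at a time.

    Hypothetical helper for building a test input; writing in chunks keeps
    memory use low even when the file is many GiB in size.
    """
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(os.urandom(n))  # random bytes, so the file is not all zeros
            remaining -= n
    return os.path.getsize(path)


# Example (not run automatically, since it writes 16 GiB to disk):
# write_dummy_file("dummy.bin", 16 * 1024**3)
```

Once the file exists, create the PAR2 set with the slice size under test and 2 recovery slices, damage part of the file, and run a verify/repair with the client you care about.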

animetosho commented 3 years ago

So I've done some testing with large slices:

par2j64 v1.3.1.2: 1GB slices work, 1GB + 4 byte slices fail. So basically, it adheres to its own maximum slice size.
par2cmdline v0.8.0 Windows x64: 2GB - 4 bytes works, 3GB fails (this includes PAR2 files created by par2cmdline itself).
par2cmdline v0.8.1 Linux x64: 3GB works, 6GB fails.

So it seems that keeping it under 4GB isn't always enough. Going above 1GB is enough to cause incompatibilities with some applications. I haven't tested 32-bit versions or older applications (like par2cmdline forks such as par2tbb), but since the limits above all fall within the 32-bit range, I'm hoping they behave the same.
I'll add a warning to ParPar if a slice size above 1GB is chosen. For your particular case, if you want maximum compatibility, you'll have to increase the slice count so that the slice size falls under 1GB. Otherwise, par2cmdline can probably go up to 4GB minus 4 bytes.

Interestingly, it looks like you can create PAR2 files with slices exceeding 4GB in par2cmdline, if you set it up so that you don't specify the slice size directly.
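
To put numbers on the slice counts discussed above, here is a rough sketch of the arithmetic using the total size from the original report. It assumes "1GB" and "4GB" mean 2^30 and 2^32 bytes, and it only uses the total size, so treat the result as a lower bound: each file occupies a whole number of slices, which can add up to one extra slice per input file. The names `min_slice_count` and `LIMITS` are made up for illustration.

```python
def min_slice_count(total_bytes, max_slice_bytes):
    """Smallest slice count that keeps the slice size at or below the limit.

    PAR2 slice sizes must be a multiple of 4 bytes, so the limit is rounded
    down to a multiple of 4 before dividing. Per-file rounding is ignored,
    so the real count may need to be slightly higher.
    """
    slice_bytes = max_slice_bytes - (max_slice_bytes % 4)
    return -(-total_bytes // slice_bytes)  # ceiling division


TOTAL = 6762085150752  # total input size reported by par2cmdline in this thread

# Assumed interpretation of the limits mentioned in the thread.
LIMITS = {
    "1GB (par2j)": 2**30,
    "4GB minus 4 bytes (par2cmdline)": 2**32 - 4,
}

for name, limit in LIMITS.items():
    print(f"{name}: at least {min_slice_count(TOTAL, limit)} slices")
```

Rounding the result up a little (say, to the next hundred) leaves headroom for the per-file rounding mentioned above.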

MeFri commented 3 years ago

Thank you for your insight and testing! Looks like I'll stick to block sizes < 1GB for better compatibility.

animetosho commented 3 years ago

Bit of an update on slice size testing, for anyone finding this thread:

Re-tested par2cmdline v0.8.1 Linux x64 (actually WSL), and 6GB slices seem to work here. I wonder if it's a RAM thing - the previous test was done using a single 16GB input file with 2 recovery slices, on a system with 16GB RAM. My re-test was done on a single 8GB input file on a system with 32GB RAM.
Haven't looked much into it, but my guess is that par2cmdline 64-bit Linux builds can handle >4GB slices provided you have sufficient RAM (and don't explicitly set a slice size as that'll hit their hard-coded 4GB check).

32-bit x86 PAR2 clients seem to be rather flaky, possibly due to limited address space. par2j seems to handle 1GB slices, but others like par2tbb don't handle them well and need smaller slice sizes. QuickPar seems to have a limit of 100,000,000 bytes (i.e. 100,000,004 byte slices cause crashes for me).
Hopefully 32-bit PAR2 clients are on their way out though.