Cannot repair file even though there are enough blocks

apalazzi commented 3 years ago

Hi,

I'm trying to restore a corrupted file, but par2repair is unable to repair it:

root@atlante:~/par2-verify# par2repair -a Full-0005 | tee repair.log
Loading "Full-0005.par2".
Loaded 4 new packets
Loading "Full-0005.vol000+001.par2".
Loaded 1 new packets including 1 recovery blocks
Loading "Full-0005.vol127+128.par2".
Loaded 128 new packets including 128 recovery blocks
Loading "Full-0005.vol015+016.par2".
Loaded 16 new packets including 16 recovery blocks
Loading "Full-0005.vol007+008.par2".
Loaded 8 new packets including 8 recovery blocks
Loading "Full-0005.vol063+064.par2".
Loaded 64 new packets including 64 recovery blocks
Loading "Full-0005.vol511+066.par2".
Loaded 66 new packets including 66 recovery blocks
Loading "Full-0005.vol031+032.par2".
Loaded 32 new packets including 32 recovery blocks
Loading "Full-0005.vol003+004.par2".
Loaded 4 new packets including 4 recovery blocks
Loading "Full-0005.vol001+002.par2".
Loaded 2 new packets including 2 recovery blocks
Loading "Full-0005.vol255+256.par2".
Loaded 256 new packets including 256 recovery blocks
Loading "Full-0005.par2".
No new packets found

There are 1 recoverable files and 0 other files.
The block size used was 9663672 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 19327343417 bytes.

Verifying source files:

Opening: "Full-0005"
Target: "Full-0005" - damaged. Found 1999 of 2000 data blocks.

Scanning extra files:

Repair is required.
1 file(s) exist but are damaged.
You have 1999 out of 2000 data blocks available.
You have 577 recovery blocks available.
Repair is possible.
You have an excess of 576 recovery blocks.
1 recovery blocks will be used to repair.

Computing Reed Solomon matrix.
Constructing: done.
Solving: done.

Wrote 19327343417 bytes to disk

Verifying repaired files:

Opening: "Full-0005"
Target: "Full-0005" - damaged. Found 1999 of 2000 data blocks.
Repair Failed.

Running with -vvvv does not add any meaningful information. Let me know if you want me to run a specific test.

Andrea
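The line "Found 1999 of 2000 data blocks" comes from comparing each block of the file against checksums stored in the PAR2 set. The sketch below illustrates that idea only. It assumes per-block MD5 values and a zero-padded final block, and it checks aligned blocks only, whereas a real client can also locate blocks at shifted offsets. The helper names `block_md5s` and `find_damaged_blocks` are hypothetical and are not part of par2cmdline.

```python
import hashlib


def block_md5s(data: bytes, block_size: int) -> list:
    """MD5 of each aligned block; a short final block is zero-padded first."""
    count = -(-len(data) // block_size)  # ceiling division
    return [
        hashlib.md5(
            data[i * block_size:(i + 1) * block_size].ljust(block_size, b"\x00")
        ).digest()
        for i in range(count)
    ]


def find_damaged_blocks(data: bytes, block_size: int, expected: list) -> list:
    """Indices of aligned blocks whose MD5 no longer matches the stored value."""
    actual = block_md5s(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Toy data: 10 bytes in 4-byte blocks gives 3 blocks, the last one padded.
original = bytes(range(10))
stored = block_md5s(original, 4)   # what a verifier would have on record
damaged = bytearray(original)
damaged[5] ^= 0xFF                 # corrupt one byte inside block 1
print(find_damaged_blocks(bytes(damaged), 4, stored))
```

For a file of the size discussed here the blocks would have to be streamed from disk rather than held in memory; the in-memory form is only for illustration.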

mdnahas commented 3 years ago

Weird.

The only suggestion I can offer is to try giving it just 1 recovery block, that is, just the file Full-0005.vol000+001.par2. If it is a problem with the Reed-Solomon matrix inversion, that will make it very simple. If that doesn't work, try one of the other files.

If those do not work, there's not much to go on. You could try downloading a different PAR recovery program. They actually use different code to do the repair, so they may work if you're hitting a bug in this version.
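Why a single recovery block makes the problem "very simple" can be sketched as follows. The assumption here is that the block with exponent 0 (the one in vol000+001) uses coefficients that are all raised to the power zero, so it reduces to the XOR of all data blocks and no real matrix inversion is needed. The toy blocks and the helper `xor_blocks` are hypothetical; actual PAR2 arithmetic works on 16-bit words in a Galois field.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data blocks and their XOR parity (stand-in for the exponent-0 block).
data_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data_blocks)

# Block 1 is lost; XOR of the survivors and the parity block rebuilds it.
survivors = [data_blocks[0], data_blocks[2]]
recovered = xor_blocks(survivors + [parity])
print(recovered == data_blocks[1])
```

If a repair fails even in this degenerate case, the inversion code is an unlikely culprit, which is the point of the suggested test.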

apalazzi commented 3 years ago

Hi,

Still no luck:

andrea@atlante:~/par2-verify$ ls
Full-0005  Full-0005.par2  Full-0005.vol000+001.par2  tmp
andrea@atlante:~/par2-verify$ par2repair Full-0005
Loading "Full-0005.par2".
Loaded 4 new packets
Loading "Full-0005.vol000+001.par2".
Loaded 1 new packets including 1 recovery blocks
Loading "Full-0005.par2".
No new packets found

There are 1 recoverable files and 0 other files.
The block size used was 9663672 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 19327343417 bytes.

Verifying source files:

Opening: "Full-0005"
Target: "Full-0005" - damaged. Found 1999 of 2000 data blocks.

Scanning extra files:

Repair is required.
1 file(s) exist but are damaged.
You have 1999 out of 2000 data blocks available.
You have 1 recovery blocks available.
Repair is possible.
1 recovery blocks will be used to repair.

Computing Reed Solomon matrix.
Constructing: done.
Solving: done.

Wrote 19327343417 bytes to disk

Verifying repaired files:

Opening: "Full-0005"
Target: "Full-0005" - damaged. Found 1999 of 2000 data blocks.
Repair Failed.

I'll try with another program. In the meantime, if you can give me some hints, I can run the program under a debugger and see if I can catch a bug.

apalazzi commented 3 years ago

So far I've tried QuickPar, MultiPar and phpar, but none of them succeeded. I'm also under the impression that all of those programs are, in one way or another, forks of par2cmdline, so if this is a bug in the core functions it is present in all of them.

Do you know of a PAR recovery program that definitely has a different code base?

mdnahas commented 3 years ago

I believe par2cmdline was originally written by the same author as QuickPar, but he made big improvements to his program. I'm positive MultiPar is different. I don't know about phpar.

I'm the designer of the math for Par2. I don't know much about the code for par2cmdline. I don't know what to say. As a random thought --- a very random thought --- is it possible the file is set read-only or you don't have permissions to write the file?

Beyond that, I'm afraid that I am not much help. You're welcome to download and compile the code and add your own debugging info.

animetosho commented 3 years ago

> I'm positive MultiPar is different. I don't know about phpar.

phpar2 is forked from par2cmdline. MultiPar is different, though the author claims it was originally a C port of par2cmdline, so it may be inspired by the code base to some degree.

The only other completely different implementation I know of is gopar, but it seems to be more of a proof of concept than a "real world application". You could give it a spin, but I wouldn't expect it to work miracles.

apalazzi commented 3 years ago

I confirm that it still doesn't work with MultiPar; I've also tried gopar, but I get an error (see here).

Could it be that the big size of the archive is the source of the issue? The main file "Full-0005" is 15G and the block size is >9 M.

mdnahas commented 3 years ago

The limits in the spec are for 64-bit file lengths. Some clients may not use 64-bit values to store them ... but that would violate the spec. I don't suppose you're working with an older filesystem that limits file lengths to 2 or 4GB?
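The size question can be checked against the figures printed by par2repair earlier in the thread. The arithmetic below is only a sanity check on those two numbers; the variable names are illustrative.

```python
file_size = 19327343417   # bytes, "total size of the data files" in the log
block_size = 9663672      # bytes, "block size used" in the log

block_count = -(-file_size // block_size)        # ceiling division
padding = block_count * block_size - file_size   # unused bytes in the last block

fits_in_u32 = file_size < 2**32   # would a 32-bit length field be enough?
fits_in_u64 = file_size < 2**64   # the spec's 64-bit length field

print(block_count, padding, fits_in_u32, fits_in_u64)
```

This reproduces the 2000 blocks reported by the tool, with 583 bytes of padding in the last block. The file is just under 18 GiB, so its length does not fit in 32 bits but is far below the 64-bit limit.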

Part of the spec is that every file contains a packet that says which client created the PAR2 file. On an error, a client is supposed to print out the contents of that packet, so that we can track down a client that makes a bad file. I find it strange that par2cmdline isn't printing it --- we should fix that. For the moment, you could try running "strings" or a hex-editor on the smallest PAR2 file and see if it contains the name of the client. It should be right after the text "PAR 2.0\0Creator\0".

If you can find out the program that created the file, you could try using that to repair.
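As an alternative to "strings" or a hex editor, the lookup can be scripted. This is a minimal sketch that relies only on the marker quoted above, plus the assumption that the client name is ASCII text ending at NUL padding or at the next packet's "PAR2\0PKT" magic. The function name `find_creator` is hypothetical.

```python
CREATOR_TYPE = b"PAR 2.0\x00Creator\x00"   # packet type text quoted in the thread
PACKET_MAGIC = b"PAR2\x00PKT"              # assumed start-of-packet marker


def find_creator(par2_bytes, max_len=256):
    """Return the text following the creator packet type, or None if absent."""
    start = par2_bytes.find(CREATOR_TYPE)
    if start == -1:
        return None
    begin = start + len(CREATOR_TYPE)
    body = par2_bytes[begin:begin + max_len]
    # Cut at the next packet first, then at the first NUL padding byte.
    for terminator in (PACKET_MAGIC, b"\x00"):
        cut = body.find(terminator)
        if cut != -1:
            body = body[:cut]
    return body.decode("ascii", errors="replace")


# Synthetic packet for demonstration; the 40 zero bytes are placeholders
# for the header fields that precede the type, not real values.
sample = PACKET_MAGIC + b"\x00" * 40 + CREATOR_TYPE + b"Created by example client.\x00\x00"
print(find_creator(sample))
```

On a real set one would pass the contents of the smallest PAR2 file, for example `find_creator(open("Full-0005.par2", "rb").read())`.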

apalazzi commented 3 years ago

The recovery data was created with par2cmdline v0.7.4. Here is the output from gopar:

Loaded file description packet for "Full-0005" (ID=0be49d13888f6c69ea09b2307d58f0dd, 19327343417 bytes)
Loaded checksums for file with ID 0be49d13888f6c69ea09b2307d58f0dd
Loaded main packet: slice byte count=9663672, recovery set size=1, non-recovery set size=0
Loaded creator packet with client ID "Created by par2cmdline version 0.7.4."
Hash mismatch for "Full-0005" (ID 0be49d13888f6c69ea09b2307d58f0dd)
[1/1] Loaded data file "Full-0005" (19327343417 bytes, 1999 hits, 9663672 misses)
Corrupt data chunk: "Full-0005" (ID 0be49d13888f6c69ea09b2307d58f0dd), bytes 18148376016 to 18158039687

I'll try v0.7.4 and see if that makes a difference.
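The byte range in gopar's "Corrupt data chunk" line can be mapped back to a block number with the block size from the main packet. This is a small arithmetic check with illustrative variable names.

```python
block_size = 9663672                                # "slice byte count" in the gopar output
first_byte, last_byte = 18148376016, 18158039687    # corrupt range reported by gopar

index, offset = divmod(first_byte, block_size)      # 0-based block index, offset within it
length = last_byte - first_byte + 1                 # the range is inclusive

print(index, offset, length == block_size)
```

The range starts exactly on a block boundary and spans one full block: block 1878 counting from zero, that is the 1879th of the 2000 blocks. This agrees with the single missing block reported by par2repair.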

apalazzi commented 3 years ago

BTW the repair also fails with gopar...

apalazzi commented 3 years ago

To add some more info, with gopar the repair fails with the following message:

Loaded file description packet for "Full-0005" (ID=0be49d13888f6c69ea09b2307d58f0dd, 19327343417 bytes)
Loaded checksums for file with ID 0be49d13888f6c69ea09b2307d58f0dd
Loaded main packet: slice byte count=9663672, recovery set size=1, non-recovery set size=0
Loaded creator packet with client ID "Created by par2cmdline version 0.7.4."
Hash mismatch for "Full-0005" (ID 0be49d13888f6c69ea09b2307d58f0dd)
[1/1] Loaded data file "Full-0005" (19327343417 bytes, 1999 hits, 9663672 misses)
Corrupt data chunk: "Full-0005" (ID 0be49d13888f6c69ea09b2307d58f0dd), bytes 18148376016 to 18158039687
Loaded recovery packet: exponent=3, byte count=9663672
Loaded recovery packet: exponent=4, byte count=9663672
Loaded recovery packet: exponent=5, byte count=9663672
Loaded recovery packet: exponent=6, byte count=9663672
[1] Loaded volume file "Full-0005.vol003+004.par2"
Repair error: hash mismatch in reconstructed data

apalazzi commented 3 years ago

@mdnahas just for info, are you also following the bug report I submitted to gopar? The author is very responsive and willing to dig into this issue, and I think the data we're getting can be really useful for understanding what's going on here.

Zrin commented 2 years ago

I've probably encountered the same issue while using par2cmdline v0.8.1 on Debian 11.4 (bullseye) with ZFS, for DCPs that include files over 20 GB in size. par2cmdline seems to behave erratically. In all cases there are plenty of repair blocks available and only a few are needed.

The first repair run shows 4 damaged files, each needing 1 to 3 repair blocks. It then claims that 3 of the 4 files were repaired and finishes with "Repair Failed."
The second repair run finds that another file is damaged, not the one shown as damaged in the first run. It runs the repair and claims success: "Repair complete."
However, a subsequent verify run shows that the file that was detected as damaged in the first run is still damaged.

Shall I post further details here or open a new issue? par2_problem.txt

apalazzi commented 2 years ago

Hi @Zrin, in my case I strongly suspect that the culprit was a faulty memory module and that the redundancy data was incorrect right from the start, which made recovery impossible. I recommend that you run a memory testing program to see whether your RAM is good or faulty, especially if you're not using ECC RAM.
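The failure mode suspected here can be modelled in a few lines. The sketch assumes a hypothetical scenario: the per-block checksums were computed from good data, but one bit flipped in memory while the recovery data was being computed. XOR parity stands in for the real Reed-Solomon code, and all names and data are illustrative.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


blocks = [b"alpha---", b"bravo---", b"charlie-"]
checksums = [hashlib.md5(b).digest() for b in blocks]   # taken from good data

# Simulated RAM fault at creation time: one bit of block 0 flips
# before the recovery data is computed from it.
flipped = bytes([blocks[0][0] ^ 0x01]) + blocks[0][1:]
bad_parity = xor_blocks([flipped, blocks[1], blocks[2]])

# Later, block 1 is damaged on disk and rebuilt from the (intact) other
# blocks plus the silently wrong recovery data.
rebuilt = xor_blocks([blocks[0], blocks[2], bad_parity])
repair_ok = hashlib.md5(rebuilt).digest() == checksums[1]
print(repair_ok)
```

In this model every tool produces the same wrong block and then fails its own post-repair check, which would match a repair failing across several independent implementations.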

Zrin commented 2 years ago

Hi @apalazzi, I think faulty RAM is very unlikely to be the cause, because it seems that I can reproduce the issue, and even if the error correction blocks were faulty, the tool should not attempt to repair the wrong file(s). Nevertheless, I'll see what memtester reports. I'll also run more tests on different machines.

Zrin commented 2 years ago

So far it seems that there are issues with the SATA controller on the system where I experienced the problems. Nevertheless, par2cmdline could be more resilient in such a situation. I'll run more tests to confirm.

animetosho commented 1 year ago

I happened to look at the inversion code. par2cmdline has an assert if a failure occurs, so it will actually crash if the known PAR2 math defect is encountered.
On the other hand, MultiPar seems to ignore a recovery block and retry when such an issue occurs.

Since par2cmdline didn't crash in these cases, I suspect the most likely cause to be related to bad memory during PAR2 create, as suggested by apalazzi. MultiPar and ParPar have built-in memory checksumming to try to detect these sorts of issues, though there's only so much software can do against a hardware fault.

Faulty disks (or related, such as bad I/O controllers) can be a mixed bag. If it happens during create, the bad data should be caught by the checksum when verifying/repairing. If the I/O fault occurs during repair instead, you could get odd behaviour - in such a case, I don't think software can do much about it other than report that something isn't right.

All in all, it's best to work on reliable hardware. Unfortunately, you generally need it to be reliable during create, the result of which often goes untested until you need to repair.
MultiPar/ParPar can provide a little extra margin of safety with PAR2 creation, if this is a concern.

Zrin commented 1 year ago

> Faulty disks (or related, such as bad I/O controllers) can be a mixed bag.

The issue I've encountered was that the controller (on the mainboard) delivered corrupted data under certain circumstances. Reading from the same (huge) file multiple times gave different data. Replacing the mainboard solved the issue.

To detect that, one can checksum the file(s) multiple times and compare the results. A very "careful" tool might do that when a problem is detected or at the user's request.
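A minimal sketch of that idea, assuming SHA-256 as the checksum and three passes (both arbitrary choices), with hypothetical helper names. Repeated reads may be served from the operating system's cache rather than from the disk, so matching digests do not prove that the hardware is healthy.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Stream the file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reads_are_consistent(path, passes=3):
    """True if every pass over the file produced the same digest."""
    digests = {file_digest(path) for _ in range(passes)}
    return len(digests) == 1
```

A call such as `reads_are_consistent("Full-0005")` returning False would point at the storage path rather than at the PAR2 tool.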

Thank you all for responding!

animetosho commented 1 year ago

> To detect that, one can checksum the file(s) multiple times and compare.

That might work for your specific case, but it is not guaranteed to work for others. Repeatedly reading the file may not even do much, e.g. if it's cached by the OS or by some RAID controller or the like.
And that ignores the fact that it'd greatly reduce performance, and hence generally be undesirable.

I think trying to find the cause of a fault is out of scope for a PAR2 tool. There are all sorts of things that could go wrong (e.g. a bug in par2cmdline, a bad PAR2 file, OS bugs, hardware faults, etc.) and it'd be extremely difficult, if not impossible, to try to check everything.
To me, it makes the most sense for the tool to detect a fault, report it and leave it up to the user to troubleshoot.

Zrin commented 1 year ago

> it'd be extremely difficult/impossible to try to check everything. To me, it makes the most sense for the tool to detect a fault, report it and leave it up to the user to troubleshoot.

Exactly. It should be sufficient to check the file checksum(s) before reporting a successful repair and, on a mismatch, to alert the user that something unexpected is going on.

animetosho commented 1 year ago

par2cmdline already does that.
Of course, with unreliable hardware, there's no guarantee.

If the built-in post-repair checksum check doesn't feel good enough to you, you're free to run a subsequent verification pass. Of course, still no guarantee if we're talking unreliable hardware.

Parchive / par2cmdline

Cannot repair file even though there are enough blocks #156