Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

Invalid plot header magic, plots full of garbage, bug? #5422

Closed rthorntn closed 3 years ago

rthorntn commented 3 years ago

Hi,

TL;DR: single server, getting “Invalid plot header magic” on 8 of my 14 plots. Server hardware, ECC RAM, datacentre NVMe, Ubuntu 20.04. No ECC errors and all disks pass SMART. I'm using three NVMe drives and five datacentre HDDs, and the issue shows up on plots from all three NVMe drives and on all five HDDs.

If it’s a hardware error it’s an obscure one, or multiple failures.

More detail:

Dual Xeon, DDR4 ECC Type: 8x32GB Multi-bit ECC (edac-util -rfull output: mc0:noinfo:all:UE:0 / mc0:noinfo:all:CE:0 / mc1:noinfo:all:UE:0 / mc1:noinfo:all:CE:0)

3 x nvme drives (plotting)

drive 1 - 3 plots
drive 2 - 3 plots
drive 3 - 1 plot

1 x nvme drives (-2) - drive that all temp files get written to before going to HDD (brand new drive)

5 x 14T sata hdd (farming) new drives

drive 1 - 4 plots (1 bad, from sequence 2) [nvme drive 1 & 2 plots come here]
drive 2 - 2 plots (2 bad, one from each sequence) [nvme drive 3 plots come here]
drive 3 - 4 plots (2 bad, from sequence 1) [nvme drive 1 plots come here]
drive 4 - 2 plots (1 bad, from sequence 1) [nvme drive 2 plots come here]
drive 5 - 2 plots (2 bad, one from each sequence) [nvme drive 2 plots come here]

14 plots complete (7 in sequence 1, 7 in sequence 2), 8 bad (4 in sequence 1, 3 in sequence 2) wtf!!!

Error:

2021-05-18T08:09:34.615 chia.plotting.plot_tools : ERROR Failed to open file /mnt/field02/plot-k32-2021-05-17-06-57-xxx.plot. Invalid plot header magic
Traceback (most recent call last):
  File "/home/rthorntn/chia-blockchain/chia/plotting/plot_tools.py", line 189, in process_file
    prover = DiskProver(str(filename))
ValueError: Invalid plot header magic
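To see whether the failures cluster on one farming drive, the log can be tallied by plot directory. This is a minimal sketch: the regular expression is modelled only on the single error line quoted above, so other chia versions may word it differently, and `bad_plots_by_dir` is a hypothetical helper name, not part of chia.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Pattern modelled on the one log line quoted in this post (an assumption
# about the format; adjust it if your log lines differ).
BAD_MAGIC = re.compile(
    r"Failed to open file (?P<path>\S+\.plot)\. Invalid plot header magic"
)


def bad_plots_by_dir(log_text):
    """Count 'Invalid plot header magic' failures per plot directory."""
    counts = Counter()
    for match in BAD_MAGIC.finditer(log_text):
        plot_dir = PurePosixPath(match.group("path")).parent
        counts[str(plot_dir)] += 1
    return counts


sample = (
    "2021-05-18T08:09:34.615 chia.plotting.plot_tools : ERROR Failed to open file "
    "/mnt/field02/plot-k32-2021-05-17-06-57-xxx.plot. Invalid plot header magic"
)
print(bad_plots_by_dir(sample))
```

If every farming directory shows up in the tally, as it does here, a single bad HDD is an unlikely explanation.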

So I googled it, and what I read points to disk issues and possibly RAM.

I have ECC, right, so it shouldn’t be that?

I didn’t get 8 plots from any one nvme, I guess all nvme drives could be bad?

The brand new Intel 750 could be bad but because it’s a single point of failure wouldn’t it corrupt all plots?

I have failed plots on all HDD drives, surely all 5 drives can’t be bad?

Cosmic rays, SATA bus corruption, SATA controller issue, who knows?

A bug, any other way to verify the plots?

I’m pretty pissed that I only have 6 good plots out of 14, less than 50% success rate.

Please help, lol, preferably in a way that will get all of my 14 plots to pass the check…

Should I just stop plotting until I figure it out, who knows…

I just lowered the chia plots RAM (-b) from 8000 to 4000.

I just changed -2 to be on the same drive as -t.

Command: screen -d -m -S chia01 bash -c 'cd /home/xxx/chia-blockchain && . ./activate && sleep 0h && chia plots create -k 32 -b 4000 -e -r 4 -u 128 -n 32 -t /mnt/1600gb_1/temp1 -2 /mnt/1600gb_1 -d /mnt/field01 | tee /home/rthorntn/chialogs/chia011.log'

Here goes. I will check in 10 hours to see if it made any difference.

With this command:

hexdump -c plot-xxx.plot | less

Working plots show:

0000000   P   r   o   o   f       o   f       S   p   a   c   e       P
0000010   l   o   t

Bad plots don't have that header, yet they take up the same 102GB as the others.
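The same header check can be scripted instead of eyeballing hexdump for each file. A minimal sketch, assuming the header is exactly the 19-byte string seen in the working plots above; `has_plot_magic` and `find_bad_plots` are hypothetical helper names, and this only inspects the first bytes, so it does not validate the rest of the plot.

```python
import os
import tempfile

# Header string taken from the hexdump of a working plot above (19 bytes).
PLOT_MAGIC = b"Proof of Space Plot"


def has_plot_magic(path):
    """Return True if the file begins with the expected header string."""
    with open(path, "rb") as f:
        return f.read(len(PLOT_MAGIC)) == PLOT_MAGIC


def find_bad_plots(directory):
    """List *.plot files in `directory` that do not start with the magic."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".plot") and not has_plot_magic(os.path.join(directory, name)):
            bad.append(name)
    return bad


# Demo on throwaway files; a real run would point at a farming directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.plot"), "wb") as f:
        f.write(PLOT_MAGIC + b"\x40\x98\xa9")
    with open(os.path.join(tmp, "bad.plot"), "wb") as f:
        f.write(b"\x00" * 64)
    print(find_bad_plots(tmp))
```

Because it reads only a few bytes per file, it is cheap to run right after each plot lands on the HDD, which would show whether plots start bad or go bad later.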

$ hexdump plot-k32-2021-05-17-06-57-good.plot

0000000 7250 6f6f 2066 666f 5320 6170 6563 5020
0000010 6f6c 4074 a998 9ed1 5637 7a8b 0dd4 ee82
0000020 a75d cce9 7566 6246 08ee 0c4e 3163 7da6
0000030 0853 2024 0400 3176 302e 8000 f696 c07b
0000040 fed0 ef51 1cf4 6635 9a30 c54f 72db 0fe3
0000050 721a 4572 b887 0a20 ee5e 8d86 260e dd1b

$ hexdump plot-k32-2021-05-17-17-12-bad.plot

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00000e0 0000 0000 0000 1000 26da 4487 0000 1400
00000f0 40eb 047a 0000 1900 bc0b 043a 0000 1900
0000100 d60b 0074 0000 1900 d60b b074 0000 0000
0000110 0000 0000 0000 0000 0000 0000 0000 0000
*

(I have to ^C on the bad hexdump, as the cursor just freezes on the next line after the * at the bottom.)

The bad plot looks like 102GB of absolute garbage.
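To tell a zero-filled file apart from random garbage without waiting on hexdump, one option is to look for the first non-zero byte within a bounded prefix. This is a rough sketch; `first_nonzero_offset` and its `limit` and `chunk_size` parameters are hypothetical names chosen for illustration.

```python
import os
import tempfile


def first_nonzero_offset(path, limit=1 << 20, chunk_size=1 << 16):
    """Return the offset of the first non-zero byte within the first
    `limit` bytes of the file, or None if that prefix is all zeros."""
    offset = 0
    with open(path, "rb") as f:
        while offset < limit:
            chunk = f.read(min(chunk_size, limit - offset))
            if not chunk:
                break  # reached end of file
            stripped = chunk.lstrip(b"\x00")
            if stripped:
                return offset + len(chunk) - len(stripped)
            offset += len(chunk)
    return None


# Demo on a throwaway file: 231 zero bytes followed by some data.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.plot")
    with open(demo, "wb") as f:
        f.write(b"\x00" * 231 + b"\x10\xda\x26")
    print(first_nonzero_offset(demo))
```

A working plot should report offset 0, since it starts with the header string. A large offset or `None` means the start of the file was never written or was overwritten with zeros.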

I could handle corruption, but why would the files be empty?

Thanks. Richard

rthorntn commented 3 years ago

OK, so the latest sequence of 7 plots just completed and they all check out. It looks like either removing the separate -2 NVMe drive from the equation or lowering the RAM from 8000 to 4000 might have fixed it. I say “might” because the plots could still be “going bad” over time, in which case some of my older plots will go bad too. I only started checking plots after the 2nd sequence had completed, so I don’t know whether the 8 bad plots started bad or went bad.

Will be keeping a close eye on it.

rthorntn commented 3 years ago

One of those two changes I made fixed it.