Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0
10.81k stars 2.03k forks source link

[BUG] Plotter segfault for some plots #504

Closed eFishCent closed 3 years ago

eFishCent commented 3 years ago

Describe the bug Plotter segfault in phase 1

[121795.335173] chia[78466]: segfault at 40 ip 00007f2f7fb16a6c sp 00007f2f7b248670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7f2f7fadf000+71000]
[121795.335201] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 <c6> 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8

To Reproduce Start plotting process with following:

ID: faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2
Plot size is: 32
Buffer size is: 5000MiB
Using 128 buckets
Using 1 threads of stripe size 65536

Expected behavior Completed plot

Relevant log file lines Crash occurred at following point:

Starting phase 1/4: Forward Propagation into tmp files... Mon Nov  2 05:58:17 2020
Computing table 1
F1 complete, time: 189.863 seconds. CPU (99.39%) Mon Nov  2 06:01:27 2020
...
Forward propagation table time: 4354.224 seconds. CPU (98.820%) Mon Nov  2 08:15:36 2020
Computing table 4
        Bucket 0 uniform sort. Ram: 3.620GiB, u_sort min: 3.250GiB, qs min: 0.813GiB.
        Bucket 1 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
...
        Bucket 76 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
        Bucket 77 uniform sort. Ram: 3.620GiB, u_sort min: 3.250GiB, qs min: 0.813GiB.

Segmentation fault      (core dumped)

Desktop (please complete the following information):

Additional context Contact @eFishCent on keybase for more info.

wjblanke commented 3 years ago

Not much to go on but adding this

37a0f: 0f 29 56 f0 movaps %xmm2,-0x10(%rsi) 37a13: 48 89 b4 24 88 02 00 mov %rsi,0x288(%rsp) 37a1a: 00 37a1b: 49 83 c6 04 add $0x4,%r14 37a1f: 4c 39 b4 24 80 00 00 cmp %r14,0x80(%rsp) 37a26: 00 37a27: 0f 84 e6 07 00 00 je 38213 PyInit_chiapos@@Base+0x53f3 37a2d: 41 0f b7 06 movzwl (%r14),%eax 37a31: 4c 8d 24 80 lea (%rax,%rax,4),%r12 37a35: 41 0f b7 46 02 movzwl 0x2(%r14),%eax 37a3a: 49 c1 e4 04 shl $0x4,%r12 37a3e: 4c 03 a4 24 c0 02 00 add 0x2c0(%rsp),%r12 37a45: 00 37a46: 48 8d 1c 80 lea (%rax,%rax,4),%rbx 37a4a: 48 c1 e3 04 shl $0x4,%rbx 37a4e: 48 03 9c 24 a0 02 00 add 0x2a0(%rsp),%rbx 37a55: 00 37a56: 80 7c 24 2f 01 cmpb $0x1,0x2f(%rsp) 37a5b: 48 83 9c 24 b0 00 00 sbbq $0xffffffffffffffff,0xb0(%rsp) 37a62: 00 ff 37a64: 80 bc 24 2e 01 00 00 cmpb $0x80,0x12e(%rsp) 37a6b: 80 *** SEGFAULT 37a6c: c6 43 40 01 movb $0x1,0x40(%rbx) 37a70: 0f 86 3a fc ff ff jbe 376b0 PyInit_chiapos@@Base+0x4890 37a76: 48 8b 73 30 mov 0x30(%rbx),%rsi 37a7a: 48 8b 53 38 mov 0x38(%rbx),%rdx 37a7e: 8b 8c 24 b8 00 00 00 mov 0xb8(%rsp),%ecx 37a85: 48 8b 7c 24 38 mov 0x38(%rsp),%rdi 37a8a: e8 81 89 ff ff callq 30410 _ZNSs6assignERKSs@@Base+0x11870 37a8f: 48 8b 73 20 mov 0x20(%rbx),%rsi 37a93: 48 8b 53 28 mov 0x28(%rbx),%rdx 37a97: b9 80 00 00 00 mov $0x80,%ecx 37a9c: 48 8b 7c 24 68 mov 0x68(%rsp),%rdi 37aa1: e8 6a 89 ff ff callq 30410 _ZNSs6assignERKSs@@Base+0x11870 37aa6: 66 0f 6f 94 24 90 04 movdqa 0x490(%rsp),%xmm2 37aad: 00 00 37aaf: 66 0f 6f 9c 24 a0 04 movdqa 0x4a0(%rsp),%xmm3 37ab6: 00 00 37ab8: 48 8b 84 24 e0 04 00 mov 0x4e0(%rsp),%rax 37abf: 00 37ac0: 0f b6 bc 24 e8 04 00 movzbl 0x4e8(%rsp),%edi 37ac7: 00 37ac8: 66 0f 6f a4 24 b0 04 movdqa 0x4b0(%rsp),%xmm4 37acf: 00 00 37ad1: 66 0f 6f ac 24 c0 04 movdqa 0x4c0(%rsp),%xmm5

wjblanke commented 3 years ago

The disassembly looks like jibberish to me, so I am leaning towards some kind of corruption.

Unfortunately I got past the error trying to reproduce the plot. Is it reproducible on your side? It's running a lot faster over here so maybe I missed something.

chiapos HEAD detached at 0.12.33

nohup ./ProofOfSpace -k 32 -u 128 -h 1 -b 5000 -i faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2 create &

Starting plotting progress into temporary dirs: . and . ID: faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2 Plot size is: 32 Buffer size is: 5000MiB Using 128 buckets Using 1 threads of stripe size 65536

...

Total matches: 4294966418

Forward propagation table time: 3377.315 seconds. CPU (77.300%) Thu Nov 12 17:27:53 2020 Computing table 4 ... Bucket 76 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB. Bucket 77 uniform sort. Ram: 3.620GiB, u_sort min: 3.250GiB, qs min: 0.813GiB. Bucket 78 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.

eFishCent commented 3 years ago

faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2

Unfortunately I got past the error trying to reproduce the plot. Is it reproducible on your side? It's running a lot faster over here so maybe I missed something.

I tried to repro it, nada. I am guessing it was a memory corruption. Will let you know if I get more and try to narrow down repro steps.

LeonidShamis commented 3 years ago

My plotter got twice Segmentation fault (core dumped):

(venv) leonid@antec:~/chia-blockchain$ [4]+ Segmentation fault (core dumped) chia plots create -k 32 -n 10 -u 128 -b 4508 -r 4 -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ -d /mnt/drive/d/plots/ > antec-k32-ccd-u128-b4508-r4-date +%Y-%m-%d-%H-%M-%S.log 2>&1 (venv) leonid@antec:~/chia-blockchain$

(venv) leonid@antec:~/chia-blockchain$ dmesg | grep -B 2 -A 6 "segfault" [526650.381097] mce: CPU3: Core temperature/speed normal [527183.148246] show_signal_msg: 18 callbacks suppressed [527183.148249] chia[19256]: segfault at e00 ip 00007f329020ea6c sp 00007f328ec37670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7f32901d7000+71000] [527183.148265] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8 [528538.468066] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 28091215) [528538.468067] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 28091262) [528538.469137] mce: CPU7: Core temperature/speed normal [528538.469138] mce: CPU3: Core temperature/speed normal [528844.127296] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 28098009)

[544360.071912] mce: CPU7: Core temperature/speed normal [544360.071913] mce: CPU3: Core temperature/speed normal [544956.631469] chia[20199]: segfault at db0 ip 00007fba62d15a6c sp 00007fba615eb670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7fba62cde000+71000] [544956.631493] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8 [544975.972915] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 28324502) [544975.972916] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 28324455) [544975.974015] mce: CPU3: Core temperature/speed normal [544975.974016] mce: CPU7: Core temperature/speed normal [545626.110845] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 28327292) `

(venv) leonid@antec:~/chia-blockchain$ chia version 1.0b17 (venv) leonid@antec:~/chia-blockchain$

LeonidShamis commented 3 years ago

Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.

[4]+ Segmentation fault (core dumped) chia plots create -k 32 -n 10 -u 128 -b 4508 -r 4 -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ -d /mnt/drive/d/plots/ > antec-k32-ccd-u128-b4508-r4-date +%Y-%m-%d-%H-%M-%S.log 2>&1 (venv) leonid@antec:~/chia-blockchain$

[595409.964438] chia[21909]: segfault at 7c0 ip 00007fc7d3893a6c sp 00007fc7ce804670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7fc7d385c000+71000] [595409.964456] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8

LeonidShamis commented 3 years ago

Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.

Here is my system information - fairly old system:

System: Host: antec Kernel: 5.4.0-52-generic x86_64 bits: 64 Desktop: N/A Distro: Ubuntu 20.04.1 LTS (Focal Fossa) Machine: Type: Desktop Mobo: ASUSTeK model: P7P55D-E PRO v: Rev 1.xx serial: <superuser/root required> BIOS: American Megatrends v: 0610 date: 12/10/2009 CPU: Topology: Quad Core model: Intel Core i7 860 bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 1204 MHz min/max: 1200/2801 MHz Core speeds (MHz): 1: 1205 2: 1212 3: 1229 4: 1228 5: 1226 6: 1278 7: 1299 8: 1297 Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Juniper PRO [Radeon HD 5750] driver: radeon v: kernel Display: server: Colin Harrison X.org 11.0 driver: radeon resolution: 1536x864~N/A OpenGL: renderer: llvmpipe (LLVM 10.0.0 128 bits) v: 3.1 Mesa 20.0.8 Audio: Device-1: Intel 5 Series/3400 Series High Definition Audio driver: snd_hda_intel Device-2: Advanced Micro Devices [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series] driver: snd_hda_intel Sound Server: ALSA v: k5.4.0-52-generic Network: Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 IF: enp5s0 state: up speed: 1000 Mbps duplex: full mac: e0:cb:4e:9b:5b:9b Drives: Local Storage: total: 1.06 TiB used: 6.48 GiB (0.6%) ID-1: /dev/sda vendor: Seagate model: ST3160827AS size: 149.05 GiB ID-2: /dev/sdb vendor: Seagate model: ST3500418AS size: 465.76 GiB ID-3: /dev/sdc vendor: Seagate model: ST3500418AS size: 465.76 GiB Partition: ID-1: / size: 145.71 GiB used: 6.48 GiB (4.4%) fs: ext4 dev: /dev/sda2 Sensors: System Temperatures: cpu: 36.0 C mobo: N/A gpu: radeon temp: 28 C Fan Speeds (RPM): N/A gpu: radeon fan: 0 Info: Processes: 170 Uptime: 8m Memory: 15.63 GiB used: 287.8 MiB (1.8%) Shell: bash inxi: 3.0.38

LeonidShamis commented 3 years ago

Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.

Happened again after system restart with the same -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ drive. Another process with -t /mnt/drive/b/plots/ -2 /mnt/drive/b/plots/ is running fine.

Could it be because of the target drive health? It is an old drive, probably pre-fail.

(venv) leonid@antec:~/chia-blockchain$ dmesg | grep -B 1 -A 5 segfault [61946.828597] show_signal_msg: 18 callbacks suppressed [61946.828600] chia[4182]: segfault at 90 ip 00007f7e3533fa6c sp 00007f7e33c55670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7f7e35308000+71000] [61946.828616] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 <c6> 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8 (venv) leonid@antec:~/chia-blockchain$

wjblanke commented 3 years ago

I doubt it is related to the drive. The fact that the disassembly shows the crash happening in the same spot would seem to mean it isn't corruption. But I am not sure why we can't reproduce it using the same plotID. The disassembly seems to be in PyInit_chiapos which is pybind11 land and there are no exported symbols...

wjblanke commented 3 years ago

OK I got further by finding similar assembly in ProofOfSpace

_ZNSt6vectorI9PlotEntrySaIS0_EEixEm(): /usr/include/c++/9/bits/stl_vector.h:1043 (discriminator 2) return (this->_M_impl._M_start + __n); 14a8e: 48 8d 2c 80 lea (%rax,%rax,4),%rbp _Z13phase1_threadP10THREADDATA(): /home/wjblanke/covid/chiapos/src/phase1.hpp:354 (discriminator 2) PlotEntry& R_entry = bucket_R[idx_R]; 14a92: 41 0f b7 46 02 movzwl 0x2(%r14),%eax _ZNSt6vectorI9PlotEntrySaIS0_EEixEm(): /usr/include/c++/9/bits/stl_vector.h:1043 (discriminator 2) 14a97: 48 c1 e5 04 shl $0x4,%rbp 14a9b: 48 03 ac 24 d0 01 00 add 0x1d0(%rsp),%rbp 14aa2: 00 14aa3: 48 8d 1c 80 lea (%rax,%rax,4),%rbx 14aa7: 48 c1 e3 04 shl $0x4,%rbx 14aab: 48 03 9c 24 f0 01 00 add 0x1f0(%rsp),%rbx 14ab2: 00 _Z13phase1_threadP10THREADDATA(): /home/wjblanke/covid/chiapos/src/phase1.hpp:357 (discriminator 2) matches++; 14ab3: 80 bc 24 15 01 00 00 cmpb $0x1,0x115(%rsp) 14aba: 01 14abb: 48 83 9c 24 80 00 00 sbbq $0xffffffffffffffff,0x80(%rsp) 14ac2: 00 ff /home/wjblanke/covid/chiapos/src/phase1.hpp:362 (discriminator 2) if (metadata_size <= 128) { 14ac4: 80 bc 24 16 01 00 00 cmpb $0x80,0x116(%rsp) 14acb: 80 /home/wjblanke/covid/chiapos/src/phase1.hpp:360 (discriminator 2) R_entry.used = true; 14acc: c6 43 40 01 movb $0x1,0x40(%rbx) /home/wjblanke/covid/chiapos/src/phase1.hpp:362 (discriminator 2) if (metadata_size <= 128) { 14ad0: 0f 86 0a ff ff ff jbe 149e0 <phase1_thread(THREADDATA)+0x1000> /home/wjblanke/covid/chiapos/src/phase1.hpp:375 Bits(R_entry.right_metadata, metadata_size - 128));

Apparently R_entry.used = true is crashing because I assume R_entry is invalid (bucket_R[idx_R] is NULL?)

                    PlotEntry& L_entry = bucket_L[idx_L];
                    PlotEntry& R_entry = bucket_R[idx_R];

                    if (bStripeStartPair)
                        matches++;

                    // Sets the R entry to used so that we don't drop in next iteration
                    R_entry.used = true;
                    // Computes the output pair (fx, new_metadata)
                    if (metadata_size <= 128) {

Now maybe this could be due to bad RAM, or there is something wrong happening with in bucket_R but not in all runs with plotid.

wjblanke commented 3 years ago

OK I am optimistic we figured this one out. It's not due to bad RAM. Flakiness may be due to our edge case bug causing STL "undefined results".