Closed eFishCent closed 3 years ago
Not much to go on but adding this
37a0f: 0f 29 56 f0 movaps %xmm2,-0x10(%rsi) 37a13: 48 89 b4 24 88 02 00 mov %rsi,0x288(%rsp) 37a1a: 00 37a1b: 49 83 c6 04 add $0x4,%r14 37a1f: 4c 39 b4 24 80 00 00 cmp %r14,0x80(%rsp) 37a26: 00 37a27: 0f 84 e6 07 00 00 je 38213 PyInit_chiapos@@Base+0x53f3 37a2d: 41 0f b7 06 movzwl (%r14),%eax 37a31: 4c 8d 24 80 lea (%rax,%rax,4),%r12 37a35: 41 0f b7 46 02 movzwl 0x2(%r14),%eax 37a3a: 49 c1 e4 04 shl $0x4,%r12 37a3e: 4c 03 a4 24 c0 02 00 add 0x2c0(%rsp),%r12 37a45: 00 37a46: 48 8d 1c 80 lea (%rax,%rax,4),%rbx 37a4a: 48 c1 e3 04 shl $0x4,%rbx 37a4e: 48 03 9c 24 a0 02 00 add 0x2a0(%rsp),%rbx 37a55: 00 37a56: 80 7c 24 2f 01 cmpb $0x1,0x2f(%rsp) 37a5b: 48 83 9c 24 b0 00 00 sbbq $0xffffffffffffffff,0xb0(%rsp) 37a62: 00 ff 37a64: 80 bc 24 2e 01 00 00 cmpb $0x80,0x12e(%rsp) 37a6b: 80 *** SEGFAULT 37a6c: c6 43 40 01 movb $0x1,0x40(%rbx) 37a70: 0f 86 3a fc ff ff jbe 376b0 PyInit_chiapos@@Base+0x4890 37a76: 48 8b 73 30 mov 0x30(%rbx),%rsi 37a7a: 48 8b 53 38 mov 0x38(%rbx),%rdx 37a7e: 8b 8c 24 b8 00 00 00 mov 0xb8(%rsp),%ecx 37a85: 48 8b 7c 24 38 mov 0x38(%rsp),%rdi 37a8a: e8 81 89 ff ff callq 30410 _ZNSs6assignERKSs@@Base+0x11870 37a8f: 48 8b 73 20 mov 0x20(%rbx),%rsi 37a93: 48 8b 53 28 mov 0x28(%rbx),%rdx 37a97: b9 80 00 00 00 mov $0x80,%ecx 37a9c: 48 8b 7c 24 68 mov 0x68(%rsp),%rdi 37aa1: e8 6a 89 ff ff callq 30410 _ZNSs6assignERKSs@@Base+0x11870 37aa6: 66 0f 6f 94 24 90 04 movdqa 0x490(%rsp),%xmm2 37aad: 00 00 37aaf: 66 0f 6f 9c 24 a0 04 movdqa 0x4a0(%rsp),%xmm3 37ab6: 00 00 37ab8: 48 8b 84 24 e0 04 00 mov 0x4e0(%rsp),%rax 37abf: 00 37ac0: 0f b6 bc 24 e8 04 00 movzbl 0x4e8(%rsp),%edi 37ac7: 00 37ac8: 66 0f 6f a4 24 b0 04 movdqa 0x4b0(%rsp),%xmm4 37acf: 00 00 37ad1: 66 0f 6f ac 24 c0 04 movdqa 0x4c0(%rsp),%xmm5
The disassembly looks like jibberish to me, so I am leaning towards some kind of corruption.
Unfortunately I got past the error trying to reproduce the plot. Is it reproducible on your side? It's running a lot faster over here so maybe I missed something.
chiapos HEAD detached at 0.12.33
nohup ./ProofOfSpace -k 32 -u 128 -h 1 -b 5000 -i faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2 create &
Starting plotting progress into temporary dirs: . and . ID: faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2 Plot size is: 32 Buffer size is: 5000MiB Using 128 buckets Using 1 threads of stripe size 65536
...
Total matches: 4294966418
Forward propagation table time: 3377.315 seconds. CPU (77.300%) Thu Nov 12 17:27:53 2020 Computing table 4 ... Bucket 76 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB. Bucket 77 uniform sort. Ram: 3.620GiB, u_sort min: 3.250GiB, qs min: 0.813GiB. Bucket 78 uniform sort. Ram: 3.620GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
faf41e4e666e5b5a932b665fb032049b74d35ae4a18eba30f06f89dbab66e0b2
Unfortunately I got past the error trying to reproduce the plot. Is it reproducible on your side? It's running a lot faster over here so maybe I missed something.
I tried to repro it, nada. I am guessing it was a memory corruption. Will let you know if I get more and try to narrow down repro steps.
My plotter got twice Segmentation fault (core dumped):
(venv) leonid@antec:~/chia-blockchain$ [4]+ Segmentation fault (core dumped) chia plots create -k 32 -n 10 -u 128 -b 4508 -r 4 -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ -d /mnt/drive/d/plots/ > antec-k32-ccd-u128-b4508-r4-
date +%Y-%m-%d-%H-%M-%S.log 2>&1
(venv) leonid@antec:~/chia-blockchain$
(venv) leonid@antec:~/chia-blockchain$ dmesg | grep -B 2 -A 6 "segfault"
[526650.381097] mce: CPU3: Core temperature/speed normal
[527183.148246] show_signal_msg: 18 callbacks suppressed
[527183.148249] chia[19256]: segfault at e00 ip 00007f329020ea6c sp 00007f328ec37670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7f32901d7000+71000]
[527183.148265] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80
[544360.071912] mce: CPU7: Core temperature/speed normal
[544360.071913] mce: CPU3: Core temperature/speed normal
[544956.631469] chia[20199]: segfault at db0 ip 00007fba62d15a6c sp 00007fba615eb670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7fba62cde000+71000]
[544956.631493] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80
(venv) leonid@antec:~/chia-blockchain$ chia version
1.0b17
(venv) leonid@antec:~/chia-blockchain$
Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.
[4]+ Segmentation fault (core dumped) chia plots create -k 32 -n 10 -u 128 -b 4508 -r 4 -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ -d /mnt/drive/d/plots/ > antec-k32-ccd-u128-b4508-r4-
date +%Y-%m-%d-%H-%M-%S.log 2>&1 (venv) leonid@antec:~/chia-blockchain$
[595409.964438] chia[21909]: segfault at 7c0 ip 00007fc7d3893a6c sp 00007fc7ce804670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7fc7d385c000+71000]
[595409.964456] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80
Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.
Here is my system information - fairly old system:
System: Host: antec Kernel: 5.4.0-52-generic x86_64 bits: 64 Desktop: N/A Distro: Ubuntu 20.04.1 LTS (Focal Fossa) Machine: Type: Desktop Mobo: ASUSTeK model: P7P55D-E PRO v: Rev 1.xx serial: <superuser/root required> BIOS: American Megatrends v: 0610 date: 12/10/2009 CPU: Topology: Quad Core model: Intel Core i7 860 bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 1204 MHz min/max: 1200/2801 MHz Core speeds (MHz): 1: 1205 2: 1212 3: 1229 4: 1228 5: 1226 6: 1278 7: 1299 8: 1297 Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Juniper PRO [Radeon HD 5750] driver: radeon v: kernel Display: server: Colin Harrison X.org 11.0 driver: radeon resolution: 1536x864~N/A OpenGL: renderer: llvmpipe (LLVM 10.0.0 128 bits) v: 3.1 Mesa 20.0.8 Audio: Device-1: Intel 5 Series/3400 Series High Definition Audio driver: snd_hda_intel Device-2: Advanced Micro Devices [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series] driver: snd_hda_intel Sound Server: ALSA v: k5.4.0-52-generic Network: Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 IF: enp5s0 state: up speed: 1000 Mbps duplex: full mac: e0:cb:4e:9b:5b:9b Drives: Local Storage: total: 1.06 TiB used: 6.48 GiB (0.6%) ID-1: /dev/sda vendor: Seagate model: ST3160827AS size: 149.05 GiB ID-2: /dev/sdb vendor: Seagate model: ST3500418AS size: 465.76 GiB ID-3: /dev/sdc vendor: Seagate model: ST3500418AS size: 465.76 GiB Partition: ID-1: / size: 145.71 GiB used: 6.48 GiB (4.4%) fs: ext4 dev: /dev/sda2 Sensors: System Temperatures: cpu: 36.0 C mobo: N/A gpu: radeon temp: 28 C Fan Speeds (RPM): N/A gpu: radeon fan: 0 Info: Processes: 170 Uptime: 8m Memory: 15.63 GiB used: 287.8 MiB (1.8%) Shell: bash inxi: 3.0.38
Happened 3rd time with the same parameters (different plot ID though each time) - see below. Similar process with only different -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ is running fine.
Happened again after system restart with the same -t /mnt/drive/c/plots/ -2 /mnt/drive/c/plots/ drive. Another process with -t /mnt/drive/b/plots/ -2 /mnt/drive/b/plots/ is running fine.
Could it be because of the target drive health? It is an old drive, probably pre-fail.
(venv) leonid@antec:~/chia-blockchain$ dmesg | grep -B 1 -A 5 segfault [61946.828597] show_signal_msg: 18 callbacks suppressed [61946.828600] chia[4182]: segfault at 90 ip 00007f7e3533fa6c sp 00007f7e33c55670 error 6 in chiapos.cpython-38-x86_64-linux-gnu.so[7f7e35308000+71000] [61946.828616] Code: c0 02 00 00 48 8d 1c 80 48 c1 e3 04 48 03 9c 24 a0 02 00 00 80 7c 24 2f 01 48 83 9c 24 b0 00 00 00 ff 80 bc 24 2e 01 00 00 80 <c6> 43 40 01 0f 86 3a fc ff ff 48 8b 73 30 48 8b 53 38 8b 8c 24 b8 (venv) leonid@antec:~/chia-blockchain$
I doubt it is related to the drive. The fact that the disassembly shows the crash happening in the same spot would seem to mean it isn't corruption. But I am not sure why we can't reproduce it using the same plotID. The disassembly seems to be in PyInit_chiapos which is pybind11 land and there are no exported symbols...
OK I got further by finding similar assembly in ProofOfSpace
_ZNSt6vectorI9PlotEntrySaIS0_EEixEm(): /usr/include/c++/9/bits/stl_vector.h:1043 (discriminator 2) return (this->_M_impl._M_start + __n); 14a8e: 48 8d 2c 80 lea (%rax,%rax,4),%rbp _Z13phase1_threadP10THREADDATA(): /home/wjblanke/covid/chiapos/src/phase1.hpp:354 (discriminator 2) PlotEntry& R_entry = bucket_R[idx_R]; 14a92: 41 0f b7 46 02 movzwl 0x2(%r14),%eax _ZNSt6vectorI9PlotEntrySaIS0_EEixEm(): /usr/include/c++/9/bits/stl_vector.h:1043 (discriminator 2) 14a97: 48 c1 e5 04 shl $0x4,%rbp 14a9b: 48 03 ac 24 d0 01 00 add 0x1d0(%rsp),%rbp 14aa2: 00 14aa3: 48 8d 1c 80 lea (%rax,%rax,4),%rbx 14aa7: 48 c1 e3 04 shl $0x4,%rbx 14aab: 48 03 9c 24 f0 01 00 add 0x1f0(%rsp),%rbx 14ab2: 00 _Z13phase1_threadP10THREADDATA(): /home/wjblanke/covid/chiapos/src/phase1.hpp:357 (discriminator 2) matches++; 14ab3: 80 bc 24 15 01 00 00 cmpb $0x1,0x115(%rsp) 14aba: 01 14abb: 48 83 9c 24 80 00 00 sbbq $0xffffffffffffffff,0x80(%rsp) 14ac2: 00 ff /home/wjblanke/covid/chiapos/src/phase1.hpp:362 (discriminator 2) if (metadata_size <= 128) { 14ac4: 80 bc 24 16 01 00 00 cmpb $0x80,0x116(%rsp) 14acb: 80 /home/wjblanke/covid/chiapos/src/phase1.hpp:360 (discriminator 2) R_entry.used = true; 14acc: c6 43 40 01 movb $0x1,0x40(%rbx) /home/wjblanke/covid/chiapos/src/phase1.hpp:362 (discriminator 2) if (metadata_size <= 128) { 14ad0: 0f 86 0a ff ff ff jbe 149e0 <phase1_thread(THREADDATA)+0x1000> /home/wjblanke/covid/chiapos/src/phase1.hpp:375 Bits(R_entry.right_metadata, metadata_size - 128));
Apparently R_entry.used = true is crashing because I assume R_entry is invalid (bucket_R[idx_R] is NULL?)
PlotEntry& L_entry = bucket_L[idx_L];
PlotEntry& R_entry = bucket_R[idx_R];
if (bStripeStartPair)
matches++;
// Sets the R entry to used so that we don't drop in next iteration
R_entry.used = true;
// Computes the output pair (fx, new_metadata)
if (metadata_size <= 128) {
Now maybe this could be due to bad RAM, or there is something wrong happening with in bucket_R but not in all runs with plotid.
OK I am optimistic we figured this one out. It's not due to bad RAM. Flakiness may be due to our edge case bug causing STL "undefined results".
Describe the bug Plotter segfault in phase 1
To Reproduce Start plotting process with following:
Expected behavior Completed plot
Relevant log file lines Crash occurred at following point:
Desktop (please complete the following information):
Additional context Contact @eFishCent on keybase for more info.