Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0
bladebit 3.0.0 crashes #389

Open jayhohoho2019 opened 10 months ago

jayhohoho2019 commented 10 months ago

bladebit frequently crashes on my plotter. Most recent crash had the following screen dump:

Progress update: 0.48

Prunning table 4...

Finished prunning table 4 in 9.35 seconds.

Progress update: 0.51

Prunning table 3...

Finished prunning table 3 in 9.02 seconds.

Progress update: 0.55

Finished Phase 2 in 28.75 seconds.

Running Phase 3

Compressing tables 2 and 3...

STDERR: Crashed!

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z12CrashHandleri+0xaa)[0x55a6e54e3eda]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f425d036520]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7f425d1a3bba]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x55a6e564f1f8]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x55a6e5659822]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x55a6e54e4f90]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f425d088b43]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f425d11aa00]

STDERR: Dumping crash to crash.log

However, I am unable to find crash.log.
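For anyone triaging the trace above, each frame can be split into binary, mangled symbol, offset, and address. This is a minimal sketch and not part of bladebit: `Frame`, `parse_frame`, and the regular expression are hypothetical helpers that assume the `module(symbol+offset)[address]` layout seen in the STDERR lines.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the STDERR lines in this report:
#   [STDERR: ]<module>(<mangled symbol>+0x<offset>)[0x<address>]
# The symbol part is empty for frames without symbol info (e.g. libc here).
FRAME_RE = re.compile(
    r"^(?:STDERR:\s*)?"
    r"(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


class Frame(NamedTuple):
    module: str
    symbol: str   # mangled C++ name, or "" when absent
    offset: int
    address: int


def parse_frame(line: str) -> Optional[Frame]:
    """Return a Frame for a backtrace line, or None for any other line."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return Frame(
        module=match["module"],
        symbol=match["symbol"],
        offset=int(match["offset"], 16),
        address=int(match["address"], 16),
    )
```

Applied to the trace above, the first frame in the bladebit binary after the crash handler is `_Z15WriteParkThreadP12WriteParkJob+0x218`, which is the useful starting point for a symbol lookup.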

The command I ran is:

chia plotters bladebit ramplot -c '"$c"' -f '"$f"' -r '"$r"' -n '"$n"' -v -w -d '"$dst"' --compress '"$compress"'

where r = 85 (the server has 52 physical cores), n = 150, and compress = 3. The dst drive is nowhere near full.
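For reproduction, the invocation above with the reported values filled in could be assembled as follows. This is only a sketch: `build_ramplot_command` is a hypothetical helper, the flags are copied from the command line in this report, and the `-c`, `-f`, and `-d` values are placeholders because the report does not give them.

```python
import shlex


def build_ramplot_command(c, f, dst, r=85, n=150, compress=3):
    """Build the argv list for the command quoted in this report.

    Only flags that appear in the report are used; their values for
    r, n, and compress default to the ones the reporter stated.
    """
    return [
        "chia", "plotters", "bladebit", "ramplot",
        "-c", c,
        "-f", f,
        "-r", str(r),
        "-n", str(n),
        "-v", "-w",
        "-d", dst,
        "--compress", str(compress),
    ]


# Placeholder values: the real -c, -f, and -d arguments are not in the report.
cmd = build_ramplot_command("<c>", "<f>", "<dst>")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run` rather than a shell string avoids the nested quoting seen in the original command.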

The server has dual Xeon Gold 6230R CPUs and 512 GB of RAM. No GPU.

The OS is Ubuntu Server 22.04.3 LTS, 5.15.0-82-generic #91-Ubuntu SMP Mon Aug 14 14:14:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Upgraded to Chia 2.0 from source and installed bladebit with the supplied install-plotter.sh script. The bladebit and bladebit_cuda binaries are installed in ~/chia-blockchain/venv/bin.

Happens consistently. Never happened before bladebit 3.0. The only other programs running at the same time are chia_harvester/daemon and plow.py.

jayhohoho2019 commented 10 months ago

When the most recent crash above occurred, 7 plots had been created. The 8th was a plot.tmp file in the final directory.

For the first 3 plots, each plot took around 6.8 minutes to complete, consistent with bladebit 2, then it took progressively longer to complete each plot. The 7th plot took 7.92 minutes to complete, and the 8th plot crashed at the beginning of Phase 3.
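To put a number on the slowdown described above, the relative increase between the early plots and the 7th plot can be computed directly. A minimal sketch; `slowdown_percent` is a hypothetical helper and the two inputs are the timings quoted in this comment.

```python
def slowdown_percent(baseline_minutes: float, later_minutes: float) -> float:
    """Relative increase of later_minutes over baseline_minutes, in percent."""
    return (later_minutes - baseline_minutes) / baseline_minutes * 100.0


# 6.8 minutes for the first plots, 7.92 minutes for the 7th plot.
print(f"{slowdown_percent(6.8, 7.92):.1f}% slower")  # prints "16.5% slower"
```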

jayhohoho2019 commented 10 months ago

Found crash.log in ~/chia-blockchain, but it contains the same info as the STDERR above:

/home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z12CrashHandleri+0xaa)[0x55a6e54e3eda]

/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f425d036520]

/lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7f425d1a3bba]

/home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x55a6e564f1f8]

/home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x55a6e5659822]

/home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x55a6e54e4f90]

/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f425d088b43]

/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f425d11aa00]

jayhohoho2019 commented 10 months ago

Did another run with n = 60. It crashed again on the second plot, this time with the following messages:

Progress update: 0.55

Finished Phase 2 in 31.15 seconds.

Running Phase 3

Compressing tables 2 and 3...

STDERR:

STDERR: Fatal Error:

STDERR: Overran park buffer: 6355 / 6352
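Reading the message as "amount written / buffer capacity" (an assumption: the report does not say what the two numbers or their units mean), the write went past the end of the buffer by a small margin. The sketch below pulls the two numbers out of such a line; `park_overrun` and its regular expression are hypothetical helpers, not bladebit code.

```python
import re
from typing import Optional

# Matches the fatal error line quoted in this report.
OVERRUN_RE = re.compile(r"Overran park buffer:\s*(\d+)\s*/\s*(\d+)")


def park_overrun(line: str) -> Optional[int]:
    """Return (first number - second number) from an overrun line, else None.

    Assumption: the first number is what was written and the second is
    the buffer capacity, so a positive result is the size of the overrun.
    """
    match = OVERRUN_RE.search(line)
    if match is None:
        return None
    written, capacity = int(match.group(1)), int(match.group(2))
    return written - capacity


print(park_overrun("STDERR: Overran park buffer: 6355 / 6352"))  # prints 3
```

Under that reading, the buffer was exceeded by 3 units out of 6352.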

jayhohoho2019 commented 10 months ago

The third run crashed on the fourth plot, again at the beginning of Phase 3:

Finished Phase 2 in 27.46 seconds.

Running Phase 3

Compressing tables 2 and 3...

STDERR: Crashed!

STDERR:

jayhohoho2019 commented 10 months ago

This keeps happening every few plots and is driving me crazy. Won't be able to replot until this is fixed. I'm not using cuda or diskplot. I'm using the good old ramplot that I was using in v1 and v2. So this crash problem is actually a regression for me. Would appreciate it very much if it could be fixed.

The following error shows up about half of the time it crashes:

STDERR: Overran park buffer: 6355 / 6352

jayhohoho2019 commented 10 months ago

Please note I'm running this from the CLI. No GUI involved.

jayhohoho2019 commented 10 months ago

After upgrading to 2.0.0, a new config file was generated with chia init.