Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

bladebit_cuda *** Panic!!! *** Fatal Error: #373

Closed hajes closed 1 year ago

hajes commented 1 year ago

From time to time, bladebit_cuda crashes. My guess is that it is the NVIDIA driver/OS combination, and overheating is also a likely factor, since we have heat waves in Europe. A reboot "cures" the issue and the plotter then runs for many hours without problems. It may also be related to a lack of space on the destination drive, where the plotter simply crashes instead of waiting for free space.

Once plotting has crashed, it will keep crashing until the rig is rebooted.

SuperMicro X9DR7-TF+, 2 x Xeon 2697v2, 512GB DDR-1867MHz, ASUS RTX 3060Ti 8GB

Clear Linux OS (build 39830), kernel 6.1.46-1306.ltscurrent, NVIDIA driver version 535.98, CUDA version 12.2

./bladebit_cuda(_Z9PanicExitv+0xf)[0x560390015e9f]
./bladebit_cuda(_ZN10PlotWriter9WriteDataEPKhm+0x13e)[0x56038fffd05e]
./bladebit_cuda(_ZN10PlotWriter16WriterThreadMainEv+0x270)[0x56038fffda10]
./bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x56038fe89ed0]
/usr/lib64/libc.so.6(+0x94f4a)[0x56038f694f4a]
/usr/lib64/libc.so.6(+0x122b08)[0x56038f722b08]
CUDA error: 4 (0x4 ) cudaErrorCudartUnloading : driver shutting down

*** Panic!!! *** Fatal Error:
CUDA error cudaErrorCudartUnloading : driver shutting down.
./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x5b)[0x56038fe8914b]
./bladebit_cuda(_Z9PanicExitv+0xf)[0x560390015e9f]
./bladebit_cuda(_ZN15GpuUploadBuffer11UploadArrayEPKvjjjjPKjP11CUstream_st+0x16a)[0x56038fe6d3da]
./bladebit_cuda(_ZN15GpuUploadBuffer11UploadArrayEPKvjjjjPKj+0x13)[0x56038fe6d6a3]
./bladebit_cuda(_Z10WritePark7R18CudaK32PlotContext+0x22c)[0x56038fe4fc4c]
./bladebit_cuda(_Z17CudaK32PlotPhase3R18CudaK32PlotContext+0x11e9)[0x56038fe46889]
./bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x95b)[0x56038fe348bb]

Second run: plotting crashes again within a few hours, and only a reboot helps.

*** Panic!!! *** Fatal Error:
Failed to write to plot with error 28:
./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x5b)[0x55c14940214b]
./bladebit_cuda(_Z9PanicExitv+0xf)[0x55c14958ee9f]
./bladebit_cuda(_ZN10PlotWriter9WriteDataEPKhm+0x13e)[0x55c14957605e]
./bladebit_cuda(_ZN10PlotWriter16WriterThreadMainEv+0x270)[0x55c149576a10]
./bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x55c149402ed0]
/usr/lib64/libc.so.6(+0x94f4a)[0x55c148c94f4a]
/usr/lib64/libc.so.6(+0x122b08)[0x55c148d22b08]

jmhands commented 1 year ago

"Failed to write to plot with error 28" means the destination drive ran out of space (error 28 is ENOSPC on Linux). We have this fixed in bladebit 3.1 beta1. The fix should be ready for 256GB mode in 3.1 beta2.
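For anyone scripting around the plotter in the meantime, here is a minimal Python sketch of the two ideas above: decoding error 28, and waiting for free space on the destination instead of failing. This is not bladebit code. The helper names `describe_errno` and `wait_for_free_space`, the polling interval, and the destination path in the usage comment are all hypothetical.

```python
import errno
import os
import shutil
import time


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def wait_for_free_space(path, required_bytes, poll_seconds=30.0, max_polls=None):
    """Block until `path` has at least `required_bytes` free.

    Returns True once enough space is available, or False if `max_polls`
    checks pass without enough space. Hypothetical helper, not part of bladebit.
    """
    polls = 0
    while shutil.disk_usage(path).free < required_bytes:
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)
        polls += 1
    return True


if __name__ == "__main__":
    # Error 28 from the panic message maps to ENOSPC.
    print(describe_errno(28))
    # Example: require roughly 110 GB free before starting a plot
    # (the size and the path are assumptions for illustration).
    # wait_for_free_space("/mnt/plots", 110 * 10**9)
```

A wrapper script could call `wait_for_free_space` on the destination before launching each plot, which avoids the crash on versions that do not yet include the fix.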

As for the first error, I have seen it before: something about the crash takes down the NVIDIA driver. A reboot reloads the driver, or you can manually reload the NVIDIA kernel modules as a workaround. I haven't been able to reproduce that one, though.
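A rough sketch of the manual module reload, written in Python so it can sit in the same wrapper script as other checks. The module list and unload order are assumptions based on a typical NVIDIA proprietary driver install and may differ on your system. The reload needs root, and it fails if any process (a display server, a plotter, a persistence daemon) still holds the GPU. The function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

# Assumed module names for a typical NVIDIA driver install.
# Dependent modules come first so they are unloaded before the core module.
NVIDIA_MODULES = ("nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia")


def reload_commands(modules=NVIDIA_MODULES):
    """Build the modprobe commands: unload in order, then load in reverse."""
    unload = [["modprobe", "-r", m] for m in modules]
    load = [["modprobe", m] for m in reversed(modules)]
    return unload + load


def reload_nvidia_modules(dry_run=True):
    """Print the commands (dry run) or execute them; executing requires root."""
    commands = reload_commands()
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    reload_nvidia_modules(dry_run=True)
```

Running it with `dry_run=True` first shows exactly what would be executed, which is useful for checking the module names against `lsmod` output before doing it for real.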

hajes commented 1 year ago

Yes, it is definitely out of space + driver crash. Clear Linux is very fast, but unstable with custom stuff.

Otherwise, it runs without issues.