Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered #445

Open 9cat opened 6 months ago

9cat commented 6 months ago

Bladebit Chia Plotter
Version : 3.1.0-dev
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: gcc 11.4.0

[Global Plotting Config]
Will create 1 plots.
Thread count : 16
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false
Farmer public key : xxxxxxxxxxxxxxxxxx367
Pool contract address : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Compression Level : 7
Benchmark mode : disabled
Warning: 16G mode is experimental and still under development.
Please use the --check parameter to validate plots when using this mode.
Direct I/O not supported in 16G mode at the moment. Disabing it.

[Bladebit CUDA Plotter]
Host RAM : 31 GiB
Plot checks : enabled ( 2 )
Plot check threshold: 0.600

Selected cuda device 0 : NVIDIA GeForce GTX 1070
CUDA Compute Capability : 6.1
SM count : 15
Max blocks per SM : 32
Max threads per SM : 2048
Async Engine Count : 2
L2 cache size : 2.00 MB
L2 persist cache max size : 0.00 MB
Stack Size : 1.00 KB
Memory:
  Total : 7.92 GB
  Free : 7.84 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required : 4979771088 bytes ( 4749.08 MiB or 4.64 GiB )
Intermediate RAM required : 4529922048 bytes ( 4320.07 MiB or 4.22 GiB )
Host RAM required : 2147483648 bytes ( 2048.00 MiB or 2.00 GiB )
Total Host RAM required : 7127254736 bytes ( 6797.08 MiB or 6.64 GiB )
GPU RAM required : 6314045440 bytes ( 6021.54 MiB or 5.88 GiB )
Allocating buffers...
Done.

Generating plot 1 / 1: d7ce4357f4139ba7acf4c1d2ba211981a8b2da90377004661b5a1226201ab726
Plot temporary file: /nvme/chia/output/plot-k32-c07-2023-12-11-03-01-d7ce4357f4139ba7acf4c1d2ba211981a8b2da90377004661b5a1226201ab726.plot.tmp

Generating F1
Finished F1 in 665.61 seconds.
Table 2 completed in 923.05 seconds with 4294939070 entries.
Table 3 completed in 1197.82 seconds with 4294899125 entries.
Table 4 completed in 1524.01 seconds with 4294813535 entries.
Table 5 completed in 1387.38 seconds with 4294566059 entries.
Table 6 completed in 1138.78 seconds with 4294122506 entries.
Table 7 completed in 831.97 seconds with 4293295409 entries.
Finalizing Table 7
Finalized Table 7 in 352.01 seconds.
Completed Phase 1 in 8020.65 seconds
Marked Table 6 in 49.38 seconds.
Marked Table 5 in 38.19 seconds.
Marked Table 4 in 38.14 seconds.
Marked Table 3 in 36.93 seconds.
Completed Phase 2 in 162.66 seconds
Compressing Table 2 and 3...
Step 1 completed step in 631.68 seconds.
Step 2 completed step in 451.55 seconds.
Completed table 2 in 1083.24 seconds with 3439777426 / 4294899125 entries ( 80.09% ).
Compressing tables 3 and 4...
Step 1 completed step in 481.10 seconds.
CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

Panic!!! Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x53)[0x56302ebc6d93]
./bladebit_cuda(_Z9PanicExitv+0xf)[0x56302ed5827f]
./bladebit_cuda(+0xb9e7f)[0x56302eba9e7f]
./bladebit_cuda(_ZN15GpuUploadBuffer11UploadArrayEPKvjjjjPKjP11CUstream_st+0x549)[0x56302ebaa829]
./bladebit_cuda(_ZN15GpuUploadBuffer11UploadArrayEPKvjjjjPKj+0x13)[0x56302ebaaa33]
./bladebit_cuda(_Z22CudaK32PlotPhase3Step2R18CudaK32PlotContext+0x695)[0x56302eb8bc35]
./bladebit_cuda(_Z17CudaK32PlotPhase3R18CudaK32PlotContext+0x1286)[0x56302eb83756]
./bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x122f)[0x56302eb7739f]
./bladebit_cuda(main+0xc1f)[0x56302eb6b2df]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f7dd8970d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f7dd8970e40]
./bladebit_cuda(_start+0x25)[0x56302eb6c9a5]
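
For anyone reading the trace above: the mangled symbols are easier to follow once the function names are pulled out. The Python sketch below is only a rough aid, not a real demangler. It assumes the frame layout seen in this trace (module(symbol+offset)[address]) and reads only the length-prefixed name parts of Itanium-ABI mangled symbols, ignoring parameter types; a tool such as c++filt would give full signatures. The helper names (FRAME_RE, qualified_name, parse_frames) are made up for this illustration and are not part of bladebit.

```python
import re

# A few frames copied verbatim from the trace above.
TRACE = (
    "./bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x53)[0x56302ebc6d93] "
    "./bladebit_cuda(_Z9PanicExitv+0xf)[0x56302ed5827f] "
    "./bladebit_cuda(+0xb9e7f)[0x56302eba9e7f] "
    "./bladebit_cuda(_ZN15GpuUploadBuffer11UploadArrayEPKvjjjjPKjP11CUstream_st+0x549)[0x56302ebaa829] "
    "./bladebit_cuda(_Z22CudaK32PlotPhase3Step2R18CudaK32PlotContext+0x695)[0x56302eb8bc35] "
    "./bladebit_cuda(main+0xc1f)[0x56302eb6b2df]"
)

# Assumed frame layout: module(symbol+offset)[address]; the symbol may be empty.
FRAME_RE = re.compile(
    r"(?P<module>[^\s()]+)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\s*\[(?P<address>0x[0-9a-f]+)\]"
)


def qualified_name(symbol):
    """Best-effort function name from a mangled symbol (parameter types ignored)."""
    if not symbol.startswith("_Z"):
        return symbol  # plain C symbol such as main, or an empty string
    pos = 2
    nested = symbol[pos:pos + 1] == "N"  # N...E wraps a nested name like Class::method
    if nested:
        pos += 1
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])  # each name part is prefixed by its length
        parts.append(symbol[end:end + length])
        pos = end + length
        if not nested:
            break  # an unnested name has exactly one part
    return "::".join(parts) if parts else symbol


def parse_frames(text):
    """Return one dict per frame found in text, with a readable 'name' added."""
    frames = []
    for match in FRAME_RE.finditer(text):
        frame = match.groupdict()
        frame["name"] = qualified_name(frame["symbol"]) or "<no symbol>"
        frames.append(frame)
    return frames


for index, frame in enumerate(parse_frames(TRACE)):
    print(f"#{index} {frame['name']} +{frame['offset']}")
```

Read this way, every frame is a host-side C++ function: the panic is raised from GpuUploadBuffer::UploadArray, called from CudaK32PlotPhase3Step2. Since CUDA can report an illegal access at a later API call than the one that caused it, the upload call may only be where the error surfaced rather than where it originated.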


Environment: Ubuntu 22.08, 32 GB of RAM, NVIDIA GTX 1070 GPU.


ChatGPT result:

It looks like you've encountered a fatal error in your CUDA program. The error message indicates a CUDA error with the code cudaErrorIllegalAddress: an illegal memory access was encountered. This error typically occurs when the program tries to access memory that it's not allowed to access.

Here are a few steps you can take to troubleshoot and fix the issue:

Check the CUDA Error Code: The specific CUDA error code is 700 (cudaErrorIllegalAddress). This suggests a memory access violation. You should carefully review the parts of your code that deal with memory management, especially memory allocation, copying, and freeing.

Review CUDA Kernel Code: If your code includes CUDA kernel functions (_Z22CudaK32PlotPhase3Step2R18CudaK32PlotContext, _Z17CudaK32PlotPhase3R18CudaK32PlotContext, etc.), review the corresponding kernel code to ensure there are no out-of-bounds memory accesses or other memory-related issues.

Check CUDA API Calls: Review the CUDA API calls, especially memory-related functions such as cudaMemcpy, cudaMalloc, and cudaFree. Make sure that memory allocations are correct and that data is being transferred correctly between the host and the device.

Verify GPU Compatibility: Ensure that your GPU is compatible with the CUDA version you are using. Some GPUs may have specific requirements or limitations with certain CUDA versions.

Memory Profiling: Consider using CUDA memory-checking tools, such as NVIDIA Nsight or Compute Sanitizer (the successor to CUDA-MEMCHECK), to identify memory-related issues in your code.

Update CUDA Drivers: Make sure you have the latest NVIDIA CUDA drivers installed. Outdated drivers can sometimes cause compatibility issues.

Check GPU Resource Limits: Verify that you are not exceeding the GPU's memory limits. CUDA kernels may fail if they attempt to allocate more memory than is available on the GPU.

Review Recent Changes: If the error started occurring after recent code changes, review those changes for potential issues.
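
On the resource-limit point, the figures already printed in the log can be checked with a little arithmetic. The sketch below is a rough sanity check only. It copies the byte counts from the "Allocating buffers" section and the "Free : 7.84 GB" line, and it assumes the worst-case reading of "GB" as 10^9 bytes, since the log does not say which unit is meant. The variable names are chosen for this illustration.

```python
GIB = 2**30

# Byte counts copied from the "Allocating buffers" section of the log.
kernel_ram = 4979771088
host_ram = 2147483648
total_host_ram = 7127254736
gpu_ram_required = 6314045440

# "Free : 7.84 GB" from the device section; the unit is an assumption.
gpu_free_reported = 7.84


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# In this log, the total host figure is exactly kernel + host.
host_total_matches = kernel_ram + host_ram == total_host_ram

# Pessimistic reading: treat the reported "GB" as 10**9 bytes.
free_bytes_pessimistic = gpu_free_reported * 10**9
headroom = free_bytes_pessimistic - gpu_ram_required

print(f"GPU RAM required : {gib(gpu_ram_required):.2f} GiB")
print(f"GPU free (pessimistic) : {gib(free_bytes_pessimistic):.2f} GiB")
print(f"Headroom : {gib(headroom):.2f} GiB")
print(f"Total host = kernel + host : {host_total_matches}")
```

Under that assumption the up-front allocation fits in the free GPU memory with room to spare, which is consistent with the log printing "Allocating buffers... Done." before plotting starts. This says nothing about what happens later in Phase 3; it only suggests the crash is not a simple shortfall at allocation time.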


Any idea how to solve it?

DCTech2k commented 5 months ago

Same issue on Debian 12 with 16GB mode while 128GB mode works fine on the same setup.

GPU: GTX 1080 (as secondary while AMD is primary using MESA driver)
Driver: 545.23.08
CUDA: 12.3

hedandan1989 commented 4 months ago

Same problem on Ubuntu 22.04 LTS, driver 535, GTX 1080.