Chia-Network / bladebit

A high-performance k32-only, Chia (XCH) plotter supporting in-RAM and disk-based plotting
Apache License 2.0

[Bladebit CUDA Plotter] CUDA error cudaErrorMemoryAllocation : out of memory. RTX3090? #338

Closed: trysys closed this issue 1 year ago

trysys commented 1 year ago

Can anyone help?

[Bladebit CUDA Plotter]
Selected cuda device 1 : NVIDIA GeForce RTX 3090
  CUDA Compute Capability   : 8.6
  SM count                  : 82
  Max blocks per SM         : 16
  Max threads per SM        : 1536
  Async Engine Count        : 2
  L2 cache size             : 6.00 MB
  L2 persist cache max size : 4.50 MB
  Stack Size                : 1.00 KB
  Memory:
    Total : 24.00 GB
    Free  : 22.74 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )
Allocating buffers
CUDA error: 2 (0x2) cudaErrorMemoryAllocation : out of memory

Panic!!! Fatal Error: CUDA error cudaErrorMemoryAllocation : out of memory.

Setup: Windows 11 / every Bladebit alpha.

spleen911 commented 1 year ago

HAVE Memory: Total : 24.00 GB Free : 22.74 GB

NEED Total Host RAM required : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )

You need 240 GB free and you only have 24 GB total: upgrade to 256 GB of system memory.

trysys commented 1 year ago

I have 256 GB of DDR4 in my system.

My GPU has 24 GB of memory.

trysys commented 1 year ago

[Bladebit CUDA Plotter]
Selected cuda device 1 : NVIDIA GeForce RTX 3090
  CUDA Compute Capability   : 8.6
  SM count                  : 82
  Max blocks per SM         : 16
  Max threads per SM        : 1536
  Async Engine Count        : 2
  L2 cache size             : 6.00 MB
  L2 persist cache max size : 4.50 MB
  Stack Size                : 1.00 KB
  Memory:
    Total : 24.00 GB
    Free  : 22.74 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )
Allocating buffers
CUDA error: 2 (0x2) cudaErrorMemoryAllocation : out of memory

Panic!!! Fatal Error: CUDA error cudaErrorMemoryAllocation : out of memory.

I think it's a GPU memory problem.

spleen911 commented 1 year ago

My bad, I read the mem stats wrong.

That is odd. I have 512GB installed since I was doing ramplot before cudaplot was a thing. Note, I have NUMA interleaving enabled in my BIOS since my system is dual Xeon and ramplot requires interleaving. I haven't tried alpha4 yet.

I plotted ~120 TB (uncompressed) using alpha3. I would notice that after long plot runs, or while farming simultaneously, my next cudaplot instance would fail to allocate. Rebooting resolved it for another ~100 plots. Right now, alpha4 is crashing during allocation with an nvlddmkm event logged by Windows (no error in the plotter log), but I'm not surprised, since I am currently farming with flexfarmer in a WSL2 session.

I could never get alpha3, or gigahorse, to allocate inside WSL2, so I run both natively on Windows 11 Pro. I've been using alpha3 for uncompressed plots and gigahorse for compressed ones, since all my farming is with flexfarmer.

I guess I need to reboot and try alpha4.

Long shot, but maybe try the 528.49 nsd-dch-whql driver. I settled on that one for my plotter since it has stable WSL drivers built in. On the Windows side, I wasn't overclocking or doing anything that needed the latest Game Ready driver.

trysys commented 1 year ago

What can I do now?

trysys commented 1 year ago

Can anybody help?

LeroyINC commented 1 year ago

Try creating a compressed plot and see if that works; compressed plots require less RAM. I think a C5 or C7 only needs about 216 GB vs. 241 GB.

If that works, then you know you just don't have enough free RAM. You may need to close any unneeded Windows processes or applications to help free up more RAM.

trysys commented 1 year ago

It's the same problem at C5, C7, or C9.

Panic!!! Fatal Error: CUDA error cudaErrorMemoryAllocation : out of memory.

Look at the mem stats:

C0

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )
Allocating buffers
CUDA error: 2 (0x2) cudaErrorMemoryAllocation : out of memory

C5

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 141733920768 bytes ( 135168.00 MiB or 132.00 GiB )
Total Host RAM required   : 231974445056 bytes ( 221228.07 MiB or 216.04 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )

C7

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 141733920768 bytes ( 135168.00 MiB or 132.00 GiB )
Total Host RAM required   : 231974445056 bytes ( 221228.07 MiB or 216.04 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )
Allocating buffers
CUDA error: 2 (0x2) cudaErrorMemoryAllocation : out of memory

Panic!!! Fatal Error: CUDA error cudaErrorMemoryAllocation : out of memory.

What is happening here?

244 GB is actually free.

LeroyINC commented 1 year ago

Hmm... it could be a GPU memory issue.

Have you tried running nvidia-smi to see what other processes are using the GPU and how much memory they are using? Some process could be hogging memory or cache on the GPU.
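For reference, here is a minimal sketch, not bladebit code, that prints the free/total memory CUDA itself reports for each device; it mirrors the "Memory: Total / Free" line the plotter logs at startup (unlike nvidia-smi, it does not show per-process usage). It assumes the CUDA toolkit is installed, and the file name is illustrative; it would be built with something like nvcc check_gpu_mem.cu -o check_gpu_mem.

```cpp
// check_gpu_mem.cu -- illustrative sketch only, not part of bladebit.
// Prints free/total memory for every CUDA device, similar to the
// "Memory: Total / Free" lines the plotter logs at startup.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; i++)
    {
        cudaSetDevice(i);   // cudaMemGetInfo reports on the *current* device

        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);

        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);

        printf("Device %d: %s  free: %.2f GiB / total: %.2f GiB\n",
               i, prop.name,
               freeBytes  / (1024.0 * 1024.0 * 1024.0),
               totalBytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```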

trysys commented 1 year ago

Does nvidia-smi run on Windows?

bdelgado1995 commented 1 year ago

I've seen this error when I tried running the plotter with only 128 GB of RAM. When I added another 128 GB, the issue went away. So one possibility is that the system RAM is insufficient. However, I see that some of the people with this issue do have enough system RAM installed, so I'm wondering whether the system RAM is not fully available for some reason. I'm using Ubuntu 22, 256 GB of RAM, and an RTX 3070.

honglio commented 1 year ago

I have exactly the same issue when I run bladebit-cuda-v3.0.0-rc1-windows-x86-64 on Windows 11 Pro.

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA GeForce RTX 3080 GPU
  CUDA Compute Capability   : 8.6
  SM count                  : 48
  Max blocks per SM         : 16
  Max threads per SM        : 1536
  Async Engine Count        : 1
  L2 cache size             : 4.00 MB
  L2 persist cache max size : 3.00 MB
  Stack Size                : 1.00 KB
  Memory:
    Total : 10.00 GB
    Free  : 9.69 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288 bytes ( 86060.07 MiB or 84.04 GiB )
Intermediate RAM required : 2999001088 bytes ( 2860.07 MiB or 2.79 GiB )
Host RAM required         : 168443248640 bytes ( 160640.00 MiB or 156.88 GiB )
Total Host RAM required   : 258683772928 bytes ( 246700.07 MiB or 240.92 GiB )
GPU RAM required          : 6140243968 bytes ( 5855.79 MiB or 5.72 GiB )
Allocating buffers
CUDA error: 2 (0x2) cudaErrorMemoryAllocation : out of memory

When I add '--memory' as an option with 'cudaplot', it shows the following:

required : 446676598784 (416Gb)
total    : 409795571712 (384Gb)
available: 396401786880

I have 384 GB of RAM, but 'cudaplot' is supposed to require 256 GB of RAM in total, as stated above on this page.

Please help!

harold-b commented 1 year ago

This seems to be an issue that's popping up with CI-generated builds. If you are able to install the CUDA toolkit locally and build on your machine, it might work. That may not be the same issue for everyone here, but it could be the one that some of you are encountering.

jmhands commented 1 year ago

I think we have also seen this message when there was something wrong with the nvidia driver. You could try a clean install of the drivers, reboot, and see if the issue persists. Can you please paste the entire command line you are using?

honglio commented 1 year ago

@jmhands Thanks for responding. I reinstalled the nvidia driver, but nothing changed. I have 4 GPUs in this PC, and I decided to disable 3 of them to see what would happen. It works now! BTW, I use the command "bladebit_cuda -f -c -n 1 cudaplot /mnt/ssd", which I learned from Chia Decentral.

jmhands commented 1 year ago

With all the GPUs in the system, you could try -d to select the CUDA device (before cudaplot). We have seen all sorts of weird behavior with different nvidia drivers, though. Thanks for reporting.
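For illustration only, here is a minimal sketch of what device selection looks like at the CUDA runtime level: the requested device has to be made current with cudaSetDevice before any buffers are allocated, otherwise allocations land on the default device. This is not bladebit's actual code, and the command-line handling here is hypothetical.

```cpp
// select_device.cu -- illustrative sketch only, not bladebit's code.
// Shows why explicit device selection matters on a multi-GPU box:
// all subsequent allocations land on whichever device is current.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    // Hypothetical equivalent of passing a device index on the command line.
    const int requested = (argc > 1) ? atoi(argv[1]) : 0;

    int count = 0;
    cudaGetDeviceCount(&count);
    if (requested < 0 || requested >= count)
    {
        fprintf(stderr, "Device %d out of range (found %d devices)\n", requested, count);
        return 1;
    }

    // Make the requested GPU current *before* any buffers are created.
    cudaSetDevice(requested);

    // A test allocation on the selected device; failures show up as
    // cudaErrorMemoryAllocation, the same error seen in the plotter log.
    void* buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, 1ull << 30);   // 1 GiB
    printf("cudaMalloc on device %d: %s\n", requested, cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaFree(buf);
    return 0;
}
```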

jmhands commented 1 year ago

Seems like an issue with multiple GPUs, possibly something with pinned memory. We see the same issue in 128 GB mode when Windows doesn't have enough shared/pinned memory available. @harold-b can take a look at some point, but I'm closing this since you have a workaround.
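To make the pinned-memory theory concrete, here is a hedged sketch, not bladebit's allocator, showing that a failed page-locked host allocation via cudaHostAlloc also comes back as cudaErrorMemoryAllocation, even when plenty of GPU VRAM is free. The buffer sizes are arbitrary illustration values, not the plotter's real buffer sizes.

```cpp
// pinned_probe.cu -- hedged sketch, not bladebit's allocator.
// Demonstrates that a failed *pinned host* allocation surfaces as
// cudaErrorMemoryAllocation too, consistent with the shared/pinned-memory
// theory above.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Try to pin progressively larger host buffers and report where it fails.
    for (size_t gib = 8; gib <= 256; gib *= 2)
    {
        void* host = nullptr;
        cudaError_t err = cudaHostAlloc(&host, gib << 30, cudaHostAllocDefault);

        printf("%3zu GiB pinned: %s\n", gib, cudaGetErrorString(err));

        if (err != cudaSuccess)
            break;           // first failure shows the pinned-memory ceiling
        cudaFreeHost(host);  // release before trying the next size
    }
    return 0;
}
```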