bladebit_cuda only using single GPU (device 0)

XCHSystems commented 1 year ago

Will a future version support multiple GPUs? Will there be an option to specify the device in order to run multiple instances of bladebit_cuda against different GPUs?

I have two A4000 GPUs and bladebit_cuda only uses device 0. Is there a special build or hidden command to utilise more than one GPU?

Simon

XCHSystems commented 1 year ago

Looking at line 12 of CudaPlotter.h

uint32 deviceIndex = 0; // Which CUDA device to use when plotting

So would it be possible to expose this as a command-line option, and perhaps also to allow multiple devices to be used as well as selecting a single device?

harold-b commented 1 year ago

You can already pass the device index as a parameter to the cudaplot command:

bladebit_cuda -f .. -c … cudaplot -d 1 …

harold-b commented 1 year ago

Multiple GPU is already planned but there are a number of other tasks ahead of it

XCHSystems commented 1 year ago

You can already pass the device index as a parameter to the cudaplot command:

bladebit_cuda -f .. -c … cudaplot -d 1 …

Hi Harold

I have already tried the -d 1 option

./bladebit/bladebit_cuda -t 1 -n 1 -f 8f6986edcaa42b3f9ab1abd27df7f2224149787414564629f39f8ceada85bf3abd7dd899296e2d0a9a138875191dd5ab -c xch1jlje9r7ndepgt3rrm4w7taayn0d6yh5654wwv4msx2226z7rx8as2puwzq cudaplot -d 1 /Plotdisks/RAID/

Bladebit Chia Plotter
Version      : 3.0.0-alpha1
Git Commit   : f269db0a7ad307514e993c335897cea7ebf46eda
Compiled With: gcc 9.4.0

[Global Plotting Config]
 Will create 1 plots.
 Thread count          : 1
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : 8f6986edcaa42b3f9ab1abd27df7f2224149787414564629f39f8ceada85bf3abd7dd899296e2d0a9a138875191dd5ab
 Pool contract address : xch1jlje9r7ndepgt3rrm4w7taayn0d6yh5654wwv4msx2226z7rx8as2puwzq
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA RTX A4000
 CUDA Compute Capability   : 8.6
 SM count                  : 48
 Max blocks per SM         : 16
 Max threads per SM        : 1536
 Async Engine Count        : 2
 L2 cache size             : 4.00 MB
 L2 persist cache max size : 3.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 15.73 GB
  Free                     : 9.21 GB

As you can see, it still selects device 0.

The following nvidia-smi output shows that device 0 is already in use by another bladebit_cuda process, while device 1 is idle:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:51:00.0 Off |                  Off |
|100%   57C    P2   132W / 140W |   6516MiB / 16376MiB |     85%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    On   | 00000000:8A:00.0 Off |                  Off |
|100%   26C    P8    16W / 140W |     12MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3628      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      3682      G   /usr/bin/gnome-shell                4MiB |
|    0   N/A  N/A      5954      C   ...ng/bladebit/bladebit_cuda     6016MiB |
|    1   N/A  N/A      3628      G   /usr/lib/xorg/Xorg                  8MiB |
+-----------------------------------------------------------------------------+

XCHSystems commented 1 year ago

Multiple GPU is already planned but there are a number of other tasks ahead of it

Happy to do some testing on that when you are ready, JM knows me :-)

XCHSystems commented 1 year ago

@harold-b

It seems that the -d or --device option is simply ignored by the command. It does not throw an error, as it does if I pass -D instead of -d, but -d 1 or --device 1 has no effect.
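For readers following along, the behaviour described above (an unknown flag such as -D is rejected, while -d and --device are accepted) can be illustrated with a minimal sketch. This is not bladebit's actual parser: the function name `parseDeviceIndex` and its error messages are hypothetical, and the only detail taken from the thread is that the device index defaults to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of bladebit: scans the arguments that follow
// the "cudaplot" subcommand and returns the requested device index.
std::uint32_t parseDeviceIndex(const std::vector<std::string>& args) {
    std::uint32_t deviceIndex = 0;  // default when no option is given
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "-d" || arg == "--device") {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + arg);
            }
            // Consume the next argument as the index.
            deviceIndex = static_cast<std::uint32_t>(std::stoul(args[++i]));
        } else if (!arg.empty() && arg[0] == '-') {
            // Options are case-sensitive, so "-D" ends up here.
            throw std::invalid_argument("unknown option: " + arg);
        }
        // Anything else (for example the output directory) is ignored here.
    }
    return deviceIndex;
}
```

A parser like this returning 1 is only half of the story: the parsed value still has to reach the code that selects the device, which is the part that turned out to be at fault in this issue.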

harold-b commented 1 year ago

I looked over the relevant code over the weekend, and it certainly should be using the parameter, unless I missed something (which is likely). Could you please share a full log with the -d parameter used with something other than 0?

XCHSystems commented 1 year ago

I looked over the relevant code over the weekend, and it certainly should be using the parameter, unless I missed something (which is likely). Could you please share a full log with the -d parameter used with something other than 0?

@harold-b When you say full log, what exactly are you asking for? My currently running plot process was started with -d 1 but is clearly still using GPU 0, so let me know what you need.

XCHSystems commented 1 year ago

@harold-b

Could it be this line in CudaPlotter.h?

uint32 deviceIndex = 0; // Which CUDA device to use when plotting

XCHSystems commented 1 year ago

@harold-b By changing that value to 1 and recompiling, I can now run two instances of bladebit_cuda, each on a different GPU. So it could be that the setting there is overriding the -d or --device option.

harold-b commented 1 year ago

@harold-b

Could it be this line in CudaPlotter.h?

uint32 deviceIndex = 0; // Which CUDA device to use when plotting

That is just the default value. The value gets parsed from CLI here: https://github.com/Chia-Network/bladebit/blob/cuda-compression/cuda/CudaPlotter.cu#L69

But I did find the issue. Device initialization is done before the config is assigned to the context, so the device is selected using the default index. I just need to swap a couple of lines.
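The ordering problem described above can be sketched in a few lines of plain C++. This is an illustration only, not the code from CudaPlotter.cu: the types `PlotterConfig` and `PlotterContext` and the functions `initDevice`, `initBuggy`, and `initFixed` are hypothetical stand-ins, and no CUDA calls are made.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the plotter configuration.
struct PlotterConfig {
    std::uint32_t deviceIndex = 0;  // default value, as in CudaPlotter.h
};

// Hypothetical stand-in for the plotter context.
struct PlotterContext {
    PlotterConfig cfg;                 // holds the defaults until assigned
    std::int64_t selectedDevice = -1;  // -1 means "not initialized yet"
};

// Stand-in for device initialization: it uses whatever config the
// context holds at the moment it is called.
void initDevice(PlotterContext& cx) {
    cx.selectedDevice = cx.cfg.deviceIndex;
}

// Buggy order: the device is initialized before the parsed config is stored.
std::int64_t initBuggy(const PlotterConfig& parsed) {
    PlotterContext cx;
    initDevice(cx);   // still sees the default deviceIndex of 0
    cx.cfg = parsed;  // too late to affect device selection
    return cx.selectedDevice;
}

// Fixed order: store the parsed config first, then initialize the device.
std::int64_t initFixed(const PlotterConfig& parsed) {
    PlotterContext cx;
    cx.cfg = parsed;
    initDevice(cx);
    assert(cx.selectedDevice == parsed.deviceIndex);
    return cx.selectedDevice;
}
```

With a parsed `deviceIndex` of 1, `initBuggy` still selects device 0 while `initFixed` selects device 1. This also explains why editing the default in the header appeared to work: the default was the only value the initialization step ever saw.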

harold-b commented 1 year ago

Fixed in 221fb883990dba6f0d12a9dbdd7de711de41f174

harold-b commented 1 year ago

@FlexiMiners If you get a chance to test that commit, please let me know if it worked for you

XCHSystems commented 1 year ago

@harold-b As soon as my current dual plotting run completes, I will compile and run it and let you know.

XCHSystems commented 1 year ago

@harold-b Yes, that works, thank you. Now all we need is multi-GPU support so that it's not consuming twice as much RAM 👍

yjiangnan commented 1 year ago

@XCHSystems What do you mean by it not consuming twice as much RAM? We will configure our plotting machines with two GPUs but only 256GB of RAM. Could they run two instances of the plotter? Thanks!

XCHSystems commented 1 year ago

@XCHSystems What do you mean by it not consuming twice as much RAM? We will configure our plotting machines with two GPUs but only 256GB of RAM. Could they run two instances of the plotter? Thanks!

You need 256GB of RAM per bladebit_cuda instance. So in order to use two GPUs you need to run two bladebit_cuda instances, which means 512GB of RAM in total.

yjiangnan commented 1 year ago

@XCHSystems So, if my machine is limited to a maximum of 256GB of RAM, will I only be able to use a single GPU? If bladebit_cuda could split the computation across two GPUs to plot faster, that would also be good for me.

XCHSystems commented 1 year ago

@XCHSystems So, if my machine is limited to a maximum of 256GB of RAM, will I only be able to use a single GPU? If bladebit_cuda could split the computation across two GPUs to plot faster, that would also be good for me.

Correct. I know Harold is working on a multi-GPU bladebit_cuda which will only require a single bladebit_cuda instance, but I do not think that will be ready soon.

We plot with multiple GPUs by running multiple instances. In one system we have 1TB of RAM, so we are using four GPUs for plotting (4x A4000), which generates four plots every two minutes.
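The sizing rule from this exchange (one bladebit_cuda instance per GPU, 256GB of RAM per instance) can be written as a small calculation. This is a rough sketch under stated assumptions: the function name `maxInstances` is hypothetical, 1TB is taken to mean 1024GB, and RAM needed by the rest of the system is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many single-GPU plotter instances can run at once,
// given the GPU count and total RAM. The 256 GB default is the per-instance
// figure quoted in this thread.
std::uint32_t maxInstances(std::uint32_t gpuCount,
                           std::uint64_t totalRamGb,
                           std::uint64_t ramPerInstanceGb = 256) {
    assert(ramPerInstanceGb > 0);
    // Number of instances the available RAM can hold (rounded down).
    const std::uint64_t byRam = totalRamGb / ramPerInstanceGb;
    // Each instance drives one GPU, so the GPU count is the other limit.
    return static_cast<std::uint32_t>(
        std::min<std::uint64_t>(gpuCount, byRam));
}
```

Under these assumptions, a machine with two GPUs and 256GB of RAM can run one instance, while four GPUs with 1024GB can run four, which matches the setups described above.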

Chia-Network / bladebit

bladebit_cuda only using single GPU (device 0) #274