brian-team / brian2cuda

A brian2 extension to simulate spiking neural networks on GPUs
https://brian2cuda.readthedocs.io/
GNU General Public License v3.0

Bypass nvidia-smi when using a Jetson platform #312

Closed NikVard closed 8 months ago

NikVard commented 8 months ago

I am currently working on running a Brian2 model on a Jetson AGX Xavier, which runs Ubuntu 20.04 and whose NVIDIA drivers do not provide the nvidia-smi binary. Instead, I am setting the path to the deviceQuery binary, but it is ignored in favor of nvidia-smi.

I noticed that, in practice, if the nvidia-smi binary is not found, the function _run_command_with_output() returns an error and the fallback deviceQuery binary is never used.

mstimberg commented 8 months ago

Hi @NikVard, could you give some more detail on your setup and the error message? Also note that if using deviceQuery does not work, you can always set the parameters manually and disable the automatic GPU detection: https://brian2cuda.readthedocs.io/en/latest/introduction/cuda_configuration.html

NikVard commented 8 months ago

Hi @mstimberg, thanks for the prompt reply. The MWE is as follows:

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

# Set the path to the deviceQuery binary
prefs.devices.cuda_standalone.cuda_backend.device_query_path = "/usr/local/cuda-11.4/samples/1_Utilities/deviceQuery/deviceQuery"

# Run the test
brian2cuda.example_run()

The last line of the error traceback is "RuntimeError: Running 'nvidia-smi -L' failed. This typically means that you have no NVIDIA driver installed. Are you sure there is an NVIDIA GPU on this machine?"

Running the binary manually gives me the correct information (screenshot attached: Screenshot from 2024-02-27 17-19-15).

I followed the instructions from the brian2cuda documentation found here (point 2). Manually setting the preferences devices.cuda_standalone.cuda_backend.detect_gpus = False, devices.cuda_standalone.cuda_backend.compute_capability = 7.2, and devices.cuda_standalone.cuda_backend.gpu_id = 0 seems to work; however, there are other errors (the make jobs get killed by the system, which makes me wonder whether there are other issues). If it helps, here is the full configuration as printed prior to running the model: output.txt.

Note that manually setting the above parameters leads to the example run completing successfully. I took a look at the code and there are provisions for running the deviceQuery binary (in utils/gputools.py), but it looks like a check for whether the nvidia-smi binary is actually available is missing.
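
For illustration, here is a minimal sketch of the kind of existence check and fallback described above, using a hypothetical helper name (this is not the actual code in utils/gputools.py):

import shutil
import subprocess

def list_gpus(device_query_path=None):
    # Prefer nvidia-smi if the binary exists on the PATH
    if shutil.which("nvidia-smi") is not None:
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.splitlines()
    # Fall back to a user-supplied deviceQuery binary (e.g. on Jetson platforms)
    if device_query_path is not None:
        result = subprocess.run([device_query_path], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.splitlines()
    raise RuntimeError("Could not detect GPUs with nvidia-smi or deviceQuery")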

Let me know if there is anything I can test on my end or if I have neglected some information!

denisalevi commented 8 months ago

Just to be sure I understand it right: You do have the nvidia-smi binary, it just comes from an older driver version and does not support querying GPU information? Or do you not have the binary at all?

You need nvidia-smi even if you specify a custom deviceQuery path. If you don't have nvidia-smi at all, you can disable automatic GPU detection altogether, as @mstimberg mentioned:

prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
prefs.devices.cuda_standalone.cuda_backend.compute_capability = <compute_capability>
prefs.devices.cuda_standalone.cuda_backend.runtime_version = <runtime_version>
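
For the Jetson AGX Xavier discussed in this thread, the full script might look like the following (a sketch only, reusing the compute capability 7.2 and GPU id 0 mentioned above):

from brian2 import *
import brian2cuda
set_device("cuda_standalone")

# No nvidia-smi on the Jetson, so disable automatic GPU detection
prefs.devices.cuda_standalone.cuda_backend.detect_gpus = False
# Values for the AGX Xavier: compute capability 7.2, single on-board GPU
prefs.devices.cuda_standalone.cuda_backend.compute_capability = 7.2
prefs.devices.cuda_standalone.cuda_backend.gpu_id = 0

brian2cuda.example_run()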

NikVard commented 8 months ago

@denisalevi Understood. On the Jetson platform, the nvidia-smi binary is not available at all, and from the documentation I understood that if you are using older drivers that do not support nvidia-smi, the deviceQuery binary would be used instead.

On a similar note, should the runtime version also be set manually? Are there any other parameters you would suggest I set?

denisalevi commented 8 months ago

Ah I see. Setting the deviceQuery binary is meant for setups in which nvidia-smi is available, but nvidia-smi --query-gpu=<parameter> is not (that option was only added around CUDA 11.6 I believe). But even if you set deviceQuery, nvidia-smi is still used to get a list of all available GPUs. So in your case without nvidia-smi, it will still fail.

The solution for you is then to disable automatic GPU detection. But you mentioned something about additional errors? If so, full error messages would be helpful.

The runtime version is set automatically via nvcc --version. As long as you have the nvcc binary available (which you need for compilation of the generated code anyways), you should be fine.
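
As an illustration, a minimal sketch of reading the CUDA release from nvcc --version (not necessarily how brian2cuda parses it):

import re
import subprocess

# nvcc --version prints a line like "Cuda compilation tools, release 11.4, V11.4.166"
output = subprocess.run(["nvcc", "--version"], capture_output=True, text=True, check=True).stdout
match = re.search(r"release (\d+\.\d+)", output)
print(match.group(1) if match else "nvcc version not found")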

I just found a typo in the docs. It should be prefs.devices.cuda_standalone.cuda_backend.cuda_runtime_version. But as I said, you probably won't need to set it.

NikVard commented 8 months ago

I was just about to post a message about the typo, but you beat me to it. The other issues I am facing have more to do with memory optimization and I think are not relevant to this issue. Thanks for the help, setting everything manually does work nicely and the test completes successfully!