Qiskit / qiskit-aer

Aer is a high performance simulator for quantum circuits that includes noise models
https://qiskit.github.io/qiskit-aer/
Apache License 2.0

Any explicit or implicit import of Qiskit Aer would initialize all GPUs on the system if the CUDA support is built #1686

Open leofang opened 1 year ago

leofang commented 1 year ago

### Information

### What is the current behavior?

Importing Qiskit Aer, either implicitly or explicitly as shown below, initializes all GPUs on the system, as can be verified by monitoring nvidia-smi (other tools can check this too, but nvidia-smi is the simplest).

### Steps to reproduce the problem

  1. Install qiskit-aer-gpu from PyPI (or build from source; how it's installed is irrelevant as long as the CUDA support is built)
  2. Run any of the following commands to import Aer:
    • python -i -c "import qiskit_aer"
    • python -i -c "import qiskit.providers.aer"
    • python -i -c "from qiskit.providers.aer import AerSimulator"
    • python -i -c "import qiskit; print(qiskit.__qiskit_version__)"
  3. While the Python interpreter is idle waiting for input (due to the interactive prompt -i), check nvidia-smi. On a multi-GPU system, it is clear that the CUDA context is initialized on all GPUs:
    
    $ nvidia-smi 
    Sat Dec 17 19:05:35 2022       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 530.06       Driver Version: 530.06       CUDA Version: 12.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A6000    On   | 00000000:21:00.0 Off |                  Off |
    | 30%   53C    P2    88W / 300W |    267MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
    | 30%   58C    P2    82W / 300W |    431MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA RTX 6000...  On   | 00000000:41:00.0 Off |                  Off |
    | 30%   47C    P2    75W / 300W |    431MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA RTX A6000    On   | 00000000:43:00.0 Off |                  Off |
    | 30%   57C    P2    86W / 300W |    267MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A    316201      C   python                            264MiB |
    |    1   N/A  N/A    316201      C   python                            428MiB |
    |    2   N/A  N/A    316201      C   python                            428MiB |
    |    3   N/A  N/A    316201      C   python                            264MiB |
    +-----------------------------------------------------------------------------+
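For reference, here is a minimal sketch of the same check done programmatically from the idle interpreter, assuming the nvidia-ml-py (pynvml) package is installed; it only reports which GPUs already hold a CUDA context owned by this process:

```python
# Programmatic version of the nvidia-smi check: run this in the same
# interpreter right after importing Aer. Assumes nvidia-ml-py (pynvml).
import os
import pynvml

pynvml.nvmlInit()
me = os.getpid()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if any(p.pid == me for p in procs):
        # This process already holds a CUDA context on GPU i, even though
        # no simulation has been run yet.
        print(f"GPU {i}: context created at import time")
pynvml.nvmlShutdown()
```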



### What is the expected behavior?

Don't initialize the CUDA context at import time at all (whether the import is explicit or implicit). Initializing it there hurts for several reasons:
1. Performance: CUDA context initialization is known to be costly, and it is best to defer the init until it's actually needed. This is how most Python GPU packages (CuPy, PyTorch, TF, ...) work these days.
    - I suspect this bug contributes to some of the performance issues reported earlier, such as #1272. While I don't have direct proof, it is almost certain that multiple processes would at least compete for shared resources; see below.
2. Unexpected behavior: a simple version query like `qiskit.__qiskit_version__` is not expected to initialize GPUs.
    - This impacts all downstream packages directly or indirectly depending on Qiskit Aer (such as cuQuantum Python 😅) 
3. Resource contention: On a shared system like in my example, this bug could interfere with
    - other users sharing the system, unless guarded by sophisticated (and correctly configured) resource management systems such as Slurm, or
    - multiple processes launched from the same main process
4. CI/CD: Many public CI/CD pipelines (e.g. conda-forge) do not have GPUs but still run simple packaging tests for GPU packages. Those tests might fail, depending on how the package (which depends on Aer) is tested.

By the way, this bug is independent of the number of GPUs -- the issue shows up even on a single-GPU system -- but it makes the situation a lot worse on multi-GPU systems like the NVIDIA DGX A100.

### Suggested solutions

The implementation (not the semantics!) of the following two functions needs to be redesigned:
- https://github.com/Qiskit/qiskit-aer/blob/13937fdb596142006bf00caf5676da13b43dfb5a/qiskit_aer/backends/backend_utils.py#L119
- https://github.com/Qiskit/qiskit-aer/blob/13937fdb596142006bf00caf5676da13b43dfb5a/qiskit_aer/backends/backend_utils.py#L137

as together they cause this bug. Currently, Qiskit Aer determines the available methods/devices by running dummy executions and checking for errors. This incurs not only runtime overhead but also initializes the GPUs whenever CUDA support is available.

I would suggest exposing these two attributes all the way from C++ to Python through pybind11. This should be easily doable and would enable much more lightweight checks, which is something we'd also like to ask for (but that could be discussed in a separate ticket) 🙂

Thanks!
leofang commented 1 year ago

cc: @tlubowe @yangcal for vis

leofang commented 1 year ago

(edited to add a CI/CD concern)

doichanj commented 1 year ago

Qiskit Aer uses all the GPUs specified in the CUDA_VISIBLE_DEVICES environment variable. Is it not enough to limit Qiskit Aer to a subset of the available GPUs that way?
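For example, a minimal sketch of that workaround (the variable has to be set before anything CUDA-related is imported):

```python
import os

# Expose only GPU 0 to this process; this must happen before any CUDA
# initialization, i.e. before importing qiskit_aer.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import qiskit_aer  # noqa: E402  (imported after setting the variable)
```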

leofang commented 1 year ago

Hi @doichanj, unfortunately it is not enough, nor is it the preferred solution. CUDA_VISIBLE_DEVICES is a brute-force solution that should only be used when users know exactly what they're doing (usually HPC users; Dask also uses it internally for GPU-process binding), but it's not meant for general users, and it is certainly not how typical Python users launch a process.

Typically, Python GPU users expect to choose the GPU at runtime. There are a number of framework-specific options:

* CuPy: `cupy.cuda.Device(0)`
* PyTorch: `torch.device('cuda:0')`
* TensorFlow: `tensorflow.device('/device:GPU:0')`
* CUDA Python (CUDA's official Python binding): `cuda.cudart.cudaSetDevice(0)`

and these should be honored based on the CUDA Programming Model (the CUDA Runtime APIs would honor the current/active CUDA context).
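As a minimal sketch of that expectation (using CuPy purely as an illustration; it assumes CuPy is installed and at least two GPUs are visible):

```python
import cupy as cp

# Select GPU 1 at runtime; only this device should be touched.
with cp.cuda.Device(1):
    x = cp.arange(10)   # allocated on GPU 1
    print(x.device)     # -> <CUDA Device 1>

# The expectation is that a library honors the current device/context
# rather than initializing every GPU on the system at import time.
```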

Moreover, as described in my report, this impacts even single-GPU users, who might only want to run the CPU backend via `AerSimulator(..., device='CPU', ...)` for whatever reason; that is, in fact, how we discovered this bug 🙂 Many users simply want to install the batteries-included GPU build and pick among all the available backends to suit their needs.

Finally, my report above also listed a number of other impacts, one being that `import qiskit_aer` or even just `print(qiskit.__qiskit_version__)` prematurely initializes GPUs. This affects, for example, @wshanks, who I just noticed is packaging Qiskit Aer on conda-forge (see https://github.com/conda-forge/staged-recipes/pull/21404#issuecomment-1361822587) 😅

basnijholt commented 10 months ago

@doichanj, is fixing this a priority at all? Whom would we have to convince to make it a priority?

I would really like to get CUDA support in the conda packages 😄

doichanj commented 10 months ago

I did not fully understand the point of this issue at first, but I have implemented the target_gpus option to select the GPUs to be used for simulation. However, I think this is not the solution for this issue, right? I think we have to change the way available devices and methods are detected so that GPUs are not initialized when using the CPU simulator. Is that what you want?

I had a high-priority task to release Aer 0.13.1, but I have time to work on this issue now.

leofang commented 9 months ago

> I think we have to change the way available devices and methods are detected so that GPUs are not initialized when using the CPU simulator. Is that what you want?

Thanks, @doichanj, that is correct. Since we have all the knowledge at compile time (we know which compiler flags are set to build which backends, etc.), we can simply store it in static, read-only arrays and query them at run time to see whether a backend was built, without ever initializing a CUDA context or calling any CUDA API 🙂 I'd love to see this fixed ASAP, since working around this issue on the packaging side would take a lot of unnecessary effort. It has also generated multiple bug reports on our side (we have internal and external multi-GPU users).
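To make the idea concrete, here is a purely hypothetical sketch of what the Python side could look like once the build-time information is exported through pybind11; the attribute names below (`BUILT_DEVICES`, `BUILT_METHODS`) do not exist in Aer today:

```python
# Hypothetical sketch only: BUILT_DEVICES / BUILT_METHODS are illustrative
# names, not part of the current Aer API. The idea is that the pybind11
# module exports constants baked in at compile time, e.g. in C++:
#     m.attr("BUILT_DEVICES") = py::make_tuple("CPU", "GPU");
# so querying them never calls into the CUDA driver.
from qiskit_aer.backends.controller_wrappers import (  # hypothetical attributes
    BUILT_DEVICES,
    BUILT_METHODS,
)

def available_devices():
    """Devices compiled into this Aer build, without creating a CUDA context."""
    return tuple(BUILT_DEVICES)

def available_methods():
    """Simulation methods compiled into this Aer build."""
    return tuple(BUILT_METHODS)
```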

MarzioVallero commented 2 days ago

> CUDA_VISIBLE_DEVICES is a brute-force solution that should only be used when users know exactly what they're doing (usually HPC users; Dask also uses it internally for GPU-process binding), but it's not meant for general users, and it is certainly not how typical Python users launch a process.
>
> Typically, Python GPU users expect to choose the GPU at runtime. There are a number of framework-specific options:
>
> * CuPy: `cupy.cuda.Device(0)`
> * PyTorch: `torch.device('cuda:0')`
> * TensorFlow: `tensorflow.device('/device:GPU:0')`
> * CUDA Python (CUDA's official Python binding): `cuda.cudart.cudaSetDevice(0)`
>
> and these should be honored based on the CUDA Programming Model (the CUDA Runtime APIs would honor the current/active CUDA context).

Hello everyone, I was wondering whether the functionality of setting the visible devices at runtime has actually been implemented. I am trying to instantiate multiple Python subprocesses through an orchestrator process and assign a GPU to each subprocess. My goal is to leverage multiple GPUs to run different quantum circuits with different NoiseModels independently and in parallel.

However, I have had no luck: I tried both setting the visible device through `cupy.cuda.Device(rank)` and the CUDA_VISIBLE_DEVICES environment variable, without any success. So far I have only seen a marginal performance improvement when using `batched_shots_gpu=True` and `batched_shots_gpu_max_qubits=30` on NVIDIA A100 GPUs. Do you have any advice or suggestions for achieving this per-subprocess independent GPU visibility?

doichanj commented 1 day ago

Please use the target_gpus option to specify which GPUs should be used for simulation: https://github.com/Qiskit/qiskit-aer/blob/61a557f2bbc62a7942e7eda8da0bff8bcbaa209e/qiskit_aer/backends/aer_simulator.py#L175-L178
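For example, a minimal sketch of binding one GPU per worker process with target_gpus (the rank value and the circuit below are illustrative only):

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

rank = 0  # e.g. the GPU index assigned to this subprocess by the orchestrator

# Each subprocess pins a single GPU via target_gpus instead of relying on
# CUDA_VISIBLE_DEVICES.
sim = AerSimulator(method="statevector", device="GPU", target_gpus=[rank])

qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.measure_all()

result = sim.run(transpile(qc, sim), shots=1024).result()
print(result.get_counts())
```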