Open PubuduSaneth opened 1 year ago
When looking into the error, I found that NVIDIA has mentioned that MIG services should be disabled on the GPUs (i.e., Parabricks does not support MIG mode).
Disabling MIG services of GPUs in interactive mode
nvidia-smi -mig 0
(reference - link)
Since the NGS-GPU pipeline completed on the UiO ML nodes, I checked the MIG status on FOX and ML9. MIG M. is Enabled on the FOX interactive login and N/A on the ML nodes, which agrees with the comment in the NVIDIA post.
Checking the MIG status on FOX
> salloc --account=ec232 --time=4:00:00 --mem=126G --ntasks=26 --gpus=2 --res=course --partition=migtest
## MIG M. - Enabled
Sat Nov 11 15:45:47 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 PCIe On | 00000000:C1:00.0 Off | On |
| N/A 24C P0 45W / 350W | 89MiB / 81559MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 PCIe On | 00000000:E1:00.0 Off | On |
| N/A 23C P0 44W / 350W | 89MiB / 81559MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 1 7 0 0 | 12MiB / 9984MiB | 14 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 8 0 1 | 12MiB / 9984MiB | 14 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 9 0 2 | 12MiB / 9984MiB | 14 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 11 0 3 | 12MiB / 9984MiB | 14 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Checking the MIG status on ML9
$ nvidia-smi
## MIG M. - N/A
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:41:00.0 Off | N/A |
| 35% 58C P2 160W / 350W| 12901MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:61:00.0 Off | N/A |
| 33% 50C P2 154W / 350W| 10235MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:C1:00.0 Off | N/A |
| 34% 52C P2 149W / 350W| 10235MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:E1:00.0 Off | N/A |
| 33% 59C P2 279W / 350W| 16077MiB / 24576MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
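For a quicker comparison of the two nodes than reading the full tables, the MIG mode can also be queried per GPU. This is a minimal sketch, assuming the installed driver supports the `mig.mode.current` query field (it should be listed by `nvidia-smi --help-query-gpu`). The helper name `mig_enabled_gpus` is hypothetical, and GPUs without MIG support are assumed to report some value other than `Enabled`, such as `[N/A]`.

```shell
#!/usr/bin/env bash
# Print the index of every GPU whose current MIG mode is "Enabled".
# Reads CSV lines of the form "index, name, mig.mode.current" on stdin.
mig_enabled_gpus() {
    # Fields are separated by a comma followed by optional spaces.
    awk -F', *' '$3 == "Enabled" { print $1 }'
}

# Usage on a GPU node (requires nvidia-smi on PATH):
#   nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv,noheader | mig_enabled_gpus
```

An empty result would mean that no GPU on the node currently has MIG enabled, which is what the ML9 output above suggests for the RTX 3090 cards.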
I was not able to disable the MIG services on the FOX interactive login. nvidia-smi -mig 0 raised an Insufficient Permissions error, and sudo nvidia-smi -mig 0 failed when I provided my user password (the one I use to log in to fox.educloud.no).
$ nvidia-smi -mig 0
Unable to disable MIG Mode for GPU 00000000:C1:00.0: Insufficient Permissions
Terminating early due to previous errors.
Trying with sudo
$ sudo nvidia-smi -mig 0
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
[sudo] password for ec-pubuduss:
Sorry, try again.
[sudo] password for ec-pubuduss:
Sorry, try again.
[sudo] password for ec-pubuduss:
sudo: 3 incorrect password attempts
You should not try to disable MIG, as it will break other things. We do have GPUs without MIG on Fox as well. Just as a test, could you please try on Monday to request exactly 1 GPU from your pipeline and see what happens on the MIG GPUs?
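The suggested test could look roughly like the following: the same allocation as in the earlier salloc call but with `--gpus=1`, followed by a count of what the job can actually see. This is only a sketch. The helper `count_visible_devices` is hypothetical, and it assumes that `nvidia-smi -L` prints one `GPU N:` line per physical GPU and one indented `MIG ...` line per MIG device.

```shell
#!/usr/bin/env bash
# Count physical GPUs and MIG devices from `nvidia-smi -L` style output on stdin.
count_visible_devices() {
    awk '
        /^GPU [0-9]+:/ { gpus++ }   # one line per physical GPU
        /^[ \t]+MIG /  { migs++ }   # one indented line per MIG device
        END { printf "gpus=%d mig_devices=%d\n", gpus, migs }
    '
}

# Usage (assumed workflow, reusing the account, reservation and partition from above):
#   salloc --account=ec232 --time=4:00:00 --mem=126G --ntasks=26 --gpus=1 --res=course --partition=migtest
#   nvidia-smi -L | count_visible_devices
```

Comparing this count with the number of GPUs the pipeline asks for may help narrow down a "Number of GPUs requested" error.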
GPU pipeline raises
Bad argument value: Number of GPUs requested (6) is more than number of GPUs
Java version used for the Nextflow run:
CUDA toolkit version for GPU processes:
Detailed error