NAICNO / accelerated_genomics

MIT License

Run GPU pipeline on FOX (interactive login) #4

Open PubuduSaneth opened 1 year ago

PubuduSaneth commented 1 year ago

The GPU pipeline fails with the error: Bad argument value: Number of GPUs requested (6) is more than number of GPUs

PubuduSaneth commented 1 year ago

While looking into the error, I found that NVIDIA states that MIG should be disabled on the GPUs (i.e., Parabricks does not support MIG mode).

Disabling MIG services of GPUs in interactive mode

nvidia-smi -mig 0

(reference - link)

PubuduSaneth commented 1 year ago

Since the NGS-GPU pipeline completed on the UiO ML nodes, I compared the MIG status on FOX and ML9. MIG M. is Enabled on the FOX interactive login and N/A on the ML nodes, which agrees with the comment in the NVIDIA post.

Details:

Checking the MIG status on FOX

$ salloc --account=ec232 --time=4:00:00 --mem=126G --ntasks=26 --gpus=2 --res=course --partition=migtest

## MIG M. - Enabled

Sat Nov 11 15:45:47 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:C1:00.0 Off |                   On |
| N/A   24C    P0              45W / 350W |     89MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:E1:00.0 Off |                   On |
| N/A   23C    P0              44W / 350W |     89MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  1    7   0   0  |              12MiB /  9984MiB  | 14      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    8   0   1  |              12MiB /  9984MiB  | 14      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |              12MiB /  9984MiB  | 14      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   11   0   3  |              12MiB /  9984MiB  | 14      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
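
The output above lists two physical H100 cards and four MIG devices. A minimal sketch for counting both kinds of entries, assuming the usual layout of nvidia-smi -L output (top-level "GPU N:" lines and indented "MIG ..." lines). The helper name count_gpu_entries is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: reads `nvidia-smi -L` style text on stdin and prints
# "<physical GPUs> <MIG devices>" on one line.
# Assumption: physical GPUs appear as "GPU N: ..." at the start of a line and
# MIG devices as indented "MIG ..." lines.
count_gpu_entries() {
    awk '/^GPU [0-9]+:/      { gpus++ }
         /^[[:space:]]+MIG / { migs++ }
         END                 { printf "%d %d\n", gpus, migs }'
}
```

On a GPU node this could be fed with: nvidia-smi -L | count_gpu_entries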

Checking the MIG status on ML9

$ nvidia-smi
## MIG M. - N/A

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
| 35%   58C    P2              160W / 350W|  12901MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:61:00.0 Off |                  N/A |
| 33%   50C    P2              154W / 350W|  10235MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:C1:00.0 Off |                  N/A |
| 34%   52C    P2              149W / 350W|  10235MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090         On | 00000000:E1:00.0 Off |                  N/A |
| 33%   59C    P2              279W / 350W|  16077MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
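
Instead of reading the MIG M. column by eye, the same comparison can be scripted. This is a sketch, assuming nvidia-smi is on the PATH and that the mig.mode.current query field is available in the installed driver. The helper name mig_enabled_gpus is hypothetical.

```shell
# Hypothetical helper: reads CSV rows of the form "index, name, mig_mode"
# on stdin and prints the index of every GPU whose MIG mode is "Enabled".
# GPUs without MIG support (e.g. GeForce cards) report "[N/A]" and are skipped.
mig_enabled_gpus() {
    awk -F', *' '$3 == "Enabled" { print $1 }'
}
```

On a GPU node this could be fed with: nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv,noheader | mig_enabled_gpus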

PubuduSaneth commented 1 year ago

I was not able to disable MIG on the FOX interactive login. nvidia-smi -mig 0 raised an Insufficient Permissions error, and sudo nvidia-smi -mig 0 rejected my user password (the one I use to log in to fox.educloud.no).

Details:

$ nvidia-smi -mig 0
Unable to disable MIG Mode for GPU 00000000:C1:00.0: Insufficient Permissions
Terminating early due to previous errors.

Trying with sudo

$ sudo nvidia-smi -mig 0

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for ec-pubuduss:
Sorry, try again.
[sudo] password for ec-pubuduss:
Sorry, try again.
[sudo] password for ec-pubuduss:
sudo: 3 incorrect password attempts

Sabryr commented 1 year ago

You should not try to disable MIG, as it will break other things. We also have GPUs without MIG on Fox. Just for the sake of testing, could you please try on Monday to request exactly 1 GPU from your pipeline and see what happens on the MIG GPUs?
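
A sketch of what that test allocation might look like, reusing the salloc options from earlier in this thread and changing only the GPU count. The wrapper name request_one_gpu is hypothetical. How the pipeline itself is told to use one GPU is not shown in this thread, so that step is left as a comment.

```shell
# Sketch only: same allocation as earlier in the thread, but with --gpus=1.
# The account, reservation and partition values are copied from that command
# and may need adjusting.
request_one_gpu() {
    salloc --account=ec232 --time=4:00:00 --mem=126G --ntasks=26 \
           --gpus=1 --res=course --partition=migtest
}

# Inside the allocation, the pipeline would also need to request 1 GPU.
# Assumption: if it calls Parabricks pbrun directly, that is the
# --num-gpus option (e.g. --num-gpus 1).
```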