Open slow-zhang opened 2 years ago
Individual MIG configurations do not survive a node reboot (only the overall MIG mode setting of a GPU).
It is therefore recommended to configure MIG using the nvidia-mig-parted tool and its complementary systemd service to ensure MIG configurations are reapplied across a reboot.
https://github.com/NVIDIA/mig-parted/tree/main/deployments/systemd
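A minimal sketch of that workflow, assuming nvidia-mig-parted is installed and on PATH. The file path /tmp/mig-config.yaml and the config name all-3g.20gb are hypothetical choices for illustration, and the 3g.20gb profile assumes A100 40GB GPUs. The systemd deployment linked above reads its config from its own location, so check its README for where the file should live.

```shell
#!/usr/bin/env bash
set -euo pipefail

CONFIG_FILE="${CONFIG_FILE:-/tmp/mig-config.yaml}"   # hypothetical path
CONFIG_NAME="${CONFIG_NAME:-all-3g.20gb}"            # hypothetical config name

# Write a declarative MIG layout in the mig-parted config format.
write_mig_config() {
  cat > "$1" <<'EOF'
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF
}

write_mig_config "$CONFIG_FILE"

# Apply only when the tool is present; this needs root and idle GPUs.
if command -v nvidia-mig-parted >/dev/null 2>&1; then
  sudo nvidia-mig-parted apply -f "$CONFIG_FILE" -c "$CONFIG_NAME"
else
  echo "nvidia-mig-parted not found; wrote $CONFIG_FILE only" >&2
fi
```

Keeping the layout in a file rather than in ad hoc nvidia-smi calls is what lets a boot-time service reapply it after every reboot.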
Thank you @klueska. Does "only the overall MIG mode setting of a GPU" mean the setting made by nvidia-smi -mig 1?
Yes, that survives a node reboot; the individual configurations made after MIG mode is enabled do not.
this is cool! thank you Kevin, let me have a try
This is strange. I ran the following commands, but MIG is still disabled on all GPUs after the reboot:
196 2022-11-03 10:18:25 nvidia-smi
198 2022-11-03 10:18:49 nvidia-smi -mig 1
199 2022-11-03 10:19:00 sudo reboot
200 2022-11-03 10:19:04 exit
201 2022-11-03 10:27:26 nvidia-smi
after reboot the status is
root@lasdgx12:~# nvidia-smi
Thu Nov 3 10:27:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03 Driver Version: 450.203.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 31C P0 55W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 29C P0 56W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:47:00.0 Off | 0 |
| N/A 29C P0 52W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:4E:00.0 Off | 0 |
| N/A 30C P0 53W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB On | 00000000:87:00.0 Off | 0 |
| N/A 37C P0 59W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB On | 00000000:90:00.0 Off | 0 |
| N/A 34C P0 55W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | 0 |
| N/A 34C P0 52W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | 0 |
| N/A 36C P0 60W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
What did nvidia-smi show after running nvidia-smi -mig 1 but before the reboot?
The nvidia-smi -mig 1 output said I could reboot to make it take effect, and nvidia-smi showed Enabled with a * next to it.
Sorry, your response is garbled, so I don't quite understand. Can you paste the exact output of:
nvidia-smi -mig 1
followed by the exact output of
nvidia-smi
before rebooting
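One compact way to capture that state is to query the current and pending MIG mode per GPU. The sketch below assumes the installed nvidia-smi supports the mig.mode.current and mig.mode.pending query fields; summarize_mig_mode is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Read "index, current, pending" CSV rows on stdin and report each GPU's state.
summarize_mig_mode() {
  local idx cur pend
  while IFS=, read -r idx cur pend; do
    # Strip the spaces nvidia-smi prints after each comma.
    idx="${idx// /}"; cur="${cur// /}"; pend="${pend// /}"
    if [ -z "$idx" ]; then
      continue
    fi
    if [ "$cur" = "$pend" ]; then
      echo "GPU $idx: MIG $cur"
    else
      echo "GPU $idx: MIG $cur (pending $pend, needs GPU reset or reboot)"
    fi
  done
}

# Only query when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
             --format=csv,noheader | summarize_mig_mode
fi
```

Running this once before the reboot and once after makes it easy to see whether the mode was ever actually applied or was only left pending.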
In any case, as I mentioned before, I would recommend against using nvidia-smi to do your MIG partitioning, and instead use mig-parted:
https://github.com/NVIDIA/mig-parted
It handles all of the edge cases where trying to use nvidia-smi directly for MIG management breaks down.
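As a rough illustration of how mig-parted can also check a node after a reboot, the sketch below wraps its assert subcommand. verify_mig_config and the MIG_PARTED variable are hypothetical names introduced here, and the config file and config name passed in are whatever you applied earlier.

```shell
#!/usr/bin/env bash
# MIG_PARTED may point at a different binary; it defaults to the real tool.
MIG_PARTED="${MIG_PARTED:-nvidia-mig-parted}"

# Return 0 if the live MIG state matches the named config, non-zero otherwise.
# Returns 2 when the tool itself cannot be found.
verify_mig_config() {
  local file="$1" name="$2"
  if ! command -v "$MIG_PARTED" >/dev/null 2>&1; then
    echo "$MIG_PARTED not found" >&2
    return 2
  fi
  "$MIG_PARTED" assert -f "$file" -c "$name"
}
```

For example, `verify_mig_config /tmp/mig-config.yaml all-3g.20gb || echo "MIG layout differs from config"` (with a hypothetical path and config name) could run from a post-boot health check.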
Thank you @klueska. I will now continue working on enabling MIG for the A100, and here are the outputs:
# nvidia-smi -mig 1 # for all gpu
request to connect with running Fabric Manager instance failed with error 111
failed to connect to Fabric Manager instance.
Warning: MIG mode is in pending enable state for GPU 00000000:07:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:07:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:0F:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:0F:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:47:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:47:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:4E:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:4E:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:87:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:87:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:90:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:90:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:B7:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:B7:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:BD:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:BD:00.0
The nvidia-smi output after I ran nvidia-smi -mig 1 and nvidia-smi --gpu-reset:
nvidia-smi
Mon Nov 28 01:34:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03 Driver Version: 450.203.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:07:00.0 Off | On |
| N/A 29C P0 51W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB Off | 00000000:0F:00.0 Off | On |
| N/A 27C P0 50W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB Off | 00000000:47:00.0 Off | On |
| N/A 28C P0 45W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB Off | 00000000:4E:00.0 Off | On |
| N/A 27C P0 45W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB Off | 00000000:87:00.0 Off | On |
| N/A 33C P0 48W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB Off | 00000000:90:00.0 Off | On |
| N/A 32C P0 47W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB Off | 00000000:B7:00.0 Off | On |
| N/A 32C P0 53W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB Off | 00000000:BD:00.0 Off | On |
| N/A 33C P0 54W / 400W | 22MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 1 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 2 0 1 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 1 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 2 0 1 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 1 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
after reboot, the output of nvidia-smi
# nvidia-smi
Mon Nov 28 01:46:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03 Driver Version: 450.203.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 31C P0 57W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 29C P0 57W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:47:00.0 Off | 0 |
| N/A 29C P0 51W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:4E:00.0 Off | 0 |
| N/A 29C P0 50W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB On | 00000000:87:00.0 Off | 0 |
| N/A 35C P0 54W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB On | 00000000:90:00.0 Off | 0 |
| N/A 33C P0 53W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | 0 |
| N/A 33C P0 56W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | 0 |
| N/A 33C P0 59W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I ran nvidia-smi -mig 1. Now I want to disable MIG with nvidia-smi -mig 0, but I get "Warning: MIG mode is in pending disable state for GPU", and nvidia-smi -L still lists MIG devices.
Thanks for your help on how to enable MIG.
Hi team, I am setting up MIG on an A100 GPU but am blocked at the very first step, which is enabling MIG on the machine. Could you please help look at this configuration problem when you have time? Appreciated!
Here is the initial nvidia-smi output:
Every time I configure the GPUs as shown below and then reboot, the MIG configuration reverts to the initial configuration.
nvidia-smi -a