NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Question about MIG config persistent #343

Open slow-zhang opened 2 years ago

slow-zhang commented 2 years ago

Hi team, I am setting up MIG on A100 GPUs but am blocked at the very first step, which is enabling MIG on the machine. Could you please look at this configuration problem when you have time? Much appreciated!

Here is the initial nvidia-smi output:

image

Every time I configure the GPUs as shown below and then reboot, the MIG configuration reverts to the initial state.

image

nvidia-smi -a

Timestamp                                 : Thu Nov  3 05:46:34 2022
Driver Version                            : 450.203.03
CUDA Version                              : 11.0

Attached GPUs                             : 8
GPU 00000000:07:00.0
    Product Name                          : A100-SXM4-40GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1562021001122
    GPU UUID                              : GPU-36c9561a-02ef-5c4d-3733-f0a56c42dc45
    Minor Number                          : 0
    VBIOS Version                         : 92.00.45.00.03
    MultiGPU Board                        : No
    Board ID                              : 0x700
    GPU Part Number                       : 692-2G506-0200-002
    Inforom Version
        Image Version                     : G506.0200.00.04
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x07
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:07:00.0
        Sub System Id                     : 0x134F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
...

    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 31 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        Memory Current Temp               : 30 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 57.13 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 1215 MHz
        Video                             : 585 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

klueska commented 2 years ago

Individual MIG configurations do not survive a node reboot (only the overall MIG mode setting of a GPU).

It is therefore recommended to configure MIG using the nvidia-mig-parted tool and its complementary systemd service to ensure MIG configurations are reapplied across a reboot.

https://github.com/NVIDIA/mig-parted/tree/main/deployments/systemd
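
For readers following along, a minimal sketch of what a mig-parted setup could look like. The file path and the config name `all-3g.20gb` are illustrative assumptions, not taken from this thread; the YAML layout follows the example config in the mig-parted repository. The script only writes the file unless `nvidia-mig-parted` is actually installed.

```shell
#!/bin/sh
# Sketch: write a mig-parted config file and apply it if the tool is present.
# Assumptions: CONFIG_FILE location and the "all-3g.20gb" name are hypothetical.
CONFIG_FILE="${CONFIG_FILE:-./mig-config.yaml}"

cat > "$CONFIG_FILE" <<'EOF'
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF

if command -v nvidia-mig-parted >/dev/null 2>&1; then
    # Apply the named configuration to every GPU on the node.
    sudo nvidia-mig-parted apply -f "$CONFIG_FILE" -c all-3g.20gb
else
    echo "nvidia-mig-parted not found; wrote $CONFIG_FILE only"
fi
```

The systemd deployment linked above is what would reapply a chosen configuration at boot; this sketch only covers a one-off apply.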

slow-zhang commented 2 years ago

thank you @klueska

slow-zhang commented 2 years ago

Does "only the overall MIG mode setting of a GPU" mean the setting made by nvidia-smi -mig 1?

klueska commented 2 years ago

Yes, that survives a node reboot. The individual configurations made after MIG mode is enabled do not.
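
To make the two levels concrete, here is a small dry-run sketch that prints (but does not run) the commands for each level. The helper name `mig_plan` is hypothetical, and the profile ID 9 is assumed to correspond to 3g.20gb on an A100-40GB; IDs can differ by GPU and driver, so check `nvidia-smi mig -lgip` on the target machine first.

```shell
#!/bin/sh
# Print the two levels of MIG setup for one GPU without executing them.
# $1: GPU index
# $2: comma-separated GPU instance profile IDs (assumption: 9 = 3g.20gb on A100-40GB)
mig_plan() {
    gpu="$1"
    profiles="$2"
    # Level 1: MIG mode. Per the answer above, this is expected to survive a reboot.
    echo "nvidia-smi -i $gpu -mig 1"
    # Level 2: GPU instances plus default compute instances. These are lost on reboot.
    echo "nvidia-smi mig -i $gpu -cgi $profiles -C"
}

mig_plan 0 9,9
```

Only the second printed command would need to be repeated after each reboot, which is the part mig-parted's systemd service is meant to automate.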

slow-zhang commented 2 years ago

this is cool! thank you Kevin, let me have a try

slow-zhang commented 2 years ago

This is strange: I ran the following commands, but MIG is still disabled on all GPUs after the reboot.

  196  2022-11-03 10:18:25 nvidia-smi
  198  2022-11-03 10:18:49 nvidia-smi -mig 1
  199  2022-11-03 10:19:00 sudo reboot
  200  2022-11-03 10:19:04 exit
  201  2022-11-03 10:27:26 nvidia-smi

After the reboot, the status is:

root@lasdgx12:~# nvidia-smi
Thu Nov  3 10:27:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03   Driver Version: 450.203.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   29C    P0    56W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   29C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   30C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   37C    P0    59W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   34C    P0    55W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   36C    P0    60W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

klueska commented 2 years ago

What did nvidia-smi show after running nvidia-smi -mig 1 but before the reboot?

slow-zhang commented 2 years ago

The nvidia-smi -mig 1 output said I can reboot to make it take effect, and nvidia-smi showed the MIG mode as enabled with a * next to it.


klueska commented 2 years ago

Sorry, your response is garbled, so I don't quite understand. Can you paste the exact output of:

nvidia-smi -mig 1

followed by the exact output of

nvidia-smi

before rebooting
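
A compact way to capture the same information might be to query the current and pending MIG mode per GPU. This is a sketch that assumes the installed nvidia-smi supports the `mig.mode.current` and `mig.mode.pending` query fields; the helper name `report_pending_mig` is hypothetical.

```shell
#!/bin/sh
# Read "index, current, pending" CSV lines on stdin and report every GPU
# whose pending MIG mode differs from its current one.
report_pending_mig() {
    awk -F', *' '$2 != $3 {
        printf "GPU %s: current=%s pending=%s (reset or reboot needed)\n", $1, $2, $3
    }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
        --format=csv,noheader | report_pending_mig
fi
```

Running it once before and once after the reboot would show whether the pending state was ever applied.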

klueska commented 2 years ago

In any case, as I mentioned before, I would recommend against using nvidia-smi to do your MIG partitioning, and instead use mig-parted: https://github.com/NVIDIA/mig-parted

It handles all of the edge cases where trying to use nvidia-smi directly for MIG management breaks down.

slow-zhang commented 2 years ago

Thank you @klueska. I will now continue working on enabling MIG for the A100, and here are the outputs:

# nvidia-smi -mig 1 # for all gpu
request to connect with running Fabric Manager instance failed with error 111
failed to connect to Fabric Manager instance.
Warning: MIG mode is in pending enable state for GPU 00000000:07:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:07:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:0F:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:0F:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:47:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:47:00.0

Warning: MIG mode is in pending enable state for GPU 00000000:4E:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:4E:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:87:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:87:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:90:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:90:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:B7:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:B7:00.0
Warning: MIG mode is in pending enable state for GPU 00000000:BD:00.0:Timeout
Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:BD:00.0
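
As an aside for anyone reproducing this, the warnings above can be reduced to the list of affected PCI bus IDs. The sketch below assumes the warning text has exactly the form shown in this output; the helper name `pending_gpus` is hypothetical, and the loop only prints the reset commands rather than running them.

```shell
#!/bin/sh
# Extract the PCI bus ID from each "pending enable state" warning on stdin.
pending_gpus() {
    sed -n 's/^Warning: MIG mode is in pending enable state for GPU \([0-9A-Fa-f:.]*\):.*$/\1/p'
}

# Example with one warning line copied from the output above.
printf '%s\n' \
    'Warning: MIG mode is in pending enable state for GPU 00000000:07:00.0:Timeout' |
    pending_gpus |
    while read -r bus_id; do
        # Print the per-GPU reset that the warning itself suggests.
        echo "nvidia-smi --gpu-reset -i $bus_id"
    done
```

In real use, the input would come from `nvidia-smi -mig 1 2>&1` instead of the hard-coded line.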

The nvidia-smi output after I ran nvidia-smi -mig 1 and nvidia-smi --gpu-reset:

nvidia-smi
Mon Nov 28 01:34:41 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03   Driver Version: 450.203.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:07:00.0 Off |                   On |
| N/A   29C    P0    51W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      Off  | 00000000:0F:00.0 Off |                   On |
| N/A   27C    P0    50W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      Off  | 00000000:47:00.0 Off |                   On |
| N/A   28C    P0    45W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      Off  | 00000000:4E:00.0 Off |                   On |
| N/A   27C    P0    45W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      Off  | 00000000:87:00.0 Off |                   On |
| N/A   33C    P0    48W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      Off  | 00000000:90:00.0 Off |                   On |
| N/A   32C    P0    47W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      Off  | 00000000:B7:00.0 Off |                   On |
| N/A   32C    P0    53W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      Off  | 00000000:BD:00.0 Off |                   On |
| N/A   33C    P0    54W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    1   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    2   0   1  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    1   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   1  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    1   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

After the reboot, the output of nvidia-smi:

# nvidia-smi
Mon Nov 28 01:46:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.203.03   Driver Version: 450.203.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   29C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   29C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    50W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   35C    P0    54W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   33C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   33C    P0    56W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    59W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

riverind commented 4 months ago

nvidia-smi -mig 1

I want to disable MIG with nvidia-smi -mig 0, but I get: Warning: MIG mode is in pending disable state for GPU

nvidia-smi -L also still lists MIG devices.

Thanks for any help with how to do this.