NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0
911 stars 157 forks

Extracting errors and bugs in k8s environment #16

Open guleng opened 3 years ago

guleng commented 3 years ago

I have a pod in status Completed that was using a GPU card. The GPU usage shown by 'kubectl describe node gpu-178' differs from what the exporter reports. Apparently, dcgm-exporter still counts the cards of the completed pod.

glowkey commented 3 years ago

I don't think I understand the issue you are reporting. Are you expecting a different status for the dcgm-exporter pod?

guleng commented 3 years ago

@glowkey The statistics on which services are using GPU cards are wrong. I have multiple jobs that occupy GPU cards and release them when the task completes, but the exporter still reports those cards as occupied. The metric name is DCGM_FI_DEV_FB_FREE.
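For readers who want to check this metric directly, here is a minimal Python sketch that reads Prometheus text output and lists the GPUs whose free framebuffer is below the card total. It is not part of dcgm-exporter. The sample lines, the `gpu` and `UUID` label names, and the helper names are assumptions for illustration; only the MiB values are taken from the nvidia-smi output in this thread.

```python
import re

# One sample line looks like: NAME{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def fb_free_by_gpu(metrics_text, metric="DCGM_FI_DEV_FB_FREE", label="gpu"):
    """Return {label value: free framebuffer in MiB} for the given metric."""
    result = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        if label in labels:
            result[labels[label]] = float(m.group("value"))
    return result


def gpus_with_memory_in_use(free_by_gpu, total_mib):
    """List GPUs whose free framebuffer is below the assumed card total."""
    return sorted(g for g, free in free_by_gpu.items() if free < total_mib)


# Hypothetical exporter output; label names and values are assumptions.
SAMPLE = '''
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-aaaa"} 16280
DCGM_FI_DEV_FB_FREE{gpu="3",UUID="GPU-bbbb"} 15225
DCGM_FI_DEV_FB_USED{gpu="3",UUID="GPU-bbbb"} 1055
'''

if __name__ == "__main__":
    free = fb_free_by_gpu(SAMPLE)
    print(gpus_with_memory_in_use(free, total_mib=16280))
```

A GPU listed here only has framebuffer memory allocated on it. That alone does not say which pod, if any, currently owns the card.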

glowkey commented 3 years ago

@guleng nvidia-smi -q will also display the FB memory used/free. You can use that to verify the information displayed by the exporter. If they are different please include examples and outputs.
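To make that comparison less manual, the following Python sketch pulls the "FB Memory Usage" block for each GPU out of `nvidia-smi -q` text. It assumes the layout shown in the log in this thread (a `GPU <bus id>` header line followed by indented sections). The function name and the shortened sample are illustrative, and other driver versions may format the output differently.

```python
import re

# Header line of one GPU section, e.g. "GPU 00000000:3D:00.0"
GPU_HEADER_RE = re.compile(
    r'^GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s*$'
)
# Field line inside a memory section, e.g. "Used  : 1055 MiB"
FIELD_RE = re.compile(r'^\s*(Total|Used|Free)\s*:\s*(\d+) MiB\s*$')


def parse_fb_memory(smi_text):
    """Return {bus id: {"Total": MiB, "Used": MiB, "Free": MiB}}."""
    result = {}
    bus_id = None
    in_fb = False
    for line in smi_text.splitlines():
        header = GPU_HEADER_RE.match(line)
        if header:
            bus_id = header.group(1)
            result[bus_id] = {}
            in_fb = False
            continue
        if line.strip() == "FB Memory Usage":
            in_fb = True
            continue
        if in_fb:
            field = FIELD_RE.match(line)
            if field and bus_id is not None:
                result[bus_id][field.group(1)] = int(field.group(2))
            else:
                in_fb = False  # any other line ends the FB section
    return result


# Shortened excerpt in the same layout as the log in this thread.
SAMPLE = """\
GPU 00000000:3D:00.0
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
GPU 00000000:41:00.0
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 1055 MiB
        Free                              : 15225 MiB
"""

if __name__ == "__main__":
    for bus, mem in parse_fb_memory(SAMPLE).items():
        print(bus, mem)
```

The BAR1 figures are skipped on purpose, because the parser stops reading fields as soon as a line that is not Total, Used or Free follows the "FB Memory Usage" heading.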

guleng commented 3 years ago

When I checked GPU usage in k8s, I saw that only four cards were in use. [image] In dcgm-exporter, however, five cards show as in use. [image]

deepblue@gpu-178:~$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Sep 29 10:16:18 2021
Driver Version                            : 460.32.03
CUDA Version                              : 11.2

Attached GPUs                             : 8
GPU 00000000:3D:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619055326
    GPU UUID                              : GPU-8db102ce-42a0-03d6-f16b-a9c7f3c415f7
    Minor Number                          : 0
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x3d00
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x3D
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:3D:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 24 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.58 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:3E:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619055580
    GPU UUID                              : GPU-aac9cf08-dd4e-9f7b-7fa8-8e1b8ecaa566
    Minor Number                          : 1
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x3e00
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x3E
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:3E:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 23 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.82 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:40:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619054568
    GPU UUID                              : GPU-5d40d611-0236-e9ba-0abc-dd5a69a90596
    Minor Number                          : 2
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x4000
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x40
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:40:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 24 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 27.05 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:41:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0324518129938
    GPU UUID                              : GPU-e838377e-3604-e4ac-839d-362d51dc641d
    Minor Number                          : 3
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x4100
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x41
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:41:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 1055 MiB
        Free                              : 15225 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 31.17 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 1189 MHz
        SM                                : 1189 MHz
        Memory                            : 715 MHz
        Video                             : 1075 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 180997
            Type                          : C
            Name                          : python
            Used GPU Memory               : 1053 MiB

GPU 00000000:B1:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619055598
    GPU UUID                              : GPU-947d6215-69d2-5f3c-17a8-23e255c3d02f
    Minor Number                          : 4
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xb100
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xB1
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:B1:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 24 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 24.61 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:B2:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619055222
    GPU UUID                              : GPU-608e0214-1c9c-e52d-7559-a18e30803830
    Minor Number                          : 5
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xb200
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xB2
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:B2:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 24 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 26.56 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:B4:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619054891
    GPU UUID                              : GPU-358cc925-7017-36cc-5a44-25b679ff6453
    Minor Number                          : 6
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xb400
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xB4
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:B4:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 24 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 25.33 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:B5:00.0
    Product Name                          : Tesla P100-PCIE-16GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0321619054768
    GPU UUID                              : GPU-17375115-1b53-f390-4496-3f2ab536acb4
    Minor Number                          : 7
    VBIOS Version                         : 86.00.4D.00.01
    MultiGPU Board                        : No
    Board ID                              : 0xb500
    GPU Part Number                       : 900-2H400-0000-000
    Inforom Version
        Image Version                     : H400.0201.00.08
        OEM Object                        : 1.1
        ECC Object                        : 4.1
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xB5
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x15F810DE
        Bus Id                            : 00000000:B5:00.0
        Sub System Id                     : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16280 MiB
        Used                              : 0 MiB
        Free                              : 16280 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 2 MiB
        Free                              : 16382 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
        Aggregate
            Single Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
            Double Bit            
                Device Memory             : 0
                Register File             : 0
                L1 Cache                  : N/A
                L2 Cache                  : 0
                Texture Memory            : 0
                Texture Shared            : 0
                CBU                       : N/A
                Total                     : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 27 C
        GPU Shutdown Temp                 : 85 C
        GPU Slowdown Temp                 : 82 C
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 26.55 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 405 MHz
        SM                                : 405 MHz
        Memory                            : 715 MHz
        Video                             : 835 MHz
    Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Default Applications Clocks
        Graphics                          : 1189 MHz
        Memory                            : 715 MHz
    Max Clocks
        Graphics                          : 1328 MHz
        SM                                : 1328 MHz
        Memory                            : 715 MHz
        Video                             : 1328 MHz
    Max Customer Boost Clocks
        Graphics                          : 1328 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

The pod is in the Completed state, so it should not be counted as occupying a GPU card. [image]

This is probably caused by the exporter not checking the pod status when it reads the pod-to-GPU assignments from /var/lib/kubelet/pod-resources.

nikkon-dev commented 3 years ago

@guleng,

It's not clear what you want to change here.

The resource will be associated with a pod until the pod is terminated.

See some K8s sources for reference:

WBR, Nik

guleng commented 3 years ago

The issue is simple. When a job starts, it requests a GPU card and the card is associated with its pod; when the job finishes, the card should be released. Both the k8s cluster view and the nvidia-smi output show the card as free, but the metrics from dcgm-exporter still show the card as associated with the completed job.

My command to deploy dcgm-exporter is: kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

Could the cause be that the pod status is not checked when the metrics code obtains the pod and GPU card information here? [image]

nikkon-dev commented 3 years ago

Hi @guleng,

The dcgm-exporter acquires the information related to pod resources on every metrics request via the kubelet pod-resources API, that is, the socket under /var/lib/kubelet/pod-resources. We do not cache or store that information anywhere but the metrics' labels.
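Because the pod association lives only in the metric labels, it can be read straight off the exporter's text output. A minimal Python sketch of that check follows; the sample text, the label names (`gpu`, `pod`, `namespace`) and the pod name are assumptions made up for illustration, so adjust them to whatever your exporter actually emits.

```python
import re

# Hypothetical sample of exporter output; label names and values are assumptions.
SAMPLE = '''\
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="3",pod="train-job-abc",namespace="default"} 15227
DCGM_FI_DEV_FB_FREE{gpu="4"} 16280
'''

# Matches one key="value" pair inside the {...} label set.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_pod(text, metric="DCGM_FI_DEV_FB_FREE"):
    """Map GPU index -> pod name for samples of `metric` with a non-empty pod label."""
    result = {}
    for line in text.splitlines():
        # Skip comments and samples of other metrics.
        if not line.startswith(metric + "{"):
            continue
        label_text = line[line.index("{"):line.rindex("}")]
        labels = dict(LABEL_RE.findall(label_text))
        if labels.get("pod"):
            result[labels.get("gpu")] = labels["pod"]
    return result


print(gpus_with_pod(SAMPLE))
```

Running this against the text saved from the exporter's metrics endpoint, next to the `nvidia-smi -q` output, shows which GPUs are reported as owned by a pod even though no process is using them.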

K8s does not terminate a pod instantly once a container is completed. There might be another task that k8s could assign to the same pod in a short time, so k8s keeps the pod for a while. And resources are allocated for pods, not for containers. Until k8s decides to terminate a pod, the resources will remain allocated and assigned to that pod. I referenced the exact locations where k8s makes decisions on resource deallocation in a comment above.

There is nothing dcgm-exporter can do about this - we have to trust the information k8s returns us.

To emphasize this again: a completed job does not mean the resource (a GPU allocated to a pod) is freed. The resource is deallocated only when the pod is terminated.
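For illustration only, the check requested in this issue would amount to joining the pod-resources listing with pod phases taken from another source, since the listing itself carries pod names and device IDs but no phase. The Python sketch below is hypothetical: the `PodDevices` type, the function name and the sample IDs are invented here, and this is not what dcgm-exporter does.

```python
from dataclasses import dataclass, field


@dataclass
class PodDevices:
    """Simplified, hypothetical view of one pod entry in the pod-resources listing."""
    namespace: str
    name: str
    gpu_ids: list = field(default_factory=list)


# A pod shown as "Completed" by kubectl has phase "Succeeded".
FINISHED_PHASES = {"Succeeded", "Failed"}


def active_gpu_owners(pod_devices, phases):
    """Return {gpu_id: "namespace/name"}, skipping pods whose phase is finished.

    `phases` maps (namespace, name) -> phase string and has to be obtained
    separately; it is an assumption of this sketch, not part of the listing.
    """
    owners = {}
    for pd in pod_devices:
        if phases.get((pd.namespace, pd.name)) in FINISHED_PHASES:
            continue  # the pod still holds the allocation, but its work is done
        for gpu_id in pd.gpu_ids:
            owners[gpu_id] = f"{pd.namespace}/{pd.name}"
    return owners


listing = [
    PodDevices("default", "job-done", ["GPU-aaaa"]),
    PodDevices("default", "job-running", ["GPU-bbbb"]),
]
phases = {("default", "job-done"): "Succeeded", ("default", "job-running"): "Running"}
print(active_gpu_owners(listing, phases))
```

With the sample data, only the running pod is kept as a GPU owner. Without the phase lookup, both pods would be reported, which matches the behaviour described in this thread.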

WBR, Nik

daniel-hutao commented 10 months ago

@nikkon-dev Hey buddy, I followed the code link you provided earlier to trace the device release process in Kubelet, and it seems there are no issues.

However, I've encountered the same problem: the metric update is a few minutes slower than the resource release seen through kubectl. I want to know where this delay of several minutes is occurring. In short, when a pod is completed, the GPU resources are immediately visible as released through kubectl, but it takes more than 2 minutes to see the metric update through dcgm-exporter. If you know the reason, please let me know. Thanks a lot.