NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

dcgmi diag multiple tests skipped #139

Closed: disjustin closed this issue 5 months ago

disjustin commented 7 months ago

I am encountering an issue where multiple diagnostic tests return a 'Skip' status, and I'm not sure how to debug it. I am testing on an NVIDIA T400 4GB system running CentOS 7. This might be related to #120, but I am not passing any additional test parameters.

dcgmi diag -r 3 -j

dcgmi_diag_r3.json

{
    "DCGM GPU Diagnostic" : 
    {
        "test_categories" : 
        [
            {"..."},
            {
                "category" : "Integration",
                "tests" : 
                [
                    {
                        "name" : "PCIe",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    }
                ]
            },
            {
                "category" : "Hardware",
                "tests" : 
                [
                    {
                        "name" : "GPU Memory",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    },
                    {
                        "name" : "Diagnostic",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    },
                    {
                        "name" : "EUD Test",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    }
                ]
            },
            {
                "category" : "Stress",
                "tests" : 
                [
                    {
                        "name" : "Targeted Stress",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    },
                    {
                        "name" : "Targeted Power",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    },
                    {
                        "name" : "Memory Bandwidth",
                        "results" : 
                        [
                            {
                                "gpu_id" : "0",
                                "status" : "Skip"
                            }
                        ]
                    }
                ]
            }
        ]
    },
    "Driver Version Detected" : "545.23.08",
    "GPU Device IDs" : 
    [
        "1ff2"
    ],
    "GPU Device Serials" : 
    {
        "0" : "1423923001498"
    },
    "version" : "3.3.1"
}

diag_r3_2.log filtered ERROR output:

2023-12-07 17:09:52.612 ERROR [73032:73032] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-12-07 17:09:52.612 ERROR [73032:73032] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-12-07 17:09:52.864 ERROR [73032:73032] skus is not a sequence; ignoring. Position: -1,-1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:329] [DcgmNs::Nvvs::ConfigFileParser_v2::ParseYaml]
2023-12-07 17:09:53.002 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.003 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libTargetedStress.so: ./libTargetedStress.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.013 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libPulseTest.so: ./libPulseTest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.014 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libEud.so: ./libEud.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.014 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libMemory.so: ./libMemory.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.015 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libContextCreate.so: ./libContextCreate.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.015 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libMemtest.so: ./libMemtest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.015 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.016 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libPcie.so: ./libPcie.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.021 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libTargetedPower.so: ./libTargetedPower.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:09:53.027 ERROR [73032:73032] Couldn't load a definition for ShutdownPlugin in plugin libMemoryBandwidth.so: ./libMemoryBandwidth.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:00.915 ERROR [73043:73043] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-12-07 17:10:00.915 ERROR [73043:73043] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-12-07 17:10:01.078 ERROR [73043:73043] skus is not a sequence; ignoring. Position: -1,-1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:329] [DcgmNs::Nvvs::ConfigFileParser_v2::ParseYaml]
2023-12-07 17:10:01.195 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.195 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libTargetedStress.so: ./libTargetedStress.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.202 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libPulseTest.so: ./libPulseTest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.202 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libEud.so: ./libEud.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.202 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libMemory.so: ./libMemory.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.202 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libContextCreate.so: ./libContextCreate.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.203 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libMemtest.so: ./libMemtest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.203 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.203 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libPcie.so: ./libPcie.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.206 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libTargetedPower.so: ./libTargetedPower.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 17:10:01.211 ERROR [73043:73043] Couldn't load a definition for ShutdownPlugin in plugin libMemoryBandwidth.so: ./libMemoryBandwidth.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
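The log above repeats the same errors for two processes (73032 and 73043), which makes the distinct problems hard to spot. The following is a minimal sketch that collapses such lines into unique messages with counts. It assumes every line follows the layout seen above (timestamp, `ERROR`, `[pid:tid]`, message, `[file:line]`, `[function]`); the names `LINE_RE` and `summarize_errors` are illustrative.

```python
import re
from collections import Counter

# Layout assumed from the log above:
# <date> <time> ERROR [pid:tid] <message> [<file>:<line>] [<function>]
LINE_RE = re.compile(
    r"^\S+ \S+ ERROR \[\d+:\d+\] (?P<msg>.*?) \[[^\[\]]+:\d+\] \[[^\[\]]+\]$"
)


def summarize_errors(lines):
    """Count how often each distinct ERROR message appears."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts


# Three lines copied from the log above.
log_lines = [
    "2023-12-07 17:09:52.612 ERROR [73032:73032] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]",
    "2023-12-07 17:09:52.864 ERROR [73032:73032] skus is not a sequence; ignoring. Position: -1,-1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:329] [DcgmNs::Nvvs::ConfigFileParser_v2::ParseYaml]",
    "2023-12-07 17:10:00.915 ERROR [73043:73043] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]",
]

for message, count in summarize_errors(log_lines).most_common():
    print(f"{count}x {message}")
```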

dcgmi -v

Version : 3.3.1
Build ID : 2
Build Date : 2023-11-17
Build Type : Release
Commit ID : bae0ca1b47cce26af80b9a79defb26b0d36239f8
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 84bced15c8f0835222021abf0f88ab83

Hostengine build info:
Version : 3.3.0
Build ID : 1
Build Date : 2023-10-06
Build Type : Release
Commit ID : c62077bb678b2fa7d84235f290371c84044d112f
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 439751dad784a082d463a02821eaaf6a
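The output above reports version 3.3.1 for dcgmi and 3.3.0 for the hostengine. Whether that difference has anything to do with the skipped tests is not established in this thread, but it is easy to check for automatically. The following is a minimal sketch that splits the `dcgmi -v` text into its two blocks, assuming the `Key : Value` layout and the `Hostengine build info:` separator shown above; the function name `parse_version_blocks` is illustrative.

```python
def parse_version_blocks(text):
    """Split `dcgmi -v` output into 'dcgmi' and 'hostengine' key/value dicts."""
    blocks = {"dcgmi": {}}
    current = blocks["dcgmi"]
    for line in text.splitlines():
        line = line.strip()
        if line == "Hostengine build info:":
            # Everything after this marker belongs to the hostengine block.
            current = blocks.setdefault("hostengine", {})
            continue
        key, sep, value = line.partition(" : ")
        if sep:
            current[key.strip()] = value.strip()
    return blocks


# Excerpt of the output above.
version_text = """Version : 3.3.1
Build Date : 2023-11-17

Hostengine build info:
Version : 3.3.0
Build Date : 2023-10-06
"""

blocks = parse_version_blocks(version_text)
client = blocks["dcgmi"].get("Version")
engine = blocks.get("hostengine", {}).get("Version")
if client != engine:
    print(f"dcgmi is {client} but the hostengine is {engine}")
```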

Thank you to anyone who can help!

nikkon-dev commented 7 months ago

@disjustin,

What is in the file /usr/share/nvidia-validation-suite/diag-skus.yaml?

disjustin commented 7 months ago

@nikkon-dev The file does not exist
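A quick way to confirm the state of the file named in the log is a small check like the one below. This is a minimal sketch; the helper name `describe_config` and its return strings are illustrative, and it only reports whether the path exists, not whether its contents are valid.

```python
from pathlib import Path

# Path taken from the "bad file" error in the log above.
DIAG_SKUS = "/usr/share/nvidia-validation-suite/diag-skus.yaml"


def describe_config(path):
    """Report whether a config file is missing, empty, or present."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if not p.is_file():
        return "not a regular file"
    if p.stat().st_size == 0:
        return "empty"
    return "present"


print(f"{DIAG_SKUS}: {describe_config(DIAG_SKUS)}")
```

If the result is "missing", the first ERROR line in the log suggests checking that the config package it names is installed.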

Rcarballo2222 commented 5 months ago

@disjustin Were you able to resolve this issue? I'm running into a similar problem where all the tests in the Stress category are skipped.

disjustin commented 5 months ago

@Rcarballo2222 I had borrowed this system temporarily, so I can no longer confirm. You may want to check out #120 or #136.