Azure / azhpc-diagnostics

Scripts that run on Azure VMs and gather a variety of diagnostic information to debug common issues with VMs, GPUs, and Infiniband.
MIT License
https://aka.ms/hpcdiag redirects to this repo.

OS Versions / Build by VM Size Family (each cell names the VM size that carries a build status badge):

| Linux | ND | NC | HBv2 | HB | HC | H |
|---|---|---|---|---|---|---|
| Ubuntu 18.04 | NDv2 | NCv2 | HBv2 | | HC44 | |
| Ubuntu 16.04 | NDv1 | NCv1 | | | | H |
| CentOS 8.1 | NDv2 | | HBv2 | HB | | |
| CentOS 7.8 | | NCv3 | | | HC | |
| CentOS 7.7 | NDv2 | | | HB | | |
| CentOS 7.6 | NDv1 | NCv2 | HBv2 | | | |
| CentOS 7.4 | | NCv1 | | | HC | H |
| RHEL 8.2 | | NCv3 | | | HC | |
| RHEL 8.1 | | NCv2 | | | | H |
| RHEL 7.8 | NDv2 | | HBv2 | | | |
| RHEL 7.7 | | NCv2 | | HB | | |
| RHEL 7.6 | NDv2 | | | | HC | |
| RHEL 7.5 | | NCv1 | | | | H |
| RHEL 7.4 | NDv1 | | HBv2 | | | |

Overview

This repo holds a script that, when run on an Azure VM, gathers a variety of diagnostic information for diagnosing common HPC, Infiniband, and GPU problems. It runs a suite of diagnostic tools ranging from built-in Linux tools like lscpu to vendor-specific CLIs like nvidia-smi. The resulting information is packaged into a tarball so that it can be shared with support engineers to speed up the troubleshooting process.

If you are reading this, you are likely troubleshooting problems on an Azure HPC VM. In that case, we suggest that you contact support if you have not already, then run this tool on your VM so that you can provide the output to support engineers when prompted.

If you have special privacy requirements concerning logs leaving your VM, make sure to open up the tarball and redact any sensitive information before re-tarring it and handing it off to support engineers.
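
The unpack, redact, and re-tar steps described above could look roughly like the sketch below. The function name `redact_tarball`, the `.redacted.tar.gz` suffix, and the use of a `sed` expression for redaction are assumptions made for illustration; they are not part of this tool. The sketch skips compressed files inside the tarball (such as `nvidia-bug-report.log.gz`), so those would still need manual review.

```shell
# Hypothetical helper (not part of this repo): unpack a diagnostics tarball,
# apply a sed redaction to every uncompressed file inside, and write a new
# tarball next to the original.
# Usage: redact_tarball <tarball.tar.gz> <sed-expression>
redact_tarball() {
  local tarball="$1" sed_expr="$2"
  local workdir out
  workdir="$(mktemp -d)" || return 1
  out="${tarball%.tar.gz}.redacted.tar.gz"

  # Unpack into a scratch directory so the original tarball stays untouched.
  tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }

  # Redact in place (GNU sed); compressed members are left as they are.
  find "$workdir" -type f ! -name '*.gz' ! -name '*.zip' \
    -exec sed -i -e "$sed_expr" {} +

  # Re-pack with the same relative layout and report the new file name.
  tar -czf "$out" -C "$workdir" . || { rm -rf "$workdir"; return 1; }
  rm -rf "$workdir"
  echo "$out"
}

# Example (hypothetical file name and pattern):
#   redact_tarball ./myvm.2020-01-01.tar.gz 's/my-internal-hostname/REDACTED/g'
```

Inspect the redacted tarball before sharing it; a single `sed` pattern is unlikely to catch every sensitive value.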

Warning

This tool is meant for diagnosing inactive systems. It runs benchmarks that stress system devices such as memory, GPUs, and Infiniband, so it will degrade the performance of, or otherwise interfere with, other active processes that use these resources. Running this tool on systems where other jobs are currently running is not advised.

To stop the tool while it is running, interrupt the process (for example, with Ctrl-C). This forces it to reset system state and terminate.
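
One way to act on the warning above is to check that the VM looks idle before starting the tool. The sketch below is only a coarse heuristic and is not part of this repo: the function name `system_looks_idle` and the threshold of 1.0 are assumptions, and a low CPU load average does not guarantee that no GPU or Infiniband job is active.

```shell
# Hypothetical pre-flight check: succeed only if the 1-minute load average
# is below a threshold. The default threshold of 1.0 is an arbitrary assumption.
# Usage: system_looks_idle [max_load] [loadavg_file]
system_looks_idle() {
  local max_load="${1:-1.0}" loadavg_file="${2:-/proc/loadavg}"
  # The first field of /proc/loadavg is the 1-minute load average.
  awk -v max="$max_load" '{ exit !($1 < max) }' "$loadavg_file"
}

if system_looks_idle; then
  echo "Load is low; the VM looks idle."
else
  echo "Load is high or unknown; other jobs may be running." >&2
fi
```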

Install and Run

After cloning this repo, no further installation is required. To run the script, run the following command, replacing {repo-root} with the name of this repo's directory on your VM:

sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh

PerfInsights for Linux Integration

Alternatively, a version of this tool is included in PerfInsights for Linux under the HPC scenario. Running this scenario directly from the Azure Portal is not supported at this time, so PerfInsights must be downloaded and run from the command line, but the results of this tool are included in the report generated.

Usage

This section describes the output of the script and the configuration options available.

Options

| Option (Short) | Option (Long) | Parameters | Description | Example | Example Description |
|---|---|---|---|---|---|
| -d | --dir | Directory Name | Specify custom output location | --dir=. | Put the tarball in the current directory |
| -V | --version | | display version information and exit | --version | Outputs 0.0.1 |
| -h | --help | | display help text | -h | Outputs the help message |
| -v | --verbose | | verbose output | --verbose | Enables more verbose terminal output |
| | --gpu-level | 1 (default), 2, or 3 | GPU diagnostics run-level | --gpu-level=3 | Sets dcgmi run-level to 3 |
| | --mem-level | 0 (default) or 1 | Memory diagnostics run-level | --mem-level=1 | Enables stream benchmark test |
| | --no-update | | Disables auto-update | --no-update | Refrains from checking for updates to the script |
| | --offline | | Prevents internet access | --offline | Skips stream benchmark and lsvmbus if not installed |
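
As an illustration of how the options above combine, the sketch below wraps two invocations in shell functions. The function names and the default value of `REPO_ROOT` are assumptions; only the option names and the script path come from this README. Whether `--offline` already implies `--no-update` is not stated here, so the second function passes both.

```shell
# REPO_ROOT is a placeholder for wherever this repo was cloned (assumed default).
REPO_ROOT="${REPO_ROOT:-$HOME/azhpc-diagnostics}"
HPCDIAG="$REPO_ROOT/Linux/src/gather_azhpc_vm_diagnostics.sh"

# Most thorough run: GPU run-level 3, stream benchmark enabled, verbose output.
# The tarball is written to the directory given as the first argument.
hpcdiag_deep() {
  sudo bash "$HPCDIAG" --dir="$1" --gpu-level=3 --mem-level=1 --verbose
}

# Run without internet access and without checking for script updates.
hpcdiag_offline() {
  sudo bash "$HPCDIAG" --dir="$1" --offline --no-update
}

# Example:
#   hpcdiag_deep .        # put the tarball in the current directory
#   hpcdiag_offline /tmp
```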

Tarball Structure

Note that not all of these files will be generated on every run. What appears below is the union of all files that could be generated; which ones appear depends on script parameters and VM size:

{vm-id}.{timestamp}.tar.gz
|-- transcript.log (logs for the tool itself)
|-- hpcdiag.err (stderr output from the run, including set -x trace)
|-- VM
|   -- dmesg.log
|   -- waagent.log
|   -- lspci.txt
|   -- lsvmbus.log
|   -- ipconfig.txt
|   -- sysctl.txt
|   -- uname.txt
|   -- dmidecode.txt
|   -- lsmod.txt
|   -- journald.log|syslog|messages
|   -- services
|   -- selinux
|   -- hyperv/kvp_pool*.txt
|-- CPU
|   -- lscpu.txt
|   -- ulimit
|   -- zone_reclaim_mode
|-- Memory
|   -- stream.txt
|-- Infiniband
|   -- ib-vmext.log
|   -- ibstat.out
|   -- ibstatus.out
|   -- ibv_devinfo.out
|   -- pkeys/*
|   -- ethtool.out (ENDURE)
|   -- rate (ENDURE)
|   -- state (ENDURE)
|   -- phys_state (ENDURE)
|-- Nvidia
    -- nvidia-bug-report.log.gz
    -- nvidia-installer.log
    -- nvidia-vmext.log
    -- nvidia-smi.out
    -- nvidia-smi-q.out
    -- nvidia-smi-nvlink.out
    -- nvidia-debugdump.zip (only Nvidia can read)
    -- dcgm-diag-2.log
    -- dcgm-diag-3.log
    -- nvvs.log
    -- stats_*.json
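
To look inside a tarball without unpacking it to disk, something like the following sketch may help. The helper names `hpcdiag_list` and `hpcdiag_show` are hypothetical, and the exact member paths inside a real tarball (for example, whether they carry a leading directory) are not documented here, which is why the lookup matches on a substring of the path.

```shell
# List every file in a diagnostics tarball.
hpcdiag_list() {
  tar -tzf "$1"
}

# Print the first member whose path contains the given fixed string,
# for example "VM/dmesg.log".
hpcdiag_show() {
  local tarball="$1" name="$2" member
  member="$(tar -tzf "$tarball" | grep -F -- "$name" | head -n 1)"
  [ -n "$member" ] || { echo "no file matching '$name' in $tarball" >&2; return 1; }
  # -O writes the extracted member to stdout instead of to disk.
  tar -xzf "$tarball" -O "$member"
}

# Example (hypothetical file name):
#   hpcdiag_list ./myvm.2020-01-01.tar.gz
#   hpcdiag_show ./myvm.2020-01-01.tar.gz CPU/lscpu.txt
```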

Diagnostic Tools Table

| Tool | Command | Output File(s) | Description | EULA |
|---|---|---|---|---|
| dmesg | dmesg | VM/dmesg.log | Dump of kernel ring buffer | |
| rsyslog | cp syslog or messages | VM/syslog or VM/messages | Dump of system log | |
| journald | journalctl | VM/journald.log | Dump of system log | |
| Azure IMDS | curl http://169.254.169.254/metadata/... | transcript.log | VM Metadata (ID, Region, OS Image, etc.) | |
| Azure VM Agent | cp /var/log/waagent.log | VM/waagent.log | Logs from the Azure VM Agent | |
| lspci | lspci | VM/lspci.txt | Info on installed PCI devices | |
| lsvmbus | lsvmbus | VM/lsvmbus.log | Displays devices attached to the Hyper-V VMBus | |
| Hyper-V KVP | custom-made | VM/hyperv/kvp_pool*.txt | Exposes certain Windows Registry data from the Azure Host | |
| ipconfig | ipconfig | VM/ipconfig.txt | Checking TCP/IP configuration | |
| sysctl | sysctl | VM/sysctl.txt | Checking kernel parameters | |
| uname | uname | VM/uname.txt | Checking system information | |
| systemd | systemctl | VM/services | Checking for certain active services (tuning only) | |
| selinux | cp /etc/sysconfig/selinux | VM/selinux | Checking for selinux activity (tuning only) | |
| ulimit | cp /etc/security/limits.conf | Memory/ulimit | Checking for default user resource limits (tuning only) | |
| - | cp /proc/sys/vm/zone_reclaim_mode | Memory/zone_reclaim_mode | Checking NUMA memory reclamation policy (tuning only) | |
| dmidecode | dmidecode | VM/dmidecode.txt | DMI table dump (info on hardware components) | |
| lsmod | lsmod | VM/lsmod.txt | List of active kernel modules | |
| lscpu | lscpu | CPU/lscpu.txt | Information about the system CPU architecture | |
| stream | stream_zen_double | Memory/stream.txt | The stream benchmark suite (AMD only) | Stream License |
| ibstat | ibstat | Infiniband/ibstat.out | Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibstatus | ibstatus | Infiniband/ibstatus.out | Lightweight Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
| ibv_devinfo | ibv_devinfo | Infiniband/ibv_devinfo.out | Mellanox OFED command for checking Infiniband device info | MOFED End-User Agreement |
| Partition Key | cp /sys/class/infiniband/.../pkeys/... | Infiniband/.../pkeys/... | Checks the configured Infiniband Partition Keys | |
| Infiniband Driver Extension Logs | cp /var/log/azure/ib-vmext-status | Infiniband/ib-vmext-status | Logs from the Infiniband Driver Extension | |
| ethtool | ethtool eth1 | Infiniband/ethtool.out | Status of IB interface on ENDURE VMs | |
| sysfs | cp /sys/class/infiniband/... | Infiniband/rate, state, phys_state | Status of IB interface on ENDURE VMs | |
| NVIDIA Bug Report | nvidia-bug-report.sh | Nvidia/nvidia-bug-report.log.gz | A script that Nvidia has customers run when reporting hardware problems | CUDA EULA, GRID EULA |
| NVIDIA System Management Interface | nvidia-smi | Nvidia/nvidia-smi.out, Nvidia/nvidia-smi-q.out, Nvidia/nvidia-smi-nvlink.out | Checks GPU health and configuration | CUDA EULA, GRID EULA |
| NVIDIA Debug Dump | nvidia-debugdump | Nvidia/nvidia-debugdump.zip | Generates a binary blob for use with Nvidia internal engineering tools | CUDA EULA, GRID EULA |
| NVIDIA Data Center GPU Manager | dcgmi | Nvidia/dcgm-diag-2.log, Nvidia/dcgm-diag-3.log, Nvidia/nvvs.log, Nvidia/stats_*.json | Health monitoring for GPUs in cluster environments | DCGM EULA |
| GPU Driver Extension Logs | cp /var/log/azure/nvidia-vmext-status | Nvidia/nvidia-vmext-status | Logs from the GPU Driver Extension | |
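
Several of the tools in the table can also be run by hand for a quick spot check. The sketch below runs a few unprivileged ones and mirrors the output file names from the table. The helper names are hypothetical, and the flags shown (such as `uname -a`) are guesses; the table does not say which arguments the script itself passes.

```shell
# Hypothetical helper: run a command if it is installed and save its output.
# Usage: save_output <file> <command> [args...]
save_output() {
  local file="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" > "$file" 2>&1 || echo "warning: $1 exited with an error" >&2
  else
    echo "skipping $1 (not installed)" >&2
  fi
}

# Collect a small subset of the table's tools under the given directory.
collect_basic() {
  local dir="$1"
  mkdir -p "$dir/VM" "$dir/CPU" || return 1
  save_output "$dir/VM/uname.txt" uname -a   # flag is an assumption
  save_output "$dir/VM/lsmod.txt" lsmod
  save_output "$dir/CPU/lscpu.txt" lscpu
}

# Example:
#   collect_basic ./spot-check
```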

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.