Automated installation of Ubuntu 22.04 on a GH200 system using a USB drive
Much of this is based on the official NVIDIA Ubuntu 22.04 Grace Installation Guide. If you have problems or want to install manually, please see that guide for more details. There is also a Grace Performance Tuning Guide that may be helpful.
This repo was created based on testing of the GH200 system (specifically the Supermicro ARS-111GL-NHR). The systems I'm testing on also have a single BlueField-3 installed (two BlueField-3 cards have also been tested), but this should not matter for the installation.
The contained scripts will perform the following actions:
linux-nvidia-64k-hwe-22.04
the default kernel (NOTE: NVIDIA requires MLNX-OFED to be installed BEFORE the GPU drivers for NCCL to work correctly)
irqbalance service (recommended by performance tuning guide)

There are multiple ways to create an installable Ubuntu 22.04 USB drive. I used Rufus on a Windows 11 system.
Before copying over the files, you'll need to customize the cidata/user-data file with your installation details.

Replace the following items:

| Item | Description |
|---|---|
| | Hostname of the system |
| | Generated SHA-512 hash (can generate with openssl passwd -6) |
| | Initial user name |
| | Address of network port in CIDR form (e.g. 10.1.1.10/24) |
| | Address of network gateway |
| | Address of DNS nameservers |
| | DNS search domain (e.g. my-domain.com) |
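The SHA-512 hash mentioned above can be produced non-interactively. The following is a minimal sketch, not part of the repo's scripts: the function names `make_password_hash` and `is_sha512_crypt` are hypothetical helpers, and the password and salt in the usage line are placeholders. It assumes OpenSSL 1.1.1 or newer, which provides the `-6` option.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from the repo.

# Print a SHA-512 crypt hash ($6$...) for the given password and salt.
# Note: passing the password as an argument exposes it in the process
# list; running `openssl passwd -6` with no arguments prompts instead.
make_password_hash() {
    local password="$1" salt="$2"
    openssl passwd -6 -salt "$salt" "$password"
}

# Return 0 if the string looks like a SHA-512 crypt hash, 1 otherwise.
is_sha512_crypt() {
    case "$1" in
        '$6$'*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example, `make_password_hash 'changeme' 'examplesalt'` prints a string beginning with `$6$examplesalt$`, which can then be pasted into cidata/user-data. Checking the value with `is_sha512_crypt` before pasting helps catch a plain-text password copied by mistake.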
After creating a bootable Ubuntu installation drive, copy the files from cidata to the USB drive:
1. Create a directory named cidata in the root of the Ubuntu USB drive.
2. Copy all files over to the cidata directory on the Ubuntu USB drive.
3. Update the boot/grub/grub.cfg file and add the following menuentry to the list:
menuentry "Install GH200 System (Requires Internet)" {
    set gfxpayload=keep
    linux /casper/vmlinuz quiet autoinstall 'ds=nocloud-net;s=file:///cdrom/cidata/'
    initrd /casper/initrd
}
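The copy and grub.cfg steps above can be scripted once the USB drive is mounted. This is a rough sketch under stated assumptions rather than the repo's own tooling: `check_cidata` and `add_menuentry` are hypothetical helper names, the mount point passed to them is whatever path your system uses, and it assumes the cidata directory contains `user-data` and `meta-data`, the two files the cloud-init NoCloud datasource looks for. The menuentry is appended to the end of grub.cfg, so it shows up as the last boot option.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the mount point argument are hypothetical.

# Return 0 if <usb_root>/cidata holds user-data and meta-data, 1 otherwise.
check_cidata() {
    local usb_root="$1" f
    for f in user-data meta-data; do
        if [ ! -f "$usb_root/cidata/$f" ]; then
            echo "missing: $usb_root/cidata/$f" >&2
            return 1
        fi
    done
    return 0
}

# Append the autoinstall menuentry to <usb_root>/boot/grub/grub.cfg,
# skipping the write if an entry with the same title is already there.
add_menuentry() {
    local grub_cfg="$1/boot/grub/grub.cfg"
    if grep -qF 'Install GH200 System (Requires Internet)' "$grub_cfg"; then
        return 0
    fi
    # Quoted heredoc delimiter keeps the quotes and semicolon literal.
    cat >> "$grub_cfg" <<'EOF'
menuentry "Install GH200 System (Requires Internet)" {
    set gfxpayload=keep
    linux /casper/vmlinuz quiet autoinstall 'ds=nocloud-net;s=file:///cdrom/cidata/'
    initrd /casper/initrd
}
EOF
}
```

With the drive mounted at, say, `/media/usb`, running `check_cidata /media/usb && add_menuentry /media/usb` verifies the seed files and then adds the entry. Running it a second time leaves grub.cfg unchanged.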
NOTE: There is a bug with the Supermicro ARS-111GL-NHR that causes the system to hang when rebooting with a USB drive connected. It is unknown whether this affects all GH200 systems. After installation, the screen will likely appear stuck. Simply remove the USB drive (or power cycle from the BMC) and the system should boot right into the OS.
Run the following commands to validate all drivers are working:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   33C    P0             76W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
2. Ensure the nvidia-peermem kernel module has loaded correctly
lsmod | grep nvidia_peermem
3.