NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

Maas Packer Can't Find EFI Partition to Load GRUB After Imaging with DGX 5.4 iso #1217

kschlichter closed this issue 1 year ago

kschlichter commented 2 years ago

My customer is trying to reimage their DGX A100 cluster to DGX OS 5.4. After pushing the image to the DGX A100 and rebooting, the system boots to a GRUB prompt. Manually loading the kernel works, but I can't find documentation on how to point the Packer image at the right /efi partition so that GRUB can load the kernel. I've seen a similar issue mentioned where a user was able to install DGX OS 5.0.5 and then update from there. My customer tried this, but reports that 5.0.5 failed as well, adding:

I see this in the dgs README, "TODO Next: * kernel parameters in MAAS (w/ tags)".

I can't seem to find any documentation on this "TODO" for MAAS kernel parameters. Could this be the issue? Do you know of any documentation covering this?

The original text is below:

I was able to build and use the image with MAAS. The Packer image seems to want to use /efi/boot/grub64.efi after the installation, but this file doesn't exist (see attached image), so the system drops me into the GRUB command shell.

It looks like /efi is on a separate partition.
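
One way to confirm which loaders are actually present, from a system that has been booted by hand, is to list the EFI binaries on the EFI system partition. The following is a minimal sketch and not something taken from the DeepOps repository: the function names `list_efi_loaders` and `has_efi_loader` are hypothetical, and the default mount point `/boot/efi` is an assumption (common on Ubuntu-based systems) that may differ on this image.

```shell
#!/bin/sh
# Hypothetical helper: list EFI binaries found under an ESP mount point.
# The default /boot/efi is an assumption; pass the real mount point if it differs.
list_efi_loaders() {
    esp_root="${1:-/boot/efi}"
    if [ ! -d "$esp_root" ]; then
        echo "not a directory: $esp_root" >&2
        return 1
    fi
    # Match case-insensitively, since names on a FAT partition may use either case.
    find "$esp_root" -type f -iname '*.efi' | sort
}

# Hypothetical helper: succeed only if a loader exists at a path
# given relative to the ESP root.
has_efi_loader() {
    esp_root="$1"
    rel_path="$2"
    [ -f "$esp_root/$rel_path" ]
}
```

For example, `list_efi_loaders /boot/efi` prints whatever loaders exist, and `has_efi_loader /boot/efi efi/boot/grub64.efi` checks the specific path the firmware appears to be looking for. Comparing the two may show whether the expected loader is simply missing or lives under a different directory.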

I can load it manually using:

set root=(md/0)

set prefix=(md/0)/boot/grub

insmod normal

normal

This then allows me to select DGX OS and boot up to finish a "successful" MAAS deployment.

I'm sure I can figure out a way to jury-rig this to work, but I thought I would throw it y'all's way to see if you have a quick and easy solution before I customize your already custom Packer image.

biocyberman commented 2 years ago

We are also impacted by this issue.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.