Closed cunnie closed 1 year ago
pci_passthroughs enables Dynamic DirectPath IO and sets memory_reservation_locked_to_max and upgrade_hw_version to true. Each entry requires the PCI card's device_id and vendor_id. Available in v97+.

We'd like BOSH-deployed VMs on vSphere to be able to access hardware such as Nvidia graphics cards, in order to enable AI-related applications such as machine learning and large language models (LLMs).

This feature requires vSphere 7.0 (a Dynamic DirectPath IO requirement).

When pci_passthroughs are configured, the properties memory_reservation_locked_to_max and upgrade_hw_version are additionally set to true. The former is a requirement for PCI Passthrough; the latter, for Dynamic PCI Passthrough. Jammy stemcells have too low a hardware version (13); Dynamic PCI Passthrough requires 17+.

We chose Dynamic DirectPath IO (DDPIO) instead of the earlier DirectPath IO because, as far as we can tell, it's more flexible: DDPIO merely requires the PCI card's vendor and device IDs, whereas the non-Dynamic variant requires the PCI path on the ESXi host (e.g. 0000:17:00.0), which can constrain the VM placement to a single host.

Sample VM extension for Nvidia T4 card:

vm_extensions:
- name: gpu
  cloud_properties:
    pci_passthroughs:
    - vendor_id: 0x10de # Nvidia
      device_id: 0x1eb8 # Tesla T4
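
The behaviour described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the CPI's actual implementation: the helper names effective_cloud_properties and hardware_upgrade_needed? are hypothetical, and the only facts taken from this PR are the two implied properties and the hardware-version numbers (13 on Jammy, 17+ required).

```ruby
require "yaml"

# Minimum virtual hardware version for Dynamic PCI Passthrough, as stated in this PR.
MIN_HW_VERSION_FOR_DDPIO = 17

# The sample VM extension from this PR, embedded so the sketch is self-contained.
SAMPLE_EXTENSION = <<~EXTENSION
  vm_extensions:
  - name: gpu
    cloud_properties:
      pci_passthroughs:
      - vendor_id: 0x10de # Nvidia
        device_id: 0x1eb8 # Tesla T4
EXTENSION

# Hypothetical helper (an assumption, not CPI code): returns the cloud
# properties with both implied flags forced to true whenever at least one
# pci_passthroughs entry is present. Each entry must carry both IDs.
def effective_cloud_properties(cloud_properties)
  passthroughs = cloud_properties.fetch("pci_passthroughs", [])
  return cloud_properties.dup if passthroughs.empty?

  passthroughs.each do |entry|
    %w[vendor_id device_id].each do |key|
      raise ArgumentError, "pci_passthroughs entry is missing #{key}" unless entry.key?(key)
    end
  end

  cloud_properties.merge(
    "memory_reservation_locked_to_max" => true,
    "upgrade_hw_version" => true
  )
end

# Hypothetical helper: true when the VM's hardware version is below the
# minimum that Dynamic PCI Passthrough needs.
def hardware_upgrade_needed?(current_hw_version)
  current_hw_version < MIN_HW_VERSION_FOR_DDPIO
end

extension = YAML.safe_load(SAMPLE_EXTENSION).fetch("vm_extensions").first
props = effective_cloud_properties(extension.fetch("cloud_properties"))

puts props
puts "Jammy (hw version 13) needs upgrade: #{hardware_upgrade_needed?(13)}"
```

Properties without any pci_passthroughs entries pass through unchanged, so in this sketch the two flags are only forced on for VMs that actually request a card.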
We bump the vSphere SDK 6.5 → 7.0 when running unit tests to accommodate pci_passthrough_spec.rb, which would otherwise fail with the error "NameError: uninitialized constant Vim.Vm.Device.VirtualPCIPassthrough::AllowedDevice", because AllowedDevice only exists in the vSphere 7.0 SDK (it's a component of vSphere's Dynamic DirectPath IO). Note that we have already passed the end of general support (EOGS) for vSphere 6.5 and 6.7 (2022-10-15), and are rapidly approaching the end of technical guidance (EOTG) (2023-11-15).

Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Related PR and Issues
Fixes # (issue)
Impacted Areas in Application
List general components of the application that this PR will affect:
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
We deployed a VM with an LLM (large language model) to a cluster of two ESXi hosts, each with an Nvidia Tesla T4 card. We were able to deploy VMs, attach the Nvidia PCI cards, and use them.
Test Configuration:
Checklist: