cloudfoundry / bosh-vsphere-cpi-release

BOSH vSphere CPI
Apache License 2.0

`pci_passthroughs` enables Dynamic DirectPath IO #370

Closed cunnie closed 1 year ago

cunnie commented 1 year ago

pci_passthroughs enables Dynamic DirectPath IO

We'd like the ability for BOSH-deployed VMs on vSphere to be able to access hardware such as Nvidia graphics cards in order to enable AI-related applications such as machine learning and large language models (LLMs).

This feature requires vSphere 7.0 (Dynamic DirectPath IO requirement).

When pci_passthroughs is configured, the properties memory_reservation_locked_to_max and upgrade_hw_version are additionally set to true. The former is a requirement for PCI Passthrough; the latter, for Dynamic PCI Passthrough, because Jammy stemcells ship with hardware version 13, which is too low: Dynamic PCI Passthrough requires 17+.
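The implied settings can be pictured with a minimal Ruby sketch. The helper name `apply_pci_passthrough_defaults` is hypothetical and is not taken from the CPI source; only the three property names come from this PR.

```ruby
# Hypothetical helper, NOT the CPI's actual code: it models how configuring
# pci_passthroughs could imply the two additional properties.
def apply_pci_passthrough_defaults(cloud_properties)
  passthroughs = Array(cloud_properties['pci_passthroughs'])
  return cloud_properties if passthroughs.empty?

  cloud_properties.merge(
    'memory_reservation_locked_to_max' => true, # requirement for PCI passthrough
    'upgrade_hw_version' => true                # stemcell hardware version 13 is below the required 17+
  )
end

gpu = { 'pci_passthroughs' => [{ 'vendor_id' => 0x10de, 'device_id' => 0x1eb8 }] }
effective = apply_pci_passthrough_defaults(gpu)
plain = apply_pci_passthrough_defaults({ 'cpu' => 2 })
```

In this sketch, properties without `pci_passthroughs` pass through unchanged, so VMs that do not request a device keep their existing behavior.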

We chose Dynamic DirectPath IO (DDPIO) instead of the earlier DirectPath IO because, as far as we can tell, it's more flexible: DDPIO merely requires the PCI card's vendor and device IDs, whereas non-dynamic DirectPath IO requires the card's PCI path on the ESXi host (e.g. 0000:17:00.0), which can constrain VM placement to a single host.
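The placement difference can be illustrated with a toy Ruby model. Everything here is an assumption made for illustration: the host names, the second PCI address, and both helper functions are invented, and this is not how vSphere itself selects hosts.

```ruby
# Illustrative inventory only: host names and the second PCI address are made up.
PciDevice = Struct.new(:host, :address, :vendor_id, :device_id)

INVENTORY = [
  PciDevice.new('esxi-1', '0000:17:00.0', 0x10de, 0x1eb8),
  PciDevice.new('esxi-2', '0000:3b:00.0', 0x10de, 0x1eb8)
].freeze

# Dynamic style: any host holding a card with matching vendor/device IDs qualifies.
def hosts_by_ids(inventory, vendor_id:, device_id:)
  inventory.select { |d| d.vendor_id == vendor_id && d.device_id == device_id }
           .map(&:host).uniq
end

# Non-dynamic style: the request names one specific PCI path.
def hosts_by_address(inventory, address:)
  inventory.select { |d| d.address == address }.map(&:host).uniq
end

dynamic_hosts = hosts_by_ids(INVENTORY, vendor_id: 0x10de, device_id: 0x1eb8)
static_hosts  = hosts_by_address(INVENTORY, address: '0000:17:00.0')
```

With this made-up inventory, matching by IDs leaves both hosts eligible, while matching by PCI path leaves only the host that owns that path.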

Sample VM extension for Nvidia T4 card:

vm_extensions:
- cloud_properties:
    pci_passthroughs:
    - vendor_id: 0x10de # Nvidia
      device_id: 0x1eb8 # Tesla T4
  name: gpu

We bump the vSphere SDK from 6.5 to 7.0 when running unit tests to accommodate pci_passthrough_spec.rb, which would otherwise fail with the error "NameError: uninitialized constant Vim.Vm.Device.VirtualPCIPassthrough::AllowedDevice" because AllowedDevice only exists in the vSphere 7.0 SDK (it's a component of vSphere's Dynamic DirectPath IO). Note that vSphere 6.5 and 6.7 have already passed the end of general support (EOGS, 2022-10-15) and are rapidly approaching the end of technical guidance (EOTG, 2023-11-15).
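The failure mode is ordinary Ruby constant lookup, which a small sketch can reproduce. The `FakeSdk65` and `FakeSdk70` modules are stand-ins invented for this example; they are not the real vSphere SDK bindings and only mimic whether `AllowedDevice` is defined.

```ruby
# Stand-in modules, NOT the real SDK: they only model presence or absence
# of the nested AllowedDevice constant.
module FakeSdk65
  class VirtualPCIPassthrough; end
end

module FakeSdk70
  class VirtualPCIPassthrough
    class AllowedDevice; end
  end
end

# Referencing a missing nested constant raises NameError, as in the spec failure.
def allowed_device_or_error(sdk)
  sdk::VirtualPCIPassthrough::AllowedDevice
rescue NameError => e
  "NameError: #{e.message}"
end

# A guard that checks for the constant without raising (no ancestor lookup).
def supports_allowed_device?(sdk)
  sdk::VirtualPCIPassthrough.const_defined?(:AllowedDevice, false)
end

old_result = allowed_device_or_error(FakeSdk65)
new_result = allowed_device_or_error(FakeSdk70)
```

Against the older stand-in, the lookup yields a NameError message naming `AllowedDevice`; against the newer one, it returns the class.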


How Has This Been Tested?


We deployed a VM running an LLM to a cluster of two ESXi hosts, each with one Nvidia Tesla T4 card. We were able to deploy VMs, attach the Nvidia PCI cards, and use them.


cunnie commented 1 year ago

BOSH.IO docs

vm_extensions:
- name: gpu
  cloud_properties:
    pci_passthroughs:
    - vendor_id: 0x10de # Nvidia
      device_id: 0x1eb8 # Tesla T4