
RFD 114 GPGPU Instance Support in Triton: Discussion #72

Open affixalex opened 6 years ago

affixalex commented 6 years ago

I was really excited to see RFD 114! GPGPU compute on Illumos has been an interest of mine for years. I think the shortest open path on Triton would be to enable PCI-E passthrough to KVM instances running Linux, thereby allowing the use of native Linux GPU drivers. This obviously isn't as good in the limit as a native Illumos solution, but it would allow KVM instances to immediately leverage the huge existing Linux software ecosystem and achieve parity with, e.g., EC2.

http://hypoalex.github.io/jekyll/update/hardware/machine-learning/pci-passthrough-on-illumos-kvm.html

joshwilsdon commented 6 years ago

@hypoalex thanks for the feedback. At this point we're planning to use PCI passthrough with bhyve rather than KVM. I just updated the RFD with some additional details as to current thinking for how this will tie together.

[ Update: since I didn't open this issue, this wasn't in the original description, but this thread is for discussion of RFD 114. ]

trentm commented 6 years ago

@joshwilsdon Perhaps add a note that support's VM migration scripts will need a guard for dealing with assigned_devices?

joshwilsdon commented 6 years ago

@trentm I'll add a note that that's outside the scope of the RFD. Thanks.

jlevon commented 6 years ago

Capturing some of the discussion from last night here. As a lot of it is platform stuff rather than Triton, I'm not sure if we need a separate RFD, or if we should widen the scope of this one somewhat (perhaps so, as we're talking about decommissioning devices in the RFD questions). The specifics below are of course very subject to change.

Identifying passthrough devices in zonecfg

Part of RFD 121, probably, although we will need this piece much sooner than the rest of the items discussed there for bhyve zonecfg. Today we have, for example, match=/dev/ppt3. This isn't very useful: PPT instance numbers are not stable, nor are they sufficient for identifying a particular device.

Instead, we will have something like match=/devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0 - i.e. the same format as proposed in this RFD. This, in concert with model=passthru, is treated specially: on zone boot, we will use it as a key to identify which /dev/pptX device to add to the zone device configuration. We will transparently modify the env vars passed down to the boot stub so that it refers to this /dev/pptX device as well.

Note that we still identify specific device paths even for passthru devices we consider fungible such as GPGPUs. This isn't strictly necessary, but is a simplifying assumption.
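To make the shape concrete, here is a sketch of what such a zonecfg entry might look like under this proposal; the zone name is made up, and the model=passthru handling is the proposed behaviour described above, not anything zonecfg does today:

# Sketch only: assumes an existing bhyve zone named "gpu00" (hypothetical) and
# that the device resource accepts the proposed model=passthru property.
zonecfg -z gpu00 <<'EOF'
add device
set match=/devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0
set model=passthru
end
commit
EOF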

Reserving passthru devices

We need a mechanism to attach the ppt driver to devices. This must be persistent, and it needs to happen during early boot. While the timing doesn't matter much for GPU devices, since nothing else will attach to them, storage devices will attract a driver, and we could even end up trying to import a pool stored on them, with disastrous results.

The existing method used for testing involves using update_drv to attach the ppt driver to all devices matching the GPGPU PCI device IDs. I don't know if it's been tested, but this should work for attaching to a specific device path too.
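For reference, a sketch of both forms; the PCI ID alias is just the vendor/device pair taken from the example path above, used as a placeholder:

# Sketch of the testing approach: bind ppt to every device matching a PCI ID alias.
# (pci10de,1214 is a placeholder taken from the example device path in this RFD.)
update_drv -a -i '"pci10de,1214"' ppt

# Path-oriented alias form, binding ppt to one specific device only:
update_drv -a -i '"/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0"' ppt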

The presumption here is that when provisioning a CN with some number of passthru devices, we (who? how?) will identify them as ppt devices, perhaps with something like:

pptadm add /devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0

(There are likely friendlier forms here, and I have no idea what this needs to look like for anything up-stack.)

This will add the path to some "PPT database" that is persistent on the CN. Via some mechanism, ppt instances are created and attached to each path found in this DB during early boot. In the case of two matching drivers, I'm not sure how we make sure PPT "wins". Note this always happens for every entry, regardless of whether it's used in an actual VM or not.
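To make that concrete, one plausible shape for this DB is a driver_aliases-style file of path-oriented aliases; the /etc/ppt_aliases name comes from the strawman below, and the exact format is an assumption:

# Hypothetical /etc/ppt_aliases entry, one line per reserved device.
# Early boot would create and bind a ppt instance for each listed path.
ppt "/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0"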

We need to process this DB early, as Hans mentioned. But if we're running before /usr and /zones are available, it's completely unclear to me where we get this DB from in the PXE case. For booting from the key, we can have a grub module with /etc/ppt_aliases or whatever, which the kernel (presumably) sucks in.

This is simpler if we just need to attach a bit later on, but before we could do anything silly like import a pool. That presumes that Hans can make "boot out an attached - but not used - driver" work reliably.

Perhaps for decommissioning, we would want to identify a device as "dirty" in this database. That is, it needs a clean reset before it can be used again by another VM.

sysinfo

As described in the RFD, we need to provide "Assignable Devices". This seems like it could be derived from the combination of enumerating /dev/ppt* and the zone configs (presumably ignoring any failed instances). Personally, I'd prefer to keep zonecfg as the authoritative source of PPT assignments to VMs, but perhaps others feel differently? An alternative is to mark usage in the ppt DB, but that seems like storing the same information twice.
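A rough sketch of that derivation, assuming /dev/ppt* enumeration plus zonecfg as the source of truth; nothing here is a settled interface, and the output format is made up:

#!/bin/bash
# Sketch: derive candidate "Assignable Devices" from ppt instances plus zone configs.

# Enumerate the ppt instances present on this CN.
for ppt in /dev/ppt*; do
    [ -e "$ppt" ] || continue    # glob didn't match: no ppt devices at all
    echo "ppt instance: $ppt"
done

# Treat zonecfg as the authoritative record of which devices are already assigned.
for z in $(zoneadm list -c); do
    zonecfg -z "$z" info device 2>/dev/null | egrep -i 'match|model' | sed "s|^|$z: |"
done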

jlevon commented 6 years ago

Some specific questions I had on the RFD:

rmustacc commented 6 years ago

I'm not sure that the platform actually makes sense as the place for determining what a device really is in terms of the variant. I think it actually makes sense to have the platform just be honest about what the device is and let site-specific policy deal with how you want to market and sell that in terms of packages. Imagine we have a cloud that adopted GPU passthrough at an earlier generation, and an on-prem customer who wants to have their own set. If it's baked into the platform and not in Triton, then that means we have to have a universal set of names. Ultimately, if having the visibility on the CN is useful, we should pass that information to the CN instead of having the CN determine it.

I think for me, the reason I would make the determination this way is that the variant isn't a property of the compute node in any way; it's a property of something external to it, whereas the actual hardware inside it is such a property, and that's what we should expose.

In terms of passing information about which devices paths are used for ppt, I would probably bootstrap this through bootfs and a boot time module. Then, I would actually consider doing the server class work in triton so we can associate this with hardware profiles.

jlevon commented 6 years ago

Re variants: fair enough.

Re the boot time module: I'm not clear on how you're proposing that works for the iPXE case, but then I know nothing at all about how the PXE server is managed.

hrosenfeld commented 6 years ago

Binding a driver to a specific device path using update_drv works; I have tested it. Also, bindings to specific paths take precedence over bindings to PCI IDs, so we're good here.

jlevon commented 6 years ago

Thanks Hans, that's great.

I had a separate conversation with rm around PXE: the quick summary there is that it's fine in that case too to expect a grub module with our binding info. It'll be up to the machine definitions as loaded into sdc-booter to provide our PPT bindings for each system as it PXE boots.

That is, we can likely slurp in the PPT bindings at around the same time as we do driver_aliases.

For USB boot we should be able to use the same mechanism. It may still be worth having a pptadm add that populates the file (and maybe adjusts the grub.conf?), for dev/test purposes.
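A dev/test flow under that scheme might look roughly like the following; pptadm add is still the proposed interface from earlier rather than an existing command, and the reboot assumes bindings are only picked up at early boot:

# Proposed dev/test flow only -- 'pptadm add' is the strawman interface from above.
pptadm add /devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0

# After whatever updates the boot-time binding info (file and/or grub.conf), reboot
# and check that a ppt instance now exists for the device:
reboot
ls -l /dev/ppt*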

jlevon commented 6 years ago

Thanks for the update, Josh. A few more thoughts:

"pci": [
  {
      "path": "/devices/..."
  }

so you can specify properties, like "disks" does today.

leoj3n commented 5 years ago

Heads up—The location of the OP's link has changed to:

http://affixalex.github.io/jekyll/update/hardware/machine-learning/pci-passthrough-on-illumos-kvm.html

Anyway, I don't mean to hijack this thread, but I'm not too familiar with the SmartOS community (or its development) and found this discussion through Google. I hope this isn't a bad place to ask for some clarification, to point me and others in the right direction regarding recent developments.

While SmartOS is new to me, I have accomplished GPU passthrough to KVM with Proxmox in the past.

Regarding the GPU passthrough being discussed here, I'm wondering:

I've tried to research the state of GPU passthrough in SmartOS and found:

It looks like there is a development branch called dev-bhyve:

This enhancement serves to collect the dependencies related to merging the dev-bhyve branch into master. (https://smartos.org/bugview/OS-6615)

bhyve ported to IllumOS and combined with Zones (https://youtu.be/90ihmO281GE)

Which could potentially replace the SmartOS port of KVM:

[bhyve] is being ported to SmartOS as a potential replacement for KVM (https://github.com/joyent/rfd/tree/master/rfd/0121)

Furthermore, an earlier reply from @joshwilsdon states:

At this point we're planning to use PCI passthrough with bhyve rather than KVM https://github.com/joyent/rfd/issues/72#issuecomment-362879074

But, according to that one FreeBSD page:

bhyve does not support VGA passthrough devices at this time (https://wiki.freebsd.org/bhyve/pci_passthru)

So, I'm also wondering: Are the people commenting in this issue developing GPU passthrough for "zhyve", which would probably eventually be shared back upstream to FreeBSD bhyve?

As an individual on a budget, I would love to be able to pass through a single GPU to a specific guest while still taking advantage of all the benefits of SmartOS (native ZFS, zones, increased disk space).

Here are some other people mentioning that GPU passthrough would be nice to have with SmartOS:

Regarding GPU passthrough, it looks like the developers of SmartOS have in the past understandably seen "limited business applicability for this in the cloud". Has this changed recently, perhaps due to the aforementioned "GPGPU" (general-purpose computing on graphics processing units)?

hrosenfeld commented 5 years ago

leoj3n:

PCI device passthrough is supposed to be working in current SmartOS, although we don't currently use it ourselves. During development I have successfully used passthrough of NVMe controllers, SAS HBAs, NICs, and GPUs. The statement about VGA passthru not working in bhyve is correct if you want to use the GPU as a VGA console -- but it does work as long as all you want is GPGPU support for accelerated compute.

I've used SmartOS, FreeBSD, and Linux as guests with PCI passthrough. On Linux I've used the NVIDIA drivers and CUDA toolkits to test GPGPU support.
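For anyone wanting to reproduce that, a minimal smoke test inside the Linux guest (assuming the NVIDIA driver and CUDA samples are already installed) looks roughly like:

# Inside the Linux guest: confirm the passed-through GPU shows up on the PCI bus,
lspci -nn | grep -i nvidia

# that the NVIDIA kernel driver has attached to it,
nvidia-smi

# and that CUDA can actually see a device (deviceQuery is built from the CUDA samples).
./deviceQuery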

Smithx10 commented 5 years ago

@hrosenfeld Is that with vGPU support or just passing the entire card in? I imagine it's the entire card since I don't think Illumos has native Nvidia drivers. Looking at https://www.nvidia.com/object/unix.html, it seems they have some kind of Unix support. Is vGPU planned or does passthrough solve your problem?

doublerebel commented 5 years ago

@Smithx10 The Solaris driver is here: https://www.nvidia.com/Download/driverResults.aspx/127153/en-us

And the open forum GPGPU question is here: https://devtalk.nvidia.com/default/topic/1036282/solaris/features-for-this-platform/

I have a business need to run on-prem containers which include GPGPU and I would really like it to be SmartOS!! (Edit: as in, I would be happy with LX on SmartOS with passthrough as discussed here.) The security model and VLAN setup would be hard to duplicate otherwise. I'd love this feature in dev even before Triton cloud.

leoj3n commented 5 years ago

Thanks, @hrosenfeld. When you say:

I've used SmartOS, FreeBSD, and Linux as guests with PCI passthrough.

It's not clear to me whether or not you had a VGA console when passing through to those guests. I'll assume you did not.

If you're like me and want a VGA console in the guest on SmartOS, unfortunately it looks like this thread is only about using the GPU to crunch numbers (which is probably a much less nuanced thing to do; as Hans stated, generic PCI passthrough is all that's needed to use the NVIDIA drivers and CUDA toolkit for GPGPU support).

I wonder: if bhyve were to eventually support VGA passthrough (there's been some discussion about this), would the bhyve port on SmartOS inherit that ability?

@doublerebel just to clarify my understanding of your GPGPU use case, you want to pass through a single physical graphics card per guest?

@Smithx10 also to clarify, you were asking specifically about passing through a single physical graphics card to multiple guests (via NVIDIA virtual GPU or SR-IOV)?

It's really cool to find out that SmartOS will be able to make use of a GPU (for GPGPU at least). I'll have to look more into GPGPU and see if I can play with it for any workloads that are relevant to me. Thx :)

hrosenfeld commented 5 years ago

@Smithx10 There are no native drivers for SmartOS that could be used for GPGPU applications. PCI passthrough for GPUs will deal with whole PCI functions, which to the best of my knowledge means whole GPU cards for now.

@leoj3n To be clear again, there is no VGA passthrough support. I've never used a VGA console in my bhyve VMs, not emulated and certainly not as passthrough. All we at Joyent cared about was getting GPGPU compute stuff working in Linux guests. As far as I know the bhyve UEFI firmware lacks initialization code for real VGA devices, I have no idea how hard it would be to add that. Perhaps even without that it may be possible to use a GPU for X11 in a Linux guest, but I really have no idea and no way to test.

namgo commented 7 months ago

As far as I know the bhyve UEFI firmware lacks initialization code for real VGA devices

This might explain someone linking https://github.com/Beckhoff/edk2/commits/phab/corvink/gvt-d in a FreeBSD forum, which as far as I understand is intended to provide UEFI support for bhyve guests and relates to some patches in bhyve surrounding PCI passthrough. If this becomes of interest to you or your users at some point in the future, the recent passthru developments in FreeBSD would be worth watching.

(edited for brevity... I got excited sorry)