affixalex opened this issue 6 years ago
@hypoalex thanks for the feedback. At this point we're planning to use PCI passthrough with bhyve rather than KVM. I just updated the RFD with some additional details as to current thinking for how this will tie together.
[ Update: Since I didn't open this issue it wasn't in the initial description, but this is for discussion of RFD 114 ]
@joshwilsdon Perhaps a note that support's VM migration scripts will need to have a guard on dealing with assigned_devices?
@trentm I'll add a note that that's outside the scope of the RFD. Thanks.
Capturing some of the discussion from last night here. As a lot of it is platform stuff, not Triton, I'm not sure if we need a separate RFD, or if we should widen the scope of this one somewhat (perhaps so, as we're talking about de-commissioning devices in the RFD questions). The specifics below are of course very subject to change.
Part of RFD 121, probably, although we will need this part much sooner than the rest of the items discussed there for bhyve zonecfg. Today we have, for example, match=/dev/ppt3. This isn't very useful, as the PPT instance numbers are not a stable property, nor are they sufficient for identifying a particular device.
Instead, we will have something like match=/devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0 - i.e. the same format as proposed in this RFD. This, in concert with model=passthru, is treated specially. Namely, on zone boot, we will use this as a key to identify which /dev/pptX device to add to the zone device configuration. We will transparently modify the env vars passed down to the boot stub so that it refers to this /dev/pptX device as well.
Note that we still identify specific device paths even for passthru devices we consider fungible such as GPGPUs. This isn't strictly necessary, but is a simplifying assumption.
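To make that concrete, the zonecfg side might end up looking something like this (a sketch only - the exact property names are whatever we settle on, using the match/model form above):

```
add device
set match=/devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0
set model=passthru
end
```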
We need a mechanism to attach the ppt driver to devices. This must be persistent, and it needs to happen during early boot. While it doesn't matter much for GPU devices, since nothing else will attach to them, storage devices will attract a driver, and we could even end up trying to import a pool stored on them, with disastrous results.
The existing method used for testing involves using update_drv to attach the ppt driver to all the matching GPGPU PCI device IDs. I don't know if it's been tested, but this should work for attaching to a specific device path too.
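Roughly, that looks like the following (a sketch; the quoted-alias form is the usual update_drv convention, and the specific-path variant is the untested case):

```
# Bind ppt to every device matching a given GPGPU PCI ID:
update_drv -a -i '"pci10de,1214"' ppt

# Or, for one specific device path (untested, as noted above):
update_drv -a -i '"/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0"' ppt
```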
The presumption here is that when provisioning a CN with some number of passthru devices, we (who? how?) will identify them as ppt devices, perhaps with something like:
pptadm add /devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0
(There are likely friendlier forms here, and I have no idea what this needs to look like for anything up-stack.)
This will add the path to some "PPT database" that is persistent on the CN. Via some mechanism, ppt instances are created and attached to each path found in this DB during early boot. In the case of two matching drivers, I'm not sure how we make sure ppt "wins". Note this always happens for every entry, regardless of whether it's used in an actual VM or not.
We need to process this DB early, as Hans mentioned. But if we're before /usr and /zones, it's completely unclear to me where we get this DB from in the PXE case.
For booting from the key, we can have a grub module with /etc/ppt_aliases or whatever that the kernel sucks in (presumably).
This is simpler if we just need to attach a bit later on, but before we could do anything silly like import a pool. That presumes that Hans can make "boot out an attached - but not used - driver" work reliably.
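Presumably /etc/ppt_aliases would mirror the driver_aliases format - one "driver alias" pair per line, with the device path (sans /devices) as the alias. Hypothetically:

```
ppt "/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0"
```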
Perhaps for decommissioning, we would want to identify a device as "dirty" in this database. That is, it needs a clean reset before it can be used again by another VM.
As described in the RFD, we need to provide "Assignable Devices". This seems like it could be derived from the combination of enumerating /dev/ppt* and the zone configs (presumably ignoring any failed instances). Personally, I'd prefer to keep zonecfg as the authoritative source of PPT assignments to VMs, but perhaps others feel differently? An alternative is to mark usage in the ppt DB, but that seems like storing the same information twice.
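The derivation could be sketched like this (illustrative only - the lists here are stand-ins for enumerating /dev/ppt* and for whatever we scrape out of the zone configs):

```shell
#!/bin/sh
# Stand-in for `ls /dev/ppt*`: every ppt instance on the CN.
all_ppt="/dev/ppt0
/dev/ppt1
/dev/ppt2"

# Stand-in for scanning zone configs: devices already assigned to VMs.
assigned="/dev/ppt1"

# Anything not referenced by a zone config is assignable.
for dev in $all_ppt; do
    case " $assigned " in
        *" $dev "*) ;;                  # already assigned to a VM
        *) echo "assignable: $dev" ;;
    esac
done
```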
Some specific questions I had on the RFD:
- should sysinfo be reporting all PPT devices, not just those that are available/assignable? From the commentary on "gpus_available" it sounds like Triton itself will figure out from server.vms which of the ppt devices are actually used by a deployed zone, which would imply so. Certainly Triton seems a better place to understand "failed"-state VMs etc.
- are we safe against a TOCTOU race, i.e. two racing provisions that attempt to pick the same assignable device? I don't know if Triton allows simultaneous provisioning.
- is CNAPI definitely the right place to host the PCI->variant conversion? It seems useful on a CN to be able to list all PPTs by variant, and it seems like the platform is the right place for understanding what a device really "is".
- "gpus" isn't general enough for other forms of passthru. Would we get a "nvmes" too? Or should this be "passthru_dev_count"?
I'm not sure that the platform actually makes sense for determining what a device really is in terms of the variant. I think it actually makes sense to have the platform just be honest about what the device is and let the site-specific policy actually deal with how you want to market and sell that in terms of packages. Imagine we have a cloud that has adopted GPU passthrough at an earlier generation from an on-prem customer who wants to have their own set. If it's baked into the platform and not in Triton, then that means we have to have a universal set of naming. Ultimately, if having the visibility on the CN is useful, we should pass that information to the CN instead of having the CN determine it.
I think for me, the reason that I would make the determination this way is that the variant isn't a property of the compute node in any way. It's a property of something external to it, whereas the actual hardware inside it is, and that's what we should expose.
In terms of passing information about which device paths are used for ppt, I would probably bootstrap this through bootfs and a boot-time module. Then I would actually consider doing the server class work in Triton so we can associate this with hardware profiles.
Re variants: fair enough.
Re the boot-time module, I'm not clear on how you're proposing that works for the iPXE case, but then I know nothing at all about how the PXE server is managed.
Binding a driver to a specific device path using update_drv works, I have tested it. Also bindings to specific paths take precedence over bindings to PCI IDs, so we're good here.
Thanks Hans, that's great.
I had a separate conversation with rm around PXE: the quick summary there is that it's fine in that case too to expect a grub module with our binding info. It'll be up to the machine definitions as loaded into sdc-booter to provide our PPT bindings for each system as it PXE boots. That is, we can likely slurp in the PPT bindings at around the same time as we do driver_aliases.
For USB boot we should be able to use the same mechanism. It may still be worth having a pptadm add that populates the file (and maybe adjusts the grub.conf?) for dev/test purposes.
Thanks for the update, Josh. A few more thoughts:
the "pptadm add" is really only for dev/test, and isn't complete (for one, I think it's too risky for us to try to modify the USB grub menu to include the PPT module). The RFD should probably make that clear, and also that the main mechanism will be for machine profiles to populate the module via sdc-booter. So in the "Provisioning" part there's no pptadm equivalent: the PPT devices are bound when the CN is built, rather than after. But perhaps I'm misunderstanding the workflow.
I'm not clear as to why we need to list all PCI devices. Given the above, we should normally be in a situation where all PPT devices are pre-configured. It's far easier for us to report only ppt-bound devices and their device paths (in, say, pptadm list).
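e.g. something along the lines of (entirely hypothetical output format):

```
# pptadm list
DEV        PATH
/dev/ppt0  /devices/pci@7b,0/pci8086,6f08@3/pci10b5,8747@0/pci10b5,8747@10/pci10de,1214@0
```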
vmadm - I was expecting something more like:

```
"pci": [
  {
    "path": "/devices/..."
  }
]
```

so you can specify properties, like "disks" does today.
Heads up: the location of the OP's link has changed to:
Anyways, I don't mean to hijack this thread, but I'm not too familiar with the SmartOS community (or development) and found this discussion through Google. I hope this isn't a bad place to ask for some clarification, in order to point me and others in the right direction regarding recent developments.
While SmartOS is new to me, I have accomplished GPU passthrough to KVM with Proxmox in the past.
Regarding the GPU passthrough being discussed here, I'm wondering:
I've tried to research the state of GPU passthrough in SmartOS and found:
It looks like there is a development branch called dev-bhyve:
This enhancement serves to collect the dependencies related to merging the dev-bhyve branch into master. (https://smartos.org/bugview/OS-6615)
bhyve ported to IllumOS and combined with Zones (https://youtu.be/90ihmO281GE)
Which could potentially replace the SmartOS port of KVM:
[bhyve] is being ported to SmartOS as a potential replacement for KVM (https://github.com/joyent/rfd/tree/master/rfd/0121)
Furthermore, an earlier reply from @joshwilsdon states:
At this point we're planning to use PCI passthrough with bhyve rather than KVM https://github.com/joyent/rfd/issues/72#issuecomment-362879074
But, according to that one FreeBSD page:
bhyve does not support VGA passthrough devices at this time (https://wiki.freebsd.org/bhyve/pci_passthru)
So, I'm also wondering: Are the people commenting in this issue developing GPU passthrough for "zhyve", which would probably eventually be shared back upstream to FreeBSD bhyve?
As an individual on a budget, I would love to be able to pass through a single GPU to a specific guest while still taking advantage of all the benefits of SmartOS (native ZFS, zones, increased disk space).
Here are some other people mentioning that GPU passthrough would be nice to have with SmartOS:
Regarding GPU passthrough, it looks like the developers of SmartOS have in the past understandably seen "limited business applicability for this in the cloud"; has this changed recently, perhaps due to the aforementioned "GPGPU" (general-purpose computing on graphics processing units)?
leoj3n:
PCI device passthrough is supposed to be working in current SmartOS, although we don't currently use it ourselves. During development I have successfully used passthrough of NVMe controllers, SAS HBAs, NICs, and GPUs. The statement about VGA passthru not working in bhyve is correct if you want to use it as a VGA console -- but it does work as long as all you want is GPGPU support for accelerated compute.
I've used SmartOS, FreeBSD, and Linux as guests with PCI passthrough. On Linux I've used the NVIDIA drivers and CUDA toolkits to test GPGPU support.
@hrosenfeld Is that with vGPU support or just passing the entire card in? I imagine it's the entire card since I don't think Illumos has native Nvidia drivers. Looking here https://www.nvidia.com/object/unix.html , it seems they have some kind of Unix support. Is vGPU planned or does passthrough solve your problem?
@Smithx10 The Solaris driver is here: https://www.nvidia.com/Download/driverResults.aspx/127153/en-us
And the open forum GPGPU question is here: https://devtalk.nvidia.com/default/topic/1036282/solaris/features-for-this-platform/
I have a business need to run on-prem containers which include GPGPU and I would really like it to be SmartOS!! (Edit: as in, I would be happy with LX on SmartOS with passthrough as discussed here.) The security model and VLAN setup would be hard to duplicate otherwise. I'd love this feature in dev even before Triton cloud.
Thanks, @hrosenfeld. When you say:
I've used SmartOS, FreeBSD, and Linux as guests with PCI passthrough.
It's not clear to me whether or not you had a VGA console when passing through to those guests. I'll assume you did not.
If you're like me, and want a VGA console in the guest on SmartOS, unfortunately it's looking like this thread is only about using the GPU to crunch numbers (which is probably a much less nuanced thing to do; as Hans stated, generic PCI passthrough is all that's needed to use the NVIDIA drivers and CUDA toolkits to test GPGPU support).
I wonder: if bhyve were to eventually support VGA passthrough (there's been some discussion about this), would the bhyve port on SmartOS inherit the ability to do VGA passthrough?
@doublerebel just to clarify my understanding of your GPGPU use case, you want to pass through a single physical graphics card per guest?
@Smithx10 also to clarify, you were asking specifically about passing through a single physical graphics card to multiple guests (via NVIDIA virtual GPU or SR-IOV)?
It's really cool to find out that SmartOS will be able to make use of a GPU (for GPGPU at least). I'll have to look more into GPGPU and see if I can play with it for any workloads that are relevant to me. Thx :)
@Smithx10 There are no native drivers for SmartOS that could be used for GPGPU applications. PCI passthrough for GPUs will deal with whole PCI functions, which to the best of my knowledge means whole GPU cards for now.
@leoj3n To be clear again, there is no VGA passthrough support. I've never used a VGA console in my bhyve VMs, not emulated and certainly not as passthrough. All we at Joyent cared about was getting GPGPU compute stuff working in Linux guests. As far as I know the bhyve UEFI firmware lacks initialization code for real VGA devices, I have no idea how hard it would be to add that. Perhaps even without that it may be possible to use a GPU for X11 in a Linux guest, but I really have no idea and no way to test.
As far as I know the bhyve UEFI firmware lacks initialization code for real VGA devices
This might explain someone linking https://github.com/Beckhoff/edk2/commits/phab/corvink/gvt-d in a FreeBSD forum, which as far as I understand is intended to provide UEFI support for bhyve guests and relates to some patches in bhyve surrounding PCI passthrough. I suspect that if it were of interest to you or your users at some point in the future, the recent developments in FreeBSD surrounding passthru would be of keen interest.
(edited for brevity... I got excited sorry)
I was really excited to see RFD 114! GPGPU compute on Illumos has been an interest of mine for years. I think the shortest open path on Triton would be to enable PCI-E passthrough to KVM instances running Linux, thereby allowing the use of native Linux GPU drivers. This obviously isn't as good in the limit as a native Illumos solution, but it would allow KVM instances to immediately leverage the huge existing Linux software ecosystem and achieve parity with, e.g., EC2.
http://hypoalex.github.io/jekyll/update/hardware/machine-learning/pci-passthrough-on-illumos-kvm.html