QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Port Qubes to ppc64 [3 bitcoin bounty] #4318

Open Rspigler opened 6 years ago

Rspigler commented 6 years ago

QubesOS is the most secure operating system available, by far. However, it unfortunately only runs on the x86 instruction set, whose platforms ship with unauditable and insecure firmware. The Power Architecture is a much more secure ISA. Products like the Talos II (edit: and now the much more affordable Blackbird) with the Power9 CPU are fully open, with auditable schematics, firmware, and software, and being able to run QubesOS on such devices would be a huge win for the infosec community.

There are various ways to achieve this compatibility, so I thought that this issue could be a way to track them/discuss.

There are various ways to achieve this compatibility:

  1. Xen could gain a ppc64 port (Raptor Computing Systems has offered free hardware as an incentive).

  2. The seL4 microkernel could be used (https://github.com/QubesOS/qubes-issues/issues/3894); the seL4 project is already looking into supporting the Power Architecture.

  3. Qubes' Hypervisor Abstraction Layer (HAL), which utilizes libvirt to support multiple hypervisors but currently only supports Xen, could be expanded to support KVM in order to run on ppc64.

March 26, 2022: We are now all in agreement for Xen+Power (option 1).

Funds available as of May 7th, 2022:

  • I (Robert Spigler) have 0.35 bitcoin & a Blackbird Bundle
  • @leo-lb has pledged 0.8 btc (need to confirm)
  • Total: 1.15 btc

@madscientist159 Has offered to do the Xen port for 2 btc (just Xen port; no Qubes integration yet)

Power Foundation has made a statement of support (https://twitter.com/OpenPOWERorg/status/1504112361975730186?s=20), but this needs to be clarified.

We will be moving from Github -> Gitlab for development. (https://gitlab.com/groups/xen-project/-/epics/6)

We have made a mailing list and Matrix room:

  • Mailing list: qubes_port@lists.riseup.net (https://lists.riseup.net/www/info/qubes_port)
  • Matrix room: https://matrix.to/#/#qubes-port:matrix.org

We have now adopted this milestone approach for this Port: (done here)

  1. Phase 1: 0.65BTC. Build tooling, minimal boot to serial console of a Xen kernel on a single core (no SMP, missing drivers, core locked at 100% power).

  2. (Proposed) Phase 1.5: 0.65BTC (Pricing subject to change due to economic fluctuations): SMP, some driver integration (possible power state management?) required to get a usable system in preparation for Phase 2

I (Robert) donated 0.65 bitcoin out of my remaining 1 bitcoin bounty to fulfill the Phase 1 requirement. See here

@Rudd-O donated the entirety of his bounty (0.5 bitcoin) towards Phase 1.5. He no longer has any remaining pledge, and Phase 1.5 has 0.15 btc left to fulfill. See here

We are still waiting for @leo-lb to re-confirm his pledge.

Last updated May 7th, 2022


Details/History of Funding Below:

Please see the below chronological updates to funding:

In summary, we have a 3 bitcoin bounty. Of the 1 bitcoin in matching funds offered, 0.5 was filled before the match offer expired on July 28th 2021 (see here).

Details of the bounty are below:

@leo-lb paid @shawnanastasio 0.2 btc out of his 1 bitcoin bounty here: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-630972681

@Rspigler (me) paid Shawn 0.5 bitcoin out of his 1.5 bitcoin bounty here. I have also offered hardware (Blackbird mainboard and one 4 core Power9 CPU) for a developer who will use it towards this project. See post here.

@Rudd-O pledged 0.5 bitcoin here (has paid 0).

I (Robert) have a remaining 0.5 matching bitcoin offer that expires on July 28th 2021.

Last updated: July 31st, 2022

tlaurion commented 3 years ago

@Rspigler : please add links to OP

KVM:

Additional Bitcoin offer: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-593809167

Will post links to Xen advancements later on.

Rspigler commented 3 years ago

I am sorry I haven't commented in a while; I have been busy with other projects, and things have been rough with quarantine.

I have just reached out to Mr. Pendarakis and hopefully will hear back soon.

I will edit the OP as suggested

To summarize:

We currently have 1 bitcoin from me (for any solution) here: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-463043513 and I offered to match donations up to another 1 bitcoin here: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-515831925

We also have 1 bitcoin from @leo-lb here: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-482372549

I offered an additional bounty of 0.5 bitcoin and an additional match offer of 0.5 bitcoin here, on the condition that Xen was chosen: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-593809167

It has been over a year since I originally offered a matching donation, with no one participating via twitter, reddit, etc. If anyone has any suggestions, I'd appreciate it. At some point, I will have to put a hard expiry on them. It can't last forever.

Edit: leo-lb paid shawnanastasio 0.2 btc out of his 1 bitcoin bounty here: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-630972681

Rspigler commented 3 years ago

Updated OP

hanetzer commented 3 years ago

I've been doing a bit of work on xen for ppc64; currently in the process of just getting to the point of compiling a do-nothing binary. Having a bit of an issue with percpu.h's macro madness, a Xen-specific struct, and one bit of atomic reading. If anyone is willing to take a look at https://github.com/hanetzer/xen/tree/ppc64 and help out, I'd appreciate it.

Compile via the following, assuming the standard Red Hat triple:

make XEN_TARGET_ARCH=ppc64 CROSS_COMPILE=powerpc64-linux-gnu-

shawnanastasio commented 3 years ago

On the KVM front I have fixed one of the last major API incompatibilities I'm aware of between xen-vchan and libkvmchan: https://github.com/shawnanastasio/libkvmchan/issues/20.

I'm now looking into adapting the initial kvm integration work done by @nrgaway for ppc64le and the latest libkvmchan. I've created a new organization here to act as a staging area for any ppc64le/kvm enablement patches before they are finalized and submitted upstream.

shawnanastasio commented 3 years ago

Status Update:

After some patching, I've got my downstream qubes-builder fork (here) building both dom0 and vm fedora32 chroots for ppc64le/kvm. I've used the RPMs produced by the builder to convert a fedora32 installation into a franken-qubes dom0 and am now working through getting kvmchand+qubes*d initialized.

Ready For Review

Of the patches I've written, the following are ready for upstream review:

I will edit this comment to amend the list as more patches are submitted. Currently a few more patches are necessary to get Qubes building on ppc64le/kvm (see the organization), but those aren't upstream ready just yet.

Roadmap

Update 2/9/20: All qubes dom0 daemons launch successfully.

Update 2/10/20: Using a modified libvirt template I can now launch VMs via qvm-start! Currently I get to:

[    1.431383] Run /init as init process
Qubes initramfs script here:
modprobe: FATAL: Module xenblk not found in directory /lib/modules/5.4.91-1.fc32.qubes.ppc64le
modprobe: FATAL: Module xen-blkfront not found in directory /lib/modules/5.4.91-1.fc32.qubes.ppc64le
Qubes: Cannot load Xen Block Frontend...
Waiting for /dev/xvda* devices...

which seems to indicate there's an initrd script that qubes inserts that needs to be tweaked for KVM, which I'll work on next.

Update 2/16/20: Updated qubes dracut init script to support booting on KVM! VMs can now be launched and boot to an interactive console. Next step will be spawning qubes daemons and ensuring vchan connectivity.

Update 3/04/20: After many long debugging sessions, qubes daemons (qubesdb and qrexec) now work as expected in VMs launched by qvm-start! On to the GUI daemon/agent.

DemiMarie commented 3 years ago

@shawnanastasio A couple of questions:

  • How do you plan on implementing networking?
  • Is there available hardware that does not have a remotely-accessible BMC?

shawnanastasio commented 3 years ago

@shawnanastasio A couple of questions:

Hello,

* How do you plan on implementing networking?

This is something I still have to investigate in depth. As I understand it, one of the major pain points will be finding a way to allow VM<->VM network communication without dom0 in the middle acting as a router. The most appropriate way to implement this on KVM seems to be virtio-vhost-user, which should allow two VMs to share a virtio ring buffer that can back a virtio network adapter.

* Is there available hardware that does not have a remotely-accessible BMC?

The most accessible machines I know of are from RaptorCS, and both include an ASpeed BMC that runs OpenBMC firmware (or alternatively a minimal buildroot-based firmware, BangBMC). It is possible to disconnect these devices from the network through various means, from disabling the network within the BMC firmware to (theoretically) compiling the open source (reverse-engineered) Broadcom NIC firmware without NC-SI support, which would prevent the BMC from talking to the NIC at all.

There is also the recently-announced Kestrel project which replaces the ASpeed BMC chip with an FPGA running open HDL which might also become a compelling option for privacy-oriented use-cases.

Rspigler commented 3 years ago

@shawnanastasio Great work!

What are your thoughts re: https://github.com/QubesOS/qubes-issues/issues/4318#issuecomment-515969692

I am accepting the KVM route :) and would like to give Shawn 0.5 bitcoin for his current and future work. Can you please either post or DM me an address?

This removes the increased funding of the Xen port. Also, as warned in my previous comment, I am setting an expiration date on my matching offer. Since I first made the offer on July 28th 2019 with no matching donations yet, it will expire at the two-year mark (July 28th 2021).

I will be updating the OP.

Edit: Sigh...format error. See detached message and sig for validation

Bounty_Update.txt Bounty_Update_Sig.txt

shawnanastasio commented 3 years ago

@shawnanastasio Great work!

Thank you!

What are your thoughts re: #4318 (comment)

These seem like reasonable approaches. Initially I will be targeting a traditional dom0 kernel with GUI enabled but no networking, and as few device drivers as possible. All POWER9 systems ship with a very powerful IOMMU that allows extremely granular device passthrough (no need to worry about grouping like on x86_64), so pretty much everything else can be handled by guests. In the future it would be possible to offload even the GUI to a separate guest; this would just require some rework of qubes-gui-* to remove the dom0 assumption.

For sandboxing, QEMU has built-in seccomp sandboxing and fedora ships SELinux rules that lock it down further. I'm not sure if there's any additional hardening we can add here, but it seems that these mitigations are reasonably effective considering the relatively low number of exploitable QEMU guest escape vulnerabilities.

As for crosvm, this seems significantly more difficult than the other approaches, mainly because the ivshmem mechanism that libkvmchan uses for shared memory is not implemented. In addition, libvirt would need significant plumbing to support management of crosvm guests too.

I am accepting the KVM route :) and would like to give Shawn 0.5 bitcoin for his current and future work. Can you please either post or DM me an address?

EDIT: Removed address. See later comment for confirmation of receipt.

Rspigler commented 3 years ago

0.5 bitcoin sent! @shawnanastasio please comment when received
OP updated

shawnanastasio commented 3 years ago

Confirming that I have received the 0.5BTC from @Rspigler! Thanks!

EDIT: By request, here is a signed receipt message using the same GPG key that I sign my commits with (available here):

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

This is a signed confirmation that I have received 0.5BTC from @Rspigler. Thanks!
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCAAdFiEEJgD2dO/Kw8R5KUjCH+z5y2FwsnkFAmAq130ACgkQH+z5y2Fw
snmOwAv9HFiw0X/6bced9gKPRHNLz3GfeIkZY0G8jiNDgY1HLsKjELl8of2/8pYn
XbE6f0JHn8ffv9QsGHgkLfmPogKTskQbm/c3Mh4u+ohWSNyixrfaFT8KiO1i4YgL
U1kCo8QlAjAQn/Rnm3/bSS3J5+c98YpHH9LzzENDL0LwAm09+C+iC6CXOUuHQz+U
wQNSFstIZbiLnmrKGijNhI4FB0nhlAEINsoPDEj4IwLc6IgoxfaqzegnfUKlc7XJ
3HFzOQzF7lhfvWDbKLxQxp/VFeAZHGhVBlOiyZYxVL7rYI5bAimh7/kSoI4nKKVd
cSaso8j/c8KTUGmSbCjPoCM++9t+wm2T7NeqZWO2E+yJvGUoGwrMx7B3sFYfJgHk
gLmqjDyul2JcIOFXUNfVcy/fJctg/5lehLJxYNeawqiaeocMCdIwXNCpb66fmaRy
ZpxNRX93MIy3/midjB8jScHjSagLzfLbBLoWmGacrV6b0i5K+TVXyWyEyQQ3lO/e
M1aiAzNi
=MEuj
-----END PGP SIGNATURE-----

Rudd-O commented 3 years ago

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I am placing, in solidarity, a 0.5 BTC bounty on this issue being
developed to completion¹, paid to involved people in the same
proportion that Rspigler decides to shell out his bounty.

¹ Completion is defined as Qubes working (with whatever
virtualization technology) on POWER9, where "working" means
a POWER9 system booting, a VM starting within, and Qubes-RPC
working within the VM.
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEUUn9BY4ZnowjgNibWZnBtXu9hDYFAmAq93EACgkQWZnBtXu9
hDbygg//fkpJwpdgA/T4TFOn8QcEBWuQLT2aac3IlOTFIRFWUXYrcg7snSlOdI5L
uPbg4z/FpAZhsLdiTM7mU0pRdBYw6bZ4VXr/elAI0XpCwr7CG12Al88fX5A6LE9D
Mk7FnzJworvyZaKgzEpe3flzp0E93z60zbwhE+I9+yhmN6iF8eJDH2X7zsvfqfFa
Rg1OMuyjOaO1zxsHzbYe6/TzrWYIs5XGz8EsRtyvRb705TDiTby51G8oAuHWyatl
RQBPsEJIuyN8G3hZQjl5XmFDvySpWS28kvcfcmLQ3GE927fzC0xmZbdNq7NPXONB
M+D7vkB1Pb7SUKgECNr2WlrUDTgN/1exim0RTzSKqlluLawDFXjGHnbFXFGPNxbQ
o0R7/Q+L/XwXnyF3jiIe+P+wu2agOU96Ndpk4AgjJyhirJo/kH582Gw7AkV0dfv+
HlTEKOWkDKKAvgSLf8eZg2tLKodkaGswwtPbn4K4WgC9D3EiD+BGLeqZjCNg/C2A
LclFAbg9YI0d18bX3yePXti3L/Y+TarKi6iy6piuOHSQ0ZNY5ZsZdJVQ9VvMlNc8
ZwQqDeVvgTS5QWTBD4mbEOW2SfRVwCDT8kr5BAdCFhwYmi+PSvNp3Wm67vAwMCmQ
ZhfCFNlAdYVn1gKCOdwLCjNBmbc/tVT341sDaHdF1lhhhUZfiC8=
=Ju8g
-----END PGP SIGNATURE-----

Rspigler commented 3 years ago

Woohoo! We're on a roll.

@Rudd-O thank you. Where can I find your public key?

@Rudd-O now has a formal pledge of 0.5 bitcoin (0 paid so far), filling 0.5 out of 1 of my matching offer.

My pledge increases from 0.5 out of 1 -> 0.5 out of 1.5. My match offer decreases from 1 to 0.5

The total bitcoin bounty is now 3, with 0.7 rewarded and 2.3 remaining.

If my matching offer is filled, there would be another 1 bitcoin available, bringing the total still available to 3.3 bitcoin, which is over $150,000 at current prices.

Rspigler commented 3 years ago

:angry: Sorry for this ridiculous formatting. I do not have enough time to figure out why my inline signed comments are erroring out, but uploading files fixes this.

Rudd_Update_Sig.txt

Updated OP again :) Sharing on twitter to try to get the final match.

Rudd-O commented 3 years ago

Woohoo! We're on a roll.

YEAH!

@Rudd-O thank you. Where can I find your public key?

https://rudd-o.com/gpgkey.asc

shawnanastasio commented 3 years ago

Updated status comment. Qubes fedora-32 template VMs now fully boot! Next step is getting the daemons spawned and vchan connectivity verified.

llebout commented 3 years ago

Awesome work @shawnanastasio, thanks a lot! Do you think we should share the bounty also with @nrgaway for KVM-related work? I am ready to send more of my part of the bounty to you, but I also want to wait and see if anyone else steps up to contribute, so we can share it more or less fairly.

shawnanastasio commented 3 years ago

Awesome work @shawnanastasio, thanks a lot! Do you think we should share the bounty also with @nrgaway for KVM-related work?

Absolutely. I haven't been in contact with them recently, but if they are interested in receiving part of the bounty I think they should certainly be entitled to it. They provided valuable testing for libkvmchan as well as a large amount of integration work surrounding packaging, systemd units, some patching of qubes utilities for kvm, documentation, etc.

Rspigler commented 3 years ago

@leo-lb @shawnanastasio great point

olivierlambert commented 3 years ago

Hi there,

Olivier from XCP-ng project (member of LF Xen project).

Just to let you know, I've had multiple calls with the OpenPOWER Foundation; I'm trying to get real hardware + IPMI to ease development of Xen on PPC. It's not an easy task, but I'll let you know when I have some. I suppose it could also be useful to this project at some point :)

llebout commented 3 years ago

@olivierlambert Hello! That's awesome! We decided to go the KVM route for now, however, because porting Xen to PPC (64-bit) is too much of an effort for us to do initially and then maintain to reasonable standards. On the other hand, QEMU/KVM on 64-bit PowerPC is included in products sold by Red Hat and IBM, so it has a guarantee of rather good maintenance for the foreseeable future.

We would, however, very much welcome Xen working on PPC because it has advantages compared to QEMU/KVM, especially w.r.t. security. How exactly are you planning to do this, and are you interested in the bounty or working in parallel?

olivierlambert commented 3 years ago

We had someone come out of nowhere, telling us he was interested in dedicating a good amount of time to this (but he asked for remote hardware access at some point). I'll contact him soon to learn more about the details and how to move forward (and to get some guarantee that he will be committed to this).

We can also assist indirectly because we (Vates) have started porting Xen to RISC-V, and there are some similarities in the initial work needed to port Xen, regardless of the arch.

llebout commented 3 years ago

@olivierlambert That's great! Why did that person contact you specifically? Who is it? Is it @hanetzer by any chance?

hanetzer commented 3 years ago

@olivierlambert That's great! Why did that person contact you specially? Who is it? Is it @hanetzer by any chance?

Technically I didn't contact them per se. I was asking questions in #xen-devel about compiler errors relating to xen and they started chatting me up about it. They and andyhpp in #xen-devel have been quite helpful.

olivierlambert commented 3 years ago

@leo-lb Small world. Obviously, it seems that my description returned the only matching result in the known universe, @hanetzer :laughing:

@hanetzer we couldn't miss that signal on #xendevel from our radar :smiley: (that's not a first for us, I hired the first Xen RISC-V contributor via the mailing list)

hanetzer commented 3 years ago

@leo-lb Small world. Obviously, it seems that my description returned the only one matching result in the known universe, @hanetzer

Honestly the open-power world is quite small at times. I remember when I was having some issue wrt early hostboot/bootblock on a Talos (btw you should really join #talos-workstation on freenode, lots of knowledgeable nerds on the subject) and found a blog post about it, and it turns out the writer was in the channel.

@hanetzer we couldn't miss that signal on #xendevel from our radar (that's not a first for us, I hired the first Xen RISC-V contributor via the mailing list)

On that note, is there some staging repo I could look at for this? Looking for examples of how to deal with the massive boilerplate required to get it even to compile.

olivierlambert commented 3 years ago

Let me send you an email to invite you to our Mattermost if it's fine :+1: This way you'll have direct access to Bobby who's doing the port.

hanetzer commented 3 years ago

I've heard of mattermost before but honestly can't recall what it even is, lol. Sure, sounds good.

shawnanastasio commented 3 years ago

We had someone coming out of nowhere, telling us he was interested to dedicated a good amount of time to do this (but asked to have remote hardware access at some point). I'll contact him back soon to know more about details and how to move forward (and some guarantee he will be committed to this).

We can also assist indirectly because we (Vates) started to port Xen to RISC-V, and there is some similarities on the initial work needed to port Xen, regardless the arch.

Cool, it'd be great to see a Xen port materialize for POWER systems; I hope it all pans out!

Updated status comment. Qubes fedora-32 template VMs now fully boot! Next step is getting the daemons spawned and vchan connectivity verified.

Updated status comment.

On the KVM front, I have now gotten qrexec and qubesdb running and communicating on both the host and guest! In the process I ended up discovering and fixing a rather serious (and hard to debug) race condition in kvmchand that resulted in deadlocks when starting new VMs sometimes. With that fixed, qubes daemons are functioning as expected!

There are still quite a few kvm hacks I have in my trees that need to be cleaned up and upstreamed, but as I'm quite tired of dealing with systemd units and makefiles, I'll defer bringing those commits up to an upstreamable state until later. For now, I'll move on to getting the GUI daemon working.

@marmarek I recall we discussed potential plans to create a VMM-agnostic interface for qubes-gui's page sharing mechanism, but looking at the upstream qubes repositories it doesn't seem that this has materialized yet. What do you think the best path forward regarding these changes should be? I'm thinking that for now I'll implement a tentative API based off of our prior discussions in libkvmchan and qubes-core-vchan-kvm and fork qubes-gui to use the new interfaces. This will of course require someone to implement the corresponding Xen bits which won't be me since I don't have a Xen setup.

DemiMarie commented 3 years ago

Can we replace QEMU with a different userspace VMM, such as Firecracker?

Another idea I had is to compile the emulation code in QEMU to WebAssembly; this would ensure that a compromised device model cannot compromise the host, since the WebAssembly code can’t escape its sandbox.

llebout commented 3 years ago

@DemiMarie

Can we replace QEMU with a different userspace VMM, such as Firecracker?

Firecracker does not support ppc64[le] so I don't think anyone is going to undertake this.

Another idea I had is to compile the emulation code in QEMU to WebAssembly; this would ensure that a compromised device model cannot compromise the host, since the WebAssembly code can’t escape its sandbox.

Not sure how this is related to this particular issue; in any case, current WebAssembly AOT or JIT compilers don't support ppc64[le] either.

shawnanastasio commented 3 years ago

Can we replace QEMU with a different userspace VMM, such as Firecracker?

I had briefly shared my thoughts on using another VMM here, but essentially I don't believe it's feasible or desirable to switch off of QEMU because none of the alternatives support the ivshmem device model that libkvmchan is based on. In addition, it doesn't seem like crosvm or Firecracker support ppc64{,le} at all, so that would also add a significant component that needs porting.

Another idea I had is to compile the emulation code in QEMU to WebAssembly; this would ensure that a compromised device model cannot compromise the host, since the WebAssembly code can’t escape its sandbox.

This is an interesting idea, but QEMU on Fedora is already built with sandboxing via seccomp in addition to a set of rather strict SELinux policies. As @leo-lb said, none of the non-interpreted WebAssembly runtimes are ported to ppc64le yet either.

In my personal opinion, QEMU with seccomp and SELinux offers an acceptable trade-off between security and robust platform and device support.

marmarek commented 3 years ago

Honestly, I'm not sure what would be the best route for the VMM-agnostic interface for page sharing. Here are some hints and thoughts:

DemiMarie commented 3 years ago
  • release a page shared to another VM, revoke target VM access to it - I'm not sure what should be the behavior if the target VM has the page still mapped, I see three options: fail, forcefully revoke (will it crash the target VM? that would be undesirable), or defer release until the target VM unmaps the page (not sure who would wait for unmap and release resources in that case)

Is this actually needed, or can the source VM just unmap the page?

marmarek commented 3 years ago

Is this actually needed, or can the source VM just unmap the page?

In case of Xen - it's enough to unmap the page. But I wouldn't assume (at the API level) that it is simple for every VMM.

shawnanastasio commented 3 years ago

Thanks for sharing your thoughts. I have some follow-up comments:

  • providing this API as part of qubes-core-vchan-{kvm,xen} repository IMO makes sense; it isn't strictly part of vchan, but on the other hand, vchan is an inter-vm communication API, and page sharing is related

Agreed.

  • the API should work independent of open vchan connection (i.e. it should be possible to use this API without established libvchan_t*); another state-holding object is fine; it may make sense to create this as a separate shared library only distributed in the same repository

OK, this sounds good too - since the existing shmoverride/xorg code doesn't deal with vchans this makes sense.

For the state-holding object, I guess this would look something like an opaque struct pointer like libvchan does it. I'm thinking something like shmem_handle_t.

  • the consumers for this API would be Xorg driver (gui-agent side) and shmoverride.so preloaded in the Xorg server (gui-daemon side), not gui-agent/gui-daemon processes directly; this is the main reason for the previous remark

ACK

  • mapping a page from another domain ideally should be doable via mmap() from some FD - currently we have shmoverride.so which basically intercepts shm{at,dt} calls so it can do anything, but I'd really like to get rid of it (#5910); the API should provide parameters to mmap() to map specific page(s)

OK, so the API should provide something like this then?

struct shmem_mmap_params {
    size_t length;
    int prot;
    int flags;
    int fd;
    off_t offset; 
};

void shmem_get_mmap_parameters(shmem_handle_t *handle, struct shmem_mmap_params *out);

This API would easily allow kvmchand to provide backing for multiple shared memory regions in a single ivshmem device, which is more efficient, but it would have the side effect that API consumers would be able to snoop on others' shared memory by ignoring the parameters we provide and using their own offset. Would this be acceptable? If the only programs that can consume this API are trusted and this is enforced somehow (e.g. by making it root-only), then it seems like it perhaps would be.
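
For illustration, a minimal sketch of how a trusted consumer might use this proposed API (shmem_handle_t, struct shmem_mmap_params, and shmem_get_mmap_parameters are the tentative names from above, not an existing interface):

#include <sys/mman.h>

/* Sketch only: resolve the proposed mmap parameters and map the region.
 * Note that nothing stops a caller from substituting its own offset here,
 * which is exactly the snooping concern described above. */
static void *map_shared_region(shmem_handle_t *handle)
{
    struct shmem_mmap_params p;
    shmem_get_mmap_parameters(handle, &p);
    void *addr = mmap(NULL, p.length, p.prot, p.flags, p.fd, p.offset);
    return addr == MAP_FAILED ? NULL : addr;
}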

  • at GUI protocol level we have a flexible message for page sharing parameters; currently those "parameters" are a list of Xen grant refs (one per page), but at the protocol level it is just a opaque blob that is dispatched to the function handling given page sharing method - with the VMM-agnostic API, this should dispatch to the library handling actual sharing method

OK, this makes sense but I'll have to dig into the implementation more to fully understand what this would entail.

  • I imagine the API needs to provide the following methods:
  • allocating a page(s) to be shared (Xen API requires a page to be allocated specifically to be shared - the API does not allow sharing a page allocated using standard methods, which TBH makes sense); this operation may also take an argument specifying with whom (target "VM id") the page is to be shared; in return it should give both: virtual address where the page is mapped in the process and some kind of reference to be passed to the target VM to map it

OK, this maps very closely to the kvmchand/ivshmem semantics too. We don't have a mechanism for sharing arbitrary normally-allocated pages, but by having kvmchand allocate the pages in a memfd, processes can receive the memfd over the kvmchand socket and map the pages from there.
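
To make that concrete, here is a generic POSIX sketch of receiving a file descriptor (such as a memfd) over a UNIX domain socket via SCM_RIGHTS and mapping it; the function name is illustrative, the actual kvmchand message framing is omitted, and sock/len are assumed to come from the caller:

#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Generic sketch: receive one fd (e.g. a memfd) over a UNIX socket with
 * SCM_RIGHTS and map it. The real kvmchand wire protocol is not shown. */
static void *recv_and_map_memfd(int sock, size_t len)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    if (recvmsg(sock, &msg, 0) < 0)
        return NULL;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return NULL;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return addr == MAP_FAILED ? NULL : addr;
}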

As for the reference type returned, from the kvmchand angle I'm thinking of just implementing this by allocating a unique shared memory ID for two domains that will have access to the region - much like the port numbers in the vchan API. Since the goal will be to send this reference to a target dom over a vchan, I guess it should be some basic type that can be passed by value and directly written to a vchan ringbuf, something like an int typedef.

From the Xen aspect, I assume this reference will have to consist of the Xen grant refs you mentioned earlier? If that's the case then it seems the reference type will have to be arbitrarily sized to support a varying number of grant refs if the API allows allocating more than one page at a time, which doesn't seem feasible.

  • (optionally) allocate several (contiguous in virtual address space) pages - this would make sense only if a single shared page(s) reference can cover multiple pages, otherwise it would be a simple wrapper over the above function, with not much value added

This is what I was wondering in the previous point too. With my proposed unique ID system, implementing a variable-sized memory reference is no problem. Is there a way to do this in Xen without having one grant ref per page? If not then it probably doesn't make sense to bother with this.

  • map page allocated and shared by another VM - as explained above, preferably it should not really map it, but return arguments to be used with mmap()

ACK

  • unmap a page mapped by the above - again, ideally this should really be a cleanup after munmap(), perhaps even no-op in some VMM

ACK

  • release a page shared to another VM, revoke target VM access to it - I'm not sure what should be the behavior if the target VM has the page still mapped, I see three options: fail, forcefully revoke (will it crash the target VM? that would be undesirable), or defer release until the target VM unmaps the page (not sure who would wait for unmap and release resources in that case)

This seems a bit tricky as kvmchand would have no real way of knowing if the pages are still in use by the other end, so it would probably end up being a no-op.

marmarek commented 3 years ago

For the state-holding object, I guess this would look something like an opaque struct pointer like libvchan does it. I'm thinking something like shmem_handle_t.

Yes, like this. Of course if necessary - it may be also that KVM version of this API doesn't need any internal state. But the API should still allow an implementation to have some.

OK, so the API should provide something like this then?

struct shmem_mmap_params {
    size_t length;
    int prot;
    int flags;
    int fd;
    off_t offset; 
};

void shmem_get_mmap_parameters(shmem_handle_t *handle, struct shmem_mmap_params *out);

This should resolve received "shared page id" into mmap() params, possibly doing some preparation first (for example creating that fd). I mean - the function needs an input parameter too. Otherwise you have no way to tell which shared page you want to map.

As for the reference type returned, from the kvmchand angle I'm thinking of just implementing this by allocating a unique shared memory ID for two domains that will have access to the region - much like the port numbers in the vchan API. Since the goal will be to send this reference to a target dom over a vchan, I guess it should be some basic type that can be passed by value and directly written to a vchan ringbuf, something like an int typedef.

Yes, something like int typedef would work (but make sure to specify its length explicitly).

This is what I was wondering in the previous point too. With my proposed unique ID system, implementing a variable-sized memory reference is no problem. Is there a way to do this in Xen without having one grant ref per page? If not then it probably doesn't make sense to bother with this.

Indeed, in the case of Xen it requires one grant ref (AFAIR uint32_t) per page. But if it is easily possible to have one "share id" covering multiple pages, I wouldn't dismiss this option that easily. With one id per 4k page, a 4K display's framebuffer is about 32MB, so the list of IDs for the whole screen runs to thousands of entries. Leaving aside transferring that list (it's larger than the vchan buffer size we use), mapping every page separately means a significant number of system calls, which hurts performance. So, if an API for a given VMM can share multiple pages under a single id (and then map them at once on the other side), that would be a significant win performance-wise. If not, an application can fall back to a share-id per page (I guess by simply iterating the former function).
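
As a hedged illustration of what a multi-page share could look like at the API level (all names here are hypothetical, mirroring the shmem_* proposal above; this is not an existing interface):

#include <stddef.h>
#include <stdint.h>

typedef struct shmem_handle shmem_handle_t;  /* opaque state object, as discussed */
typedef uint32_t shmem_ref_t;                /* a single fixed-width "share id"   */

/* Hypothetical: share npages contiguous pages with peer_domid under one id.
 * Returns the local mapping in *local_addr and the id to hand to the peer
 * (e.g. over an established vchan) in *ref_out. A VMM that cannot share
 * multiple pages per id can fall back to one call per page. */
int shmem_share_pages(shmem_handle_t *h, uint32_t peer_domid, size_t npages,
                      void **local_addr, shmem_ref_t *ref_out);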

This seems a bit tricky as kvmchand would have no real way of knowing if the pages are still in use by the other end, so it would probably end up being a no-op.

If that doesn't leak resources (if pages are freed when both sides unmap it), it should be fine I think.

shawnanastasio commented 3 years ago

For the state-holding object, I guess this would look something like an opaque struct pointer like libvchan does it. I'm thinking something like shmem_handle_t.

Yes, like this. Of course if necessary - it may be also that KVM version of this API doesn't need any internal state. But the API should still allow an implementation to have some.

ACK

OK, so the API should provide something like this then?

struct shmem_mmap_params {
    size_t length;
    int prot;
    int flags;
    int fd;
    off_t offset; 
};

void shmem_get_mmap_parameters(shmem_handle_t *handle, struct shmem_mmap_params *out);

This should resolve received "shared page id" into mmap() params, possibly doing some preparation first (for example creating that fd). I mean - the function needs an input parameter too. Otherwise you have no way to tell which shared page you want to map.

Oh, ok. I was thinking that each shmem_handle_t would refer to a given region, but now that you mention it, separate region descriptors make sense.

As for the reference type returned, from the kvmchand angle I'm thinking of just implementing this by allocating a unique shared memory ID for two domains that will have access to the region - much like the port numbers in the vchan API. Since the goal will be to send this reference to a target dom over a vchan, I guess it should be some basic type that can be passed by value and directly written to a vchan ringbuf, something like an int typedef.

Yes, something like int typedef would work (but make sure to specify its length explicitly).

Sure, I'm probably going to use a uint32_t internally, but the size should probably not be a part of the API as IMO callers should just be expected to use sizeof on the reference.

This is what I was wondering in the previous point too. With my proposed unique ID system, implementing a variable-sized memory reference is no problem. Is there a way to do this in Xen without having one grant ref per page? If not then it probably doesn't make sense to bother with this.

Indeed, in the case of Xen it requires one grant ref (AFAIR uint32_t) per page. But if it is easily possible to have one "share id" covering multiple pages, I wouldn't dismiss this option that easily. With one id per 4k page, a 4K display's framebuffer is about 32MB, so the list of IDs for the whole screen runs to thousands of entries. Leaving aside transferring that list (it's larger than the vchan buffer size we use), mapping every page separately means a significant number of system calls, which hurts performance. So, if an API for a given VMM can share multiple pages under a single id (and then map them at once on the other side), that would be a significant win performance-wise. If not, an application can fall back to a share-id per page (I guess by simply iterating the former function).

Ok, that's a reasonable point. I'll implement multi-page regions for libkvmchan then and defer to others for the Xen side.

This seems a bit tricky as kvmchand would have no real way of knowing if the pages are still in use by the other end, so it would probably end up being a no-op.

If that doesn't leak resources (if pages are freed when both sides unmap it), it should be fine I think.

Yep, sounds good.

Rspigler commented 3 years ago

Thank you everyone for all the great work on this!

jaesharp commented 3 years ago

Is there a guide and/or some notes on building a QubesOS image for the Talos II and/or Blackbird systems and testing the KVM port? I'd like to help out by testing and debugging but I'm afraid I'm not too familiar with the QubesOS development/porting workflows.

shawnanastasio commented 3 years ago

Is there a guide and/or some notes on building a QubesOS image for the Talos II and/or Blackbird systems and testing the KVM port? I'd like to help out by testing and debugging but I'm afraid I'm not too familiar with the QubesOS development/porting workflows.

The port is still in a very early stage, so there's no solid documentation yet (and only the most basic core functionality works so it's not very useful yet).

If you want to play around with stuff, the rough steps are:

  1. Install Fedora 32 on your host
  2. Clone my qubes-builder fork
  3. Build the qubes-vm, qubes-dom0, template targets
  4. Install the built dom0 packages to your Fedora 32 host using this script
  5. Follow the steps in the HOST section of @nrgaway's documentation
  6. Follow the steps in the VIRTUAL MACHINE section of the documentation
  7. Ensure all the necessary daemons are running with this script
  8. Launch a vm with qvm-start and cross your fingers :)

Let me know if you (or anybody else) end up trying this out. There are tons of rough edges and some of these steps will certainly fail and reveal issues that need to be fixed.

shawnanastasio commented 3 years ago

Hi all, just a short status update as to what I've been up to.

I've been implementing the proposed shared memory API in libkvmchan and just recently got basic page allocation and mapping working (https://github.com/shawnanastasio/libkvmchan/commit/acedc5929800f1d0e31b780f75e6fc45ec0f756c)! Now that the groundwork is laid, I'm going to continue implementing the full API, and then hopefully I can start working on porting qubes-gui.

shawnanastasio commented 3 years ago

I've run into a slight roadblock and would appreciate some feedback on how to proceed, cc @marmarek.

My implementation of shared memory regions relies on creating large ivshmem memory pools from which individual shared pages are allocated. This approach has numerous benefits over allocating one ivshmem pool per shared memory mapping, including speed, memory efficiency (each pool must have a size that is a power of 2, so servicing a single mapping will almost always require rounding up and wasting space), and others.

The problem is that the Linux VFIO framework, which is used for mapping the ivshmem devices from guest userspace, does not allow passing an arbitrary offset to mmap (it uses the offset field for something else). This means it is only possible to mmap these pools starting at the beginning. We can work around this by always mapping the pool from the beginning and munmap-ing the undesired parts afterwards, but this means the proposed mmap API can't be implemented: the library must be responsible for mapping the pages and returning the correct pointer to the user.
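
A rough sketch of that workaround, under the assumptions that the pool is exposed as a single mappable VFIO region and that offsets are page-aligned (the function name is illustrative; region_off is the VFIO region offset, since VFIO uses the mmap offset to select the region rather than to index into it):

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Illustrative only: map the whole ivshmem pool, then unmap everything
 * outside the requested window and return a pointer into the remainder. */
static void *map_pool_window(int pool_fd, off_t region_off, size_t pool_size,
                             size_t win_off, size_t win_len)
{
    uint8_t *base = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, pool_fd, region_off);
    if (base == MAP_FAILED)
        return NULL;
    if (win_off > 0)                              /* drop the head */
        munmap(base, win_off);
    if (win_off + win_len < pool_size)            /* drop the tail */
        munmap(base + win_off + win_len, pool_size - (win_off + win_len));
    return base + win_off;
}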

What are your thoughts on this? Do you think we can get away with an API that simply returns a pointer to the mapped region instead of an mmap parameter struct? If not, I will likely have to write a custom linux driver for the ivshmem devices sidestepping issues with VFIO entirely, but this is not a route I'm eager to go down if alternatives are possible.

marmarek commented 3 years ago

What are your thoughts on this? Do you think we can get away with an API that simply returns a pointer to the mapped region instead of an mmap parameter struct?

This is unfortunate, but acceptable. In fact, Xen exposes similar abstraction - we get to call xengnttab_map_grant_refs(), even if internally it uses just mmap. So, this will work, but we'll be unable to make https://github.com/QubesOS/qubes-issues/issues/5910 happen.

shawnanastasio commented 3 years ago

This is unfortunate, but acceptable. In fact, Xen exposes similar abstraction - we get to call xengnttab_map_grant_refs(), even if internally it uses just mmap. So, this will work, but we'll be unable to make #5910 happen.

Ok, good to know. It's unfortunate that this API won't be ideal, but it at least allows forward progress for the moment.

Rspigler commented 3 years ago

What are any possible downsides of this?

Thank you for your continued work!

shawnanastasio commented 3 years ago

What are any possible downsides of this?

Thank you for your continued work!

There are no user-facing consequences; it just means that the internal mechanism qubes uses for handing shared memory to the X server will need to tolerate not having control over mapping the memory, which is already the case.

Rspigler commented 3 years ago

Surprised we don't use Wayland

DemiMarie commented 3 years ago

@Rspigler That is planned :)