nova: multiprocessor support on Genode/Nova

Genode/Nova currently supports only one CPU. Today we have plenty of CPUs - make use of them.

I have now some early prototypes ready. In the branch issue_814 all adaptations of the Genode base framework and base-nova are kept which are required to create threads on CPUs other then the boot CPU - mainly adding support to create the pagers and server threads on the appropriate remote CPU. In Genode/Core still all threads beside the pager threads (for base-nova there are one per client thread, for base-foc there is only one) are solely on the boot CPU running. So in order that threads on CPUs other then the boot CPU can talk to the services on different CPUs or to Genode/Core services, some kind of cross CPU IPC (xCPU IPC) is required. The approach like done in NUL to force developer of servers to create on all CPUs at least one thread is believed be too strict for Genode by now.

For the xCPU IPC stuff I came up with three prototypically implemented options. In order to be able to get connected to a service running on a different CPU the issuer of the IPC need some proxy thread being on the CPU as the service. (The other option would be to pull the server thread on the CPU of the caller, which is not investigated nor implemented.). The question now is, who provides this proxy thread stuff.

Pointers to source code of the now to be discussed options:

Option 1 - Core emulates xCPU IPC using Genode/Thread abstractions - core_genode_threads
Option 2 - Core emulates xCPU IPC using just Syscall/Thread bindings - core_syscall_threads
Option 3 - Kernel emulates xCPU IPC directly core_kernel

All three branches are based on issue_814 and one of them is required to actually let it fly.

For option 1 and 2 Genode/Core provides this service. If a client can't issue a CPU local IPC, it asks Genode/Core (actually the pager of the client thread) to do the job. Genode/Core would then spawn or reuse a proxy thread on the target CPU and do the IPC on behalf of the client. Option 1 and 2 only differs in regards of code size and whom to account the required resources (proxy thread needs a stack, some capability selectors).

For option 1 the Genode/Threads abstractions are used - that means inside Core some space for the Utcb and stack inside the Genode thread context area are required. (The thread context area is some fixed size virtual region where all threads have their stacks and Utcbs.) At this level Core would have now somehow to implicitly account it to the client (stack can be up to 1MB, 4K for the MessageBuffer), which is currently missing in the implementation. For now in option 1 the proxy threads are created once for all CPUs and then cached and reused.

For option 2 the proxy thread creation is done directly using solely the nova syscall bindings of create_sc/create_ec/create_pt to get the proxy thread constructed. The required stack space (128 Bytes) is part of the pager object, so it gets already accounted to client during pager thread creation. A free virtual region for the utcb is looked up directly in the allocator of core. In contrast to option 1 there is no Thread context area space 'wasted'. Additionally the source code compared to option 1 is (much) shorter (and from my perspective easier to read and understand).

A big issue for option 1/2 remains so far. In order to delegate capabilities during the emulated xCPU IPC - Genode/Core have to receive the capability mappings, to keep them and to delegate or translate it further to the target thread. In Genode there are no information for the low level layer for base-nova, whether actually the capability should be solely translated or solely delegated. In the most cases Core could revoke the received mappings because the target (server) gets the capability translated, but Core don't know by its own about this fact. That means - Core has to keep all received mappings. It has to spend per emulated xCPU IPC with capability delegation some capability selectors and has to track all of them - in order to free it up at end of the lifetime. The question is here whose lifetime - the one of the client thread (easy to implement) or one of the cap session of the client (which one ?, how to get access to from the pager) or of the address space (not so easy, missing interfaces). All this kind of tracking infrastructure is currently not implemented for the prototypes option 1 and 2, but of course would be required for a final solution based on these both options.

For option 3 the same general functionality as for option 1/2 is implemented in the kernel instead of Genode/Core. A separate xcpu_ipc syscall is added as kernel extension which is performed if the direct ipc call fails with bad_cpu kernel error code. The xcpu ipc kernel syscall creates - similar to option 1/2 - a sm and a, ec, sc on the remote cpu and let it run on behalf of the client thread. The client thread gets suspended until the remote proxy thread is done signalled by the semaphore. The remote proxy thread takes the utcb of the suspended client thread as is and issues the IPC call. When the proxy thread returns it wakes up the caller and de-schedules itself. The proxy thread itself actually never runs in user mode. If the client thread wakes up it takes care to initiate the deconstruction of the created proxy resources (ec, sc, sm). (Also caching could be added here as option.). The main advantage of option 3 compared to option 1/2 is that we have not to keep and to track the capability delegations during a xCPU IPC and we don't have potentially up to two additional address space switches per xCPU IPC (client to core, core to server). Additionally we don't have to come up with a new UTCB for each proxy thread as necessary for option 1/2. The disadvantage is of course that it is now in kernel and especially during destruction of the proxy thread we must take some care.

One main question for all options remains - which scheduling parameters should be chosen for the remote CPU. Currently the parameters are taken - as is - from the caller thread and applied to the proxy thread on the target CPU. If you imagine that on different CPUs different priority paradigm exists (so priority of level X on CPU 0 is not equal to priority level X of CPU 1) then you get in trouble. Or imagine for some of the CPUs some scheduling admission is performed and now out of thin air a Scheduling Context of a remote CPU pops up without any kind of admission checks (due to a xCPU IPC) - the admission is useless then.

That's all for the moment, I'm up for your comments and interested in the preferred option.

By now for all three options the affinity test (base/run/affinity.run) works and a simple mp_server (base/run/mp_server.run) which creates RPC Entrypoints (servers so to say) on different CPUs and performs RPC Genode calls triggering xCPU IPCs.

Some minor notes regarding the current implementation of the options:

For option 1/2 a kernel patch is added to get the remote CPU number back, when a IPC call fails. The CPU number is used by Core to get a proxy thread on the appropriate CPU. This patch is not required per se - Core would just have to store next to the portal selectors either directly the CPU number or a pointer to the threads a portal belongs to to find out the remote CPU number.

For option 2 another kernel patch is added to create the proxy thread on a remote CPU without getting a start-up exception. If option 2 would be chosen, this would be removed and replaced by having on every CPU in Genode/Core a thread to respond to the initial start-up exception for global threads (proxy threads for this discussion).

For option 3 all the mentioned kernel patches are not required - solely the actually xCPU IPC handling patch in the kernel is required.

I see three general alternatives for using multiple cores. Let's go through them one by one...

1) Core-Local IPC with one server EC per core This is the model advocated by the current NOVA design, which is also used by NUL/NRE. This model does not use any locks for the entire IPC path and therefore scales excellently. For access to shared data inside the server, the server must perform fine-granular locking internally. Another benefit is that the server EC does not need an SC, which avoids all scheduling-related issues such as convoying and priority inversion. And helping just works. This model requires more resources than the others, but you don't need to create server ECs for all cores, just for those cores where callers exist.

2) Core-Local IPC with a single server EC In contrast to 1), the server EC is shared between clients on different cores. For this model, you need to lock the rendezvous with the server EC, which limits scalability. However, since the server EC is serializing the entries into the server, you don't need server-internal locking. Like in 1), the server EC does not need an SC. The main drawback of this approach, apart from the locking, is that the server EC bounces back and forth between the cores of its callers, which is generating lots of coherency traffic. Whenever the server EC is called from a different core than the last time, cache coherency will migrate the working set (KTCB, UTCB, user stack and all user data) associated with the server thread, which easily costs in excess of 100 cycles per cache line. Helping gets complicated and requires a sophisticated arbitration protocol to decide which client EC from what core can help the server EC at a particular point in time. Helping will cause yet more coherency traffic and due to cross-core helping, no code in the server can ever assume that it runs on a particular core.

3) Cross-Core IPC with a single server EC In this model the server EC has an SC of its own and is therefore bound to a particular core. You still need to lock the rendezvous with the server EC, but at least you don't need to migrate the working set anymore. However, because the server EC now has its own SC, you need to determine the scheduling parameters for that SC. Since you cannot predict for how long and how often the server gets called, this will be difficult. Furthermore, due to having its own SC, the server EC is now subject to the scheduling rules of its core, which means it will compete with other threads on that core. If the server EC gets preempted or runs out of time, its client will stall (and obviously also all other clients queued behind it). This model will cause really bad IPC latency in those cases. Helping is impossible.

The currently proposed Genode approach seems to go in the direction of 3), which means you'll run into the mentioned issues:

Locking the server EC will limit scalability. This will already show up with 4+ cores.
Configuration of the server SC is going to be difficult.
You'll get very jittery IPC whenever the server EC cannot immediately run on its core, or when it is preempted or runs out of time. If there is heavy load on the server core, you will observe IPC latencies that can amount to the sum of multiple time slices.

Apart from that, the creation and destruction of proxy threads will add further overhead, but that is just an implementation issue and not a systematic problem. I would encourage you to revisit the pros and cons of the three listed approaches first.

Personally I rather would go into direction 1. Unfortunately at least some of the following points are not solvable as quick as somebody may think of.

This model requires more resources than the others, but you don't need to create server ECs for all cores, just for those cores where callers exist.

In a setup where you know the set of CPUs callers can come from at boottime of the server - this is ok. In a setup where callers can pop up on CPUs you didn't know in advance you can't do that. On Genode a ordinary user can interactively start new processes and may decide to put it on a CPU your server is not running on. You either prevent this - or - you have to serve it.

For this you may imagine some protocol, where you can boostrap on the fly a corresponding worker thread on the CPU of the caller and have to return back the portal (capability) to the caller. This whole protocol however must be done on a CPU where somebody of the callers address space (or its parent) and the server has already a thread running, otherwise there is no way that they can communicate which each other. You have to tell the server on which CPU to start the worker thread and which capability original was intended to be invoked, so that the server creates a portal representing the same server object as the original portal and so on and so on ...

I would consider this on the fly option rather complicated and it would touch several generic parts of the Genode framework. Whether this kind of change at the end of the day would be accepted just to make base-nova fly and all other base-* platforms don't require this change - I don't know.

Of course all other base-* platform would also benefit at the end in terms of scalability. However - as you know - Genode/ARM is also a heavy working field and there we can't easily spend amounts of memory for all servers to have them - just for the case - on all CPUs.

The other way of doing this multiple worker thread paradigm just for base-nova and hide all this behind the generic Genode base framework is probably not going to work - or at least it isn't so beneficial. So - Genode servers by now are written independent of the kernel they are finally will run on. If you have a Genode server written in a single threaded fashion without any kind of synchronization and now the server will be started on base-nova it would become (behind the scenes) multi threaded. What to do ? You would just have multiple worker threads in the server for base-nova and as first action you would grab a server global lock in the base-nova case. It doesn't sound beneficial at all - especially if the lock would be a nova semaphore where helping also ends.

Just to be clear - the approach as described by my former posting does not prevent the writing/developing of Genode servers as you have described it in your alternative 1. The current approach just make the existing Genode software stack running on base-nova with all its Genode servers as is in a SMP setup (so kind of alternative 3 without having the locking stuff in the direct local IPC path but having allocation/deallocation overhead for cross IPC case.) So, from my understanding your described alternative 1 and 3 (kind of) are supported by Genode.

The next steps towards scalability would/could be to transform servers of Genode into multi threaded where required and on the way maybe to introduce some kind of MP capable Genode Rpc_entrpyoint abstraction which works for all kernels and not solely for nova.

For this you may imagine some protocol, where you can boostrap on the fly a corresponding worker thread on the CPU of the caller and have to return back the portal (capability) to the caller. This whole protocol however must be done on a CPU where somebody of the callers address space (or its parent) and the server has already a thread running, otherwise there is no way that they can communicate which each other. You have to tell the server on which CPU to start the worker thread and which capability original was intended to be invoked, so that the server creates a portal representing the same server object as the original portal and so on and so on ...

To me it sounds like it could work the same way you currently bootstrap Genode on a single core, except that you don't open a session to service S, but you open a session to service S on core C. Where C is not necessarily the physical core number, it may be a virtual core number or something as abstract as "the core I'm on". When you propagate a session establishment request up the tree towards the root, the decision is then either

tell the caller that the service is not going to be available on the requested core (optionally tell the caller what core to use instead)
tell the service provider to spawn the service on the requested core

For other base-systems where you cannot or don't want to provide core-local services in a multi-threaded fashion, you can respond to such session requests with "no need to create a session on this core, because you can use the one that already exists and it's global".

I think ultimately Genode will want to go multi-threaded core-local to some extent. Think of a gigabit NIC with multiple queues. You don't want a single driver thread to drive that NIC, but rather have core-local driver threads that each have their dedicated RX/TX queues. Also, if you're going to NUMA, you want a service thread in each NUMA domain that uses only node-local memory.

It seems, that the version 3 of my original post is preferred. So I updated the https://github.com/alex-ab/genode/tree/issue_814_kernel branch accordingly and put the xCPU IPC patch also to the Genode/Nova kernel branch r3.

genodelabs / genode

nova: multiprocessor support on Genode/Nova #814