TUD-OS / NRE

NOVA runtime environment (official branch)
GNU General Public License v2.0

Virtual IRQs (or Retiring Kernel Semaphores) #25

Closed: blitz closed this issue 11 years ago

blitz commented 11 years ago

Disclaimer: This is only meant as discussion and is not ready to be merged.

I have implemented virtual IRQs for NOVA. The implementation is influenced by Fiasco.OC's virtual IRQs, but differs in some important ways. The goal is to have a simple event notification system in the kernel that supersedes kernel semaphores and simplifies Vancouver.

Comments are welcome.

New kernel mechanisms

Let me outline what the implementation currently provides: At creation time, vIRQs are bound to one EC (the handler), which can wait (block) for events on all virtual IRQs that are connected to it.

vIRQ objects can be freely delegated, and the only operation on them is to trigger them. Events are distinguished by attaching an event bitmask to each vIRQ object (also at creation time). If the vIRQ is triggered, this bitmask is OR-ed into the handling EC's UTCB. I repurposed utcb.tls (now named utcb.event), because no one uses it.

ECs can voluntarily block to wait for events. If events are pending, ECs wake up (or do not block at all). The handling EC is then responsible for zeroing the relevant bits in the event bitmask in its UTCB. The block operation takes an additional mask parameter to limit wakeups to specific events.
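A rough user-level model of these semantics (the real state lives in the kernel and the handler's UTCB; all names here are illustrative, not the actual syscall interface):

```cpp
#include <atomic>
#include <cstdint>

// Model of the per-EC event word the kernel keeps in the UTCB
// (utcb.event in the patch). Names are hypothetical.
struct HandlerEc {
    std::atomic<uint32_t> events{0};  // stands in for utcb.event

    // Triggering a vIRQ ORs its creation-time bitmask into the event word.
    void trigger(uint32_t virq_mask) {
        events.fetch_or(virq_mask, std::memory_order_release);
    }

    // Models the block operation's mask parameter: return pending events
    // matching 'mask' and clear them. A real EC would enter the kernel and
    // sleep when (events & mask) == 0; here we only model the non-blocking path.
    uint32_t poll_and_clear(uint32_t mask) {
        uint32_t pending = events.load(std::memory_order_acquire) & mask;
        if (pending)
            events.fetch_and(~pending, std::memory_order_acq_rel);
        return pending;
    }
};
```

Note how two triggers of the same vIRQ before the handler runs merge into one set bit: events are not counted.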

There is also experimental support for adding another EC to the virtual IRQ that gets recalled whenever the handler EC is not actively blocking. Those two ECs may be the same or not. This feature is useful to implement asynchronous timeouts or events in Vancouver that need to be handled in VCPU context. This removes one kernel entry and context switch in Vancouver's usual IRQ injection path.

The code is in my NOVA repository in the virq branch. The syscall interface is not very beautiful at the moment and there is minor code duplication with the semaphore implementation, but the current goal is to make tracking upstream NOVA feasible for me.

New userspace mechanisms

I implemented a userspace mutex built on top of virtual IRQs. It survives stress testing and may be used in all places where NRE currently uses userspace semaphores. I will try out the latter in the coming days and see what happens.

The implementation is in include/kobj/Mutex.h.
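A futex-style sketch of how such a mutex can be layered on the notification primitive. The virq_trigger()/virq_block() stubs stand in for the actual trigger and block syscalls; the real include/kobj/Mutex.h may differ in detail:

```cpp
#include <atomic>

static int wakeups = 0;                 // test instrumentation only
void virq_trigger() { ++wakeups; }      // stub: real code triggers the vIRQ
void virq_block()   {}                  // stub: real code blocks in the kernel

// Classic three-state mutex (0 = free, 1 = locked, 2 = locked with waiters),
// built on a wakeup primitive without counting, as the text argues suffices.
class Mutex {
    std::atomic<int> state{0};
public:
    void lock() {
        int c = 0;
        if (state.compare_exchange_strong(c, 1))
            return;                     // fast path: uncontended, no syscall
        if (c != 2)
            c = state.exchange(2);      // announce that we are waiting
        while (c != 0) {
            virq_block();               // sleep until unlock triggers us
            c = state.exchange(2);
        }
    }
    void unlock() {
        if (state.exchange(0) == 2)     // waiters present?
            virq_trigger();             // wake one of them
    }
};
```

The uncontended path needs no kernel entry at all; only contention touches the vIRQ.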

Pros/Cons

Pro:

Contra:

nfeske commented 11 years ago

Very interesting! This mechanism seems to almost fit the requirements for Genode's signalling API:

https://github.com/genodelabs/genode/blob/master/base/include/base/signal.h

This API uses a so-called Signal_receiver as the object that is able to receive signals. Using the function Signal_receiver::manage(signal_context), the caller obtains a capability for the specified signal handler (similar to how an RPC entrypoint returns an object capability when associating an RPC object with the entrypoint). The returned signal-context capability can be passed around. Anyone in possession of such a capability can use the Signal_transmitter::submit() function to trigger a signal. The signal receiver can wait for signals by calling Signal_receiver::wait_for_signal(). There is also a way to probe for pending signals w/o blocking. When a signal occurs, the signal receiver provides the signal context and the number of pending signals of the occurred type.
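A toy, single-threaded model of the usage pattern described above. The real API in base/include/base/signal.h uses real capabilities and blocks in wait_for_signal(); this sketch glosses over both and exists only to illustrate the counting semantics:

```cpp
#include <map>

// Toy stand-ins for the Genode types named in the text; details differ
// from the real interface (namespaces, capability types, blocking).
struct Signal_context { unsigned pending = 0; };

using Signal_context_capability = Signal_context*;  // stand-in for a real cap

class Signal_receiver {
    std::map<Signal_context*, bool> contexts;
public:
    Signal_context_capability manage(Signal_context* ctx) {
        contexts[ctx] = true;
        return ctx;                       // real API returns a capability
    }
    // Returns one context with pending signals and its count; the real
    // wait_for_signal() blocks instead of returning nullptr when idle.
    Signal_context* wait_for_signal(unsigned& count) {
        for (auto& [ctx, unused] : contexts)
            if (ctx->pending) {
                count = ctx->pending;     // number of pending signals
                ctx->pending = 0;         // delivered all of this type at once
                return ctx;
            }
        count = 0;
        return nullptr;
    }
};

struct Signal_transmitter {
    Signal_context_capability cap;
    void submit(unsigned num = 1) { cap->pending += num; }  // counting, not bits
};
```

The key contrast with the vIRQ bitmask: two submit() calls yield a count of 2 here, whereas two triggers of the same vIRQ merge into one set bit.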

I say "almost" because of the following observations:

From these observations, the following questions follow:

In Genode's API, each wait_for_signal call returns one signal (vIRQ) type at a time. If multiple signals of different types are triggered at the same time, multiple subsequent calls of wait_for_signal are needed to obtain them. To me, this looks like the most significant difference from your mechanism, where you represent different vIRQs by different bits.

Btw, there exists an implementation of Genode's signalling API at kernel level in the base-hw platform, created by @m-stein. If you are interested in having a look, maybe Martin can provide you with the relevant pointers?

It would be really cool to have Genode's Signalling API supported by NOVA as a first-class kernel mechanism. Right now, we have to jump through a number of hoops to implement the correct semantics on top of the current NOVA interface.

blitz commented 11 years ago

You seem to actually require the two things I thought no one uses. ;)

The current implementation in the kernel is very simple, because events do not need to be queued but naturally merge in the machine-word-sized event bitmask that is directly visible in the handling EC's UTCB. This has the nice property that no IPIs need to be sent when a handler polls its events field, and that even when multiple events trigger, the handler does not need to re-enter the kernel just to fetch a new event. It only has to enter the kernel to block, if no event bits are set.
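A small simulation of that property: the handler drains the event word with a plain atomic exchange in user mode and only calls into the (stubbed) kernel when no bits are set. sys_block() is a hypothetical stand-in for the blocking syscall:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

static std::atomic<uint32_t> event_word{0};   // models utcb.event
static int kernel_entries = 0;                // counts simulated syscalls

void sys_block() { ++kernel_entries; }        // stub: a real EC would sleep here

// Run the handler loop for a fixed number of rounds; each round either
// consumes all pending events without a syscall or "blocks" once.
std::vector<uint32_t> drain_events(int rounds) {
    std::vector<uint32_t> handled;
    for (int i = 0; i < rounds; ++i) {
        uint32_t e = event_word.exchange(0);  // atomic read-and-clear, no syscall
        if (e == 0) {
            sys_block();                      // enter kernel only when empty
            continue;
        }
        handled.push_back(e);                 // multiple events arrive merged
    }
    return handled;
}
```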

The obvious disadvantage is that after you exceed the number of bits, things get hairy. But since events are not counted, it is easy to reuse bits for multiple event sources. Are there situations where you wait for a large number of different kinds of events in Genode?

Regarding counting, the mutex implementation does not need counting, and as far as I can see, most other uses involve something producer-consumer-like, which doesn't need counting either. Do you have an example?

The problem with badges is that you basically have sort of an asynchronous IPC operation with them. Messages need to be queued at the handling EC and you would have to implement something more like a receive_async_ipc system call than the very primitive block_if_no_event_pending that I wrote. It would naturally allow for counting events, though. As far as I understood it, this is the way IRQ objects are implemented in Fiasco.OC.

Nils-TUD commented 11 years ago

I think, if we change the concept slightly, we could have badges and counting. And we could even have the enter-kernel-once property at the same time:

If I haven't missed something, the only problem I see is the race condition when checking for events. We might have to lock the entire list while doing that, to prevent new Vis from being added or removed in the meantime. Although I'm not sure how problematic that is. But first we should make sure that it is really worth it to have badges and counting :)

nfeske commented 11 years ago

@blitz

It is true that for the most part, wake-up semantics for signals (w/o the counter) suffices. But there are examples where the counter is useful:

Even though both examples could possibly be implemented differently, the important point is that this part of the API is actually used because it is the most practical and intuitive way to solve the given problem. (This is what I consider most important for the design of an API.)

Regarding the bits versus badges topic: for now, we do not have a use case where one thread receives more than 32 signals. But the API does not impose such a limit either. As we shape Genode components more and more towards an asynchronous mode of operation, I can see how we will hit such a limit in the future. For example, let us assume that we have one parent process that registers a segfault handler for each child. With a hard limit of 32 on the number of signal contexts, the parent could not easily accommodate more than 32 children. Although a limit of 32 or 64 or whatever number of bits looks like a practical compromise at first sight, as soon as the limit is reached, it becomes not just a minor nuisance (i.e., a slight performance degradation): the component developer will actually hit a hard wall. Therefore, I'd like to avoid such limits at the API level. I think another argument in favour of using a badge is that it would make the mechanism conceptually consistent with the way NOVA handles IPC. As you already mentioned, the mechanism is actually a kind of asynchronous IPC, just w/o payload.

@Nils-TUD

Your suggestion is spot-on. Making counter, badge, and list pointer part of a Vi (or signal context or... whatever is a good name) kernel object will solve the problem. This is how @m-stein implemented the mechanism for base-hw. This way, signals won't queue up infinitely in the kernel. The queue size is naturally bounded by the number of Vi kernel objects. I can't really comment on the question with regard to eventual issues with locking the list. I tend to be a bit naive on these kinds of considerations. ;-)

blitz commented 11 years ago

As I understand Nils' solution, it has starvation problems, and events can be delivered out of order. But that is easy to fix. More comments later.

blitz commented 11 years ago

@nfeske Regarding the page fault handler example, I am not sure if I understand it correctly. If the page fault handler has to fetch page fault information from core anyway, why can't it receive the number of remaining faults at the same time? And for the timer example: counting ticks is a bad idea when you can instead just take the current time and compute the number of ticks yourself. So far I am not convinced that counting is useful. :)

But as you already said, what you really want is a cross-core asynchronous IPC operation with a fixed payload where counting comes naturally anyway. And it seems that is what you want for core local communication as well. That seems quite different from what NOVA actually provides. What is the rationale for moving to a completely asynchronous model?

And lastly, how do you implement your signaling interface on NOVA right now?

udosteinberg commented 11 years ago

@nfeske: I'd also like to know why you favor a completely asynchronous model. If you abandon synchronous IPC, you'll be at the mercy of the scheduler. Example: Threads A, B, C, D, E, F are all located on the same core and are scheduled round-robin. A wants to communicate with D. When A sends a request to D, the message will be delayed by the execution of B and C, and when D responds, the message will be delayed again by the execution of E and F. It should not affect overall throughput much, but I believe your latency will suffer badly.

nfeske commented 11 years ago

@blitz Sure, the page fault handling could be implemented differently. The current way is just the most straightforward one. Regarding the timer example, this is not too far off from being useful. E.g., a web server might use signals to respond to incoming traffic (at a high rate), signals to respond to completed block I/O operations, and also a signal for processing timeouts or implementing watchdog functionality. Why would the server thread want to poll for the time if it could just use a periodic signal? You can argue that one can implement things differently. You can always program "around" the limitations of a system. In my opinion, the developer should just use the most straightforward way to express his concern. In both examples, evaluating the signal count looks to me like the most intuitive way, whereas your alternative (and supposedly superior) solutions are imposed by the limitations of the system.

The rationale behind moving towards asynchronous operation is to make the system easier to understand and more predictable. The new USB driver is a good example. In the past, DDE Linux was a nightmare to debug because of the concurrency of interrupt handlers and the main code of the driver. The move away from preemptive multithreading to cooperatively scheduled contexts combined with signals drastically reduces the likelihood of bugs, in particular race conditions. Indeed, we can implement spinlocks as no-ops. Also, deadlocks that used to occur sporadically depending on the scheduling of the kernel become immediately visible. It is a bit like comparing the execution models of Fiasco.OC and NOVA. On Fiasco.OC, where system calls may block within the kernel, there needs to be a kernel thread per user thread, and the kernel consequently has to maintain complex state (a kernel stack) per user thread. In contrast, NOVA is modelled like a state machine. It never blocks in the kernel. Which one is simpler to reason about?

That said, I do not favour abandoning synchronous IPC. I just want to abandon the pattern of blocking at the server side.

The current signalling mechanism on NOVA works as follows: There is a SIGNAL service in core that provides the creation of signal contexts and a way to submit signals to contexts (of any signal session) via an RPC call. Furthermore, there is an interface to let a signal-session client block for signal contexts created via this session. On NOVA, this blocking is implemented by a kernel semaphore. If a thread wants to issue a signal on a given signal-context capability, it performs an RPC to its signal session. Core will just look up the core-internal signal context object, increase its counter, and tell the corresponding signal session to wake up the blocking receiver thread in the destination process. When the receiver wakes up, it queries the pending signal information (count and the opaque context value that we call "imprint") from core via another RPC call. This information, in turn, is used to locally enqueue the signal at the respective Signal_receiver object. The enqueuing operation will eventually wake up the thread that currently blocks in the Signal_receiver::wait_for_signal call.

You see, the current mechanism needs quite a few indirections. Therefore, I would welcome replacing it with a kernel mechanism.

udosteinberg commented 11 years ago

@blitz: Apart from the additional recall functionality - why couldn't the same be implemented with the current semaphores and the event mask living in a piece of shared memory? That is, trigger atomically sets the desired event bits in the event mask and performs a semup. And the blocked EC atomically read-clears the desired event bits from the event mask when it returns from semdown.

You don't want to block when the event mask is non-zero and I think you would get that, because a non-zero event mask means there is a pending semup on the semaphore, or there will be. You could only block for a longer time if the trigger-EC was preempted right after setting the event mask and before being able to do the semup.
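A sketch of this emulation, with stubbed, non-blocking sem_up()/sem_down() standing in for NOVA's semaphore operations (in the real scheme, sem_down() would block the EC):

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> shared_mask{0};  // event mask in shared memory
static std::atomic<int>      sem{0};          // stub semaphore counter

void sem_up() { sem.fetch_add(1); }           // stub: would wake a blocked EC
bool sem_down() {                             // stub: non-blocking model
    int c = sem.load();
    return c > 0 && sem.compare_exchange_strong(c, c - 1);
}

// Triggering side: atomically set the event bits, then semup.
void trigger(uint32_t bits) {
    shared_mask.fetch_or(bits);
    sem_up();
}

// Waiting side: semdown, then atomically read-and-clear the bits of interest.
uint32_t wait_events(uint32_t interest) {
    if (!sem_down())
        return 0;                             // a real EC would block here
    return shared_mask.fetch_and(~interest) & interest;
}
```

This reproduces the vIRQ semantics as long as everyone plays by the rules, which is precisely the objection raised in the following comments.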

So why does the event mask have to live in the kernel?

blitz commented 11 years ago

@udosteinberg Yes, you can simulate what I have implemented (sans the recall part) with semaphores and shared memory. The difference is that you trust everyone to use this shared memory correctly. A malicious/buggy program can repeatedly set the event bitmask to zero. It also means that programs have to remember the bitmask/badge along with the semaphore object that they receive. Those virtual IRQs would not be self-contained. In my current implementation, all that is hidden and misuse of this facility is hard.

udosteinberg commented 11 years ago

You should have access to the shared event mask iff you also have access to the trigger semaphore. Then, having access to the shared event mask implies you have access to the trigger semaphore. Which is equivalent to having access to the vIRQ in your version. And once you have access to a vIRQ, you can abuse the event mask as well, though not directly. You can either do fake triggers, or you can constantly wait for all events, thereby stealing them from others.

I agree wrt. the self-contained part, but avenues for misuse are roughly the same in both versions.

blitz commented 11 years ago

(Multiple) Virtual IRQs are bound to one EC. Only that EC can wait for events (the wait operation is not given a specific vIRQ object to wait for). The events bitmask is stored in the handling EC's UTCB, so it is only visible in the respective protection domain. You cannot steal events when given a vIRQ, you can only trigger it.

udosteinberg commented 11 years ago

Ah right, vIRQ is used for n-to-1 communication whereas semaphores can be used n-to-m.

alex-ab commented 11 years ago

Unfortunately, the use of the n-to-m flexibility is restricted: at any given point in time, an EC can wait for only one of the n semaphores (in the current implementation). In my understanding, the main feature of @blitz's vIRQ approach is that an EC can wait for several of the n vIRQs in parallel.

udosteinberg commented 11 years ago

I was referring to the ability of having n ECs triggering and m ECs waiting on a semaphore, not having one EC waiting for m events.

nfeske commented 11 years ago

@udosteinberg Even though our current signalling API allows m ECs to wait for n signals, this feature has remained unused. The only use case I can think of is the distribution of work to multiple worker threads. However, the policy of how to distribute the work had better not be implied by the scheduling of the kernel; it should be a conscious decision at the server. Hence, I think that this use case is moot, too. But having the ability for one EC to wait for n signals is an important feature to have (in my opinion).

udosteinberg commented 11 years ago

@nfeske Another use case is synchronization among multiple ECs cross-core. We introduced semaphores a few years ago for the very purpose of supporting critical sections in user mode. Are we now saying we don't need that functionality?

@blitz If the signaling is always strictly n-to-1, is there any benefit to be gained from introducing a new kernel object? We might as well trigger the EC (that is bound to the vIRQ) directly.

Nils-TUD commented 11 years ago

@udosteinberg We need that functionality, but as @blitz showed with his Mutex implementation, it can be done with Vis as well. And since they have other advantages, like being able to wait for multiple event sources at a time, they might be the better approach. But as already pointed out, Vis have their downsides, too. A performance comparison would be nice, e.g., using Mutex instead of userspace Sms.

blitz commented 11 years ago

@udosteinberg If you signal EC objects directly, you lose the ability to distinguish events completely. It is analogous to the PT <-> EC dichotomy.

udosteinberg commented 11 years ago

@blitz Can you elaborate why we lose that ability? If you signal the EC through any of its vIRQs, then the relevant event bits are OR'd into the UTCB. I'm saying, you could just OR the bits directly into the UTCB without going through a vIRQ and the EC would have the same information at the end.

blitz commented 11 years ago

@udosteinberg If someone signals an EC directly without going through a vIRQ, he has to come up with the badge by himself and needs access to some shared memory where he can write this badge. If you think that is ok, you can also implement synchronous IPC by calling ECs directly (without going through portals) and mapping their UTCBs to all of their clients. :) You end up with the same information in either case, assuming that everyone behaves correctly.

blitz commented 11 years ago

@nfeske After reading your description of how signaling is currently implemented in Genode on NOVA, I can understand the desire for a different kernel mechanism. It sounds horrible and very slow. :) The question is whether the Fiasco.OC solution is the only correct one or if there is a simpler method to be found.

udosteinberg commented 11 years ago

@blitz You could have an interface like trigger (EC, event) where the hypervisor would atomically OR the event bits into the event field in the UTCB of the target EC. No need for establishing shared memory. Of course that doesn't allow you to limit what bits the triggering EC can set, but I'm not sure you need to.

blitz commented 11 years ago

@udosteinberg It still feels like a call(EC, portal-id) syscall instead of having a portal. Or is this analogy flawed?

udosteinberg commented 11 years ago

@blitz The portal contains crucial information, such as the entry EIP, which the calling party cannot, and for obvious security reasons must not, provide. It also contains the MTD, which the called party cannot provide.

The only similar thing in a vIRQ is the event field that can restrict what bits you can trigger for the target EC. And the open question is, whether you need to restrict that. After all, if you can trigger a single bit, you can unblock and/or recall the target EC. And the event field provides just a few extra bits of sideband information for the target EC to know where to look or not to look.

Besides, from the earlier discussion above, it appears that just a few bits of side-band info may not be enough. As far as I understood @nfeske, they would rather have (id, counter) tuples. And then suddenly things become a lot more complex. We would need to think about:

From a complexity and real-time perspective I would very much prefer a solution where you can just poke a few bits of sideband info into the target EC.

nfeske commented 11 years ago

@udosteinberg I agree on all of the points you stated. I was already a bit afraid that the locking aspect might raise a red flag. ;-)

@blitz Please don't mind the "requirements" from our side too much. Our current mechanism actually works quite well, and we do not have a pressing need to replace it. Performance-wise, we haven't found the signalling mechanism to be all that relevant. For example, our USB stack uses the signalling API pretty intensively (for all kinds of external events such as interrupts, client requests, etc.), so it is the poster child for our aspirations with regard to asynchronously working components. Still, the throughput of USB/networking on the Pandaboard matches the performance of native Linux at the same level of CPU utilization. Of course, a microbenchmark will tell a different story. But the value of such microbenchmarks tends to be questionable, except for quantifying the effectiveness of a particular optimization.

That said, if NOVA featured a mechanism that fits well with our API, we would love to use it. Also, the improvement of @blitz's solution over semaphores would certainly be beneficial for us (just not for the signalling API).

udosteinberg commented 11 years ago

@nfeske @blitz @Nils-TUD If you had the choice between cross-core (id, counter) signalling and cross-core synchronous IPC, which one would you prefer? The former is basically a fire-and-forget kind of message transfer, where the sender does not block and the latter would be similar to existing IPC (with map and all), but the sender must block until the receiver is ready (and you don't get priority inheritance of course).

And choice here means either-or, not both :)

blitz commented 11 years ago

@udosteinberg If there are only those choices, I would strongly prefer the asynchronous mechanism, because it would naturally supersede kernel semaphores. It is a bit too heavy-weight for my use-cases, but perhaps it is a reasonable compromise. AFAICS synchronous cross-core IPC does not help in either of the discussed scenarios.

@nfeske Networking over USB is not a very taxing benchmark, as it will probably be a Fast Ethernet adapter, which is not very fast by today's standards. ;) For full-sized frames, you do not even reach a packet rate of 10k pps. Although I would find it interesting to compare netperf -t UDP_RR (and TCP_RR/TCP_CRR) results between Genode and Linux. If you still don't see a difference, I'll concede your point that the signaling mechanism is not performance-critical.

blitz commented 11 years ago

@udosteinberg Btw, how would you implement the (badge/id, counter) cross-core signaling? The main problem to avoid is to receive the wrong kinds of notifications at the wrong places, i.e. waiting for a critical section to become available and then waking up for every received network packet. That is why my current implementation provides a way to mask uninteresting events.

Nils-TUD commented 11 years ago

@udosteinberg That is really difficult to answer because it depends on your problem. But I don't see how this question helps: we're not discussing whether to have synchronous or asynchronous communication between cores, but whether to introduce such vIRQs, how to implement them, and whether to drop semaphores then. And synchronous IPC between cores is a completely different mechanism that doesn't really help here, or am I missing something? So, IMHO we should compare it to semaphores and not to IPC.

nfeske commented 11 years ago

@blitz Please note that I was speaking of our experience with the Pandaboard. Here the USB/net benchmark is actually pretty intense. Even on native Linux, CPU utilization is almost 100% when the NIC gets saturated. We performed extensive benchmarks using netperf (native Linux versus L4Linux + USB/net driver running in a separate process). I plan to publish these experiences in the not-too-distant future. :-) That said, I won't go that far with drawing definite conclusions from it. It is just one usage scenario. My point was that I just don't think that every single mechanism must be blazingly fast. If a mechanism is rarely used but functionality-wise needed, I would definitely opt for having the mechanism rather than not. Using microbenchmarks to disqualify such a mechanism from being considered (or calling it "horrible and slow") looks a bit phony to me.

blitz commented 11 years ago

@nfeske What I meant is that while throughput might be perfectly ok with a complicated signaling facility, latency and per-packet overhead will probably suffer. Our experience so far with 10G NICs is that every form of per-packet overhead is to be avoided at almost all costs.

nfeske commented 11 years ago

@udosteinberg I agree with @Nils-TUD's sentiment as I desire synchronous cross-CPU IPC for reasons I deem unrelated to the current discussion.

udosteinberg commented 11 years ago

I was asking about cross-core IPC vs. signaling because I have been thinking about using the existing portal concept instead of adding a new kernel object. Core-local communication would continue to work as is, but you could then also bind a portal to a global EC (one with an SC). Then you can implement some form of portal "knocking", which signals the portal like a vIRQ object and caches the signal in the portal. A subsequent reply/wait by the global EC would fetch a pending signal and would otherwise block the EC. That fits nicely into the current model, but you need to define portal behavior cross-core. It's either synchronous IPC or asynchronous signaling, not both.

blitz commented 11 years ago

@udosteinberg Excluding local ECs from using the notification mechanism makes it a lot less appealing, because you would still need kernel semaphores to implement locks.

Nils-TUD commented 11 years ago

@udosteinberg Ah ok. In this case I would prefer asynchronous signaling, because it has the advantage that the one that triggers the signal does not need to trust the one that gets signaled. If you make it synchronous, i.e., the signaler is blocked until the signaled one has fetched the signal, you trust the signaled one to really fetch it. This is especially important if Sms are removed, because the Vi concept could be used very nicely for services that want to notify clients about events: the client gives the service multiple Vis, one for each event he is interested in, and the service triggers them as soon as the event occurs. But of course the service does not trust the client.

But apart from that, I don't see the advantage of using the portal concept for this. IMO it would be cleaner to introduce a new kernel object that has one specific purpose instead of "misusing" existing kernel objects. At least for me, it's not very intuitive to have a signaling mechanism provided by a portal.

udosteinberg commented 11 years ago

@blitz If you block a local EC on a semaphore or vIRQ, you are abusing your caller. I don't think that's a concept we should encourage.

Nils-TUD commented 11 years ago

Well, there are cases where it makes sense. Let's say you have a keyboard driver and you don't want to use shared memory, but instead create a local EC and a portal for each client. To wait for keypresses, the client simply calls the portal, and in the portal you block on a semaphore until a key has been pressed. Then you pass it back. So, although I won't say that this is the default case or that it is always a good thing to do, I don't think that we should prevent it.

blitz commented 11 years ago

@udosteinberg Certainly not encourage, but also not actively make impossible. ;) Seriously, mutexes are a fact of life in some situations. And having threads that just cannot block makes composing code complicated, because in every library function you have to know whether you may block or not. Not even the Linux kernel gets this right all the time...

udosteinberg commented 11 years ago

If you signal a local EC - one that doesn't have an SC bound to it - it won't be able to see the signal until the next request comes in. Surely that's a bit odd, no?

blitz commented 11 years ago

It depends on what you want to do. In the local-EC case, it will mostly be about synchronization: waiting for a mutex, waiting for a producer, sleeping in the HLT VM-exit handler, ...

udosteinberg commented 11 years ago

Semaphores would continue to exist. I'm talking about the ability to signal global ECs using portals.

Is there a compelling use case where you would want to signal a local EC - a use case where a semaphore wouldn't suffice?

blitz commented 11 years ago

If there is a compelling reason to signal global ECs, then there is one for local ECs. Otherwise signaling is useless in library code and the programming model is needlessly not orthogonal, especially if there is overlapping functionality with semaphores.

nfeske commented 11 years ago

@udosteinberg Generally I like the idea of invoking a signal handler through a portal. This way, a single EC could respond to both synchronous RPC requests and asynchronous signals. That said, I do not see this as a replacement for synchronous IPC, as the most important use case for synchronous IPC is the delegation of capabilities, which is not possible with asynchronous notifications. This line of thought is interesting, but I would prefer to discuss it outside of this particular thread because it distracts from the actual topic (replacing semaphores with an n-to-1 signalling capability).