dabeaz / curio


The eager read/write optimization leaves curio programs highly susceptible to event loop starvation #106

Closed njsmith closed 7 years ago

njsmith commented 7 years ago

There was some discussion in #83 about removing the "eager write" optimization that curio does, where await sock.send will attempt to synchronously write to the socket directly, and only yield to the event loop if that fails. It turned out that the original reason we were considering removing it didn't actually apply, plus it significantly slows down the echo server microbenchmark, so for now the optimization survived.

But... I've just realized that actually the eager read/write optimizations create a serious correctness issue, in that they can cause ordinary-looking code to starve out the event loop in realistic situations.

Here's a standard curio echo server, along with a particularly aggressive echo client. The test case is that we run two copies of the client and watch to see how well the server switches between servicing both of them. The server is instrumented to report how often it switches.
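The linked code isn't reproduced here, but the instrumented handler is roughly of this shape (a sketch only; the switch-tracking helper and the port number are illustrative, not the actual gist):

    import time
    from curio import run, tcp_server

    # Track which handler ran most recently and how long its streak has lasted.
    _current = None
    _streak = 0
    _streak_start = 0.0

    def _note_pass(me):
        # Called once per read/write pass; when a *different* handler finally
        # gets to run, report how long the previous handler's streak was.
        global _current, _streak, _streak_start
        now = time.monotonic()
        if _current is not me:
            if _current is not None and _streak > 1:
                print('Switched after %d loops (%.1f ms)'
                      % (_streak, (now - _streak_start) * 1000))
            _current, _streak, _streak_start = me, 0, now
        _streak += 1

    async def echo_handler(client, addr):
        print('Connection from', addr)
        me = object()
        async with client:
            while True:
                _note_pass(me)
                data = await client.recv(1000000)
                if not data:
                    break
                await client.sendall(data)

    if __name__ == '__main__':
        run(tcp_server('', 25000, echo_handler))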

The server output pretty regularly contains lines like:

Switched after 220 loops (100.7 ms)

What this is saying is that one of the echo_handlers managed to complete 220 passes through its read/write loop before yielding to the kernel, so all other tasks were starved out for 100.7 ms.

I even saw a

Switched after 340 loops (171.6 ms)

without trying particularly hard. And in theory the only limit on how high those numbers can go is that at some point some scheduling quirk or something on my laptop causes one of the clients to be slow to read/write and makes the server block; under the right conditions it could go arbitrarily high.

@glyph has some interesting comments on this optimization in Twisted, and how they removed it.

There's also an important difference between curio and twisted: in twisted, reads are only triggered by callbacks, so they always go through the event loop; it's only writes where you can even contemplate doing the eager-write optimization. So for them, this starvation problem is less likely to happen, because you can only hit it if you're doing an endless series of writes without any reads in between (and as glyph says, "/dev/urandom as a service has pretty limited utility"). In curio, we currently have both eager-read and eager-write optimizations, which makes the problem much easier to hit -- all kinds of realistic protocols are susceptible, starting with echo servers and proxies and moving on to things like an HTTP/1.1 server talking to a pipelining client.

I'm particularly worried about this because as you know I've been poking at HTTP server support for curio, and it's completely unacceptable for a fast client to totally lock up the server for unbounded amounts of time, even if it's a low-probability event :-(. (I.e., I don't think this is something where tweaking some heuristics will be enough; we need to be able to actually put a hard bound on how long a task can monopolize the kernel.) It can be worked around by manually inserting calls to curio.sleep(0) (now that that's fixed), but that's not at all a nice solution in general.

The obvious solution is to make sure to always unschedule-then-reschedule on reads and writes. I can't immediately think of any cleverer (= faster) solution that is guaranteed to actually work.

dabeaz commented 7 years ago

I think this needs more study. If all of this is running on a single machine, there are too many weird quirks of scheduling that could be in play regarding threads, processes, and everything else. I'm also not sure that a client hitting the server at maximum speed through the loopback interface represents a scenario that would happen for real, but then again, I honestly don't know.

I just know that performing all sorts of extra system calls and scheduling around I/O isn't free. So, all things equal, I don't want to unnecessarily cripple the performance of curio if I don't have to.

Just as a note: I've seen similar "starvation" effects simply programming with threads (no async). A lot of this may be tied up in the scheduler of the OS.

I'll have more to say about this later.... have to step away to pick up kids though.

njsmith commented 7 years ago

The theoretical analysis is pretty simple: if the peer is fast enough that every time we go to write, there is room in the send buffer, and every time we go to read, there is data waiting in the receive buffer, then our read/write loop will never relinquish control.

In this case, running on my laptop, the vagaries of scheduling mean that sometimes one of the clients will get delayed enough that it fails to keep the server's buffers primed, which is why we do eventually see the server switch tasks. [Edit for posterity: it turns out that the reason we eventually see the server switch is that the example server is using an unusually large N in its calls to recv(N) -- large enough that it's fully emptying the receive buffer each time. This sets up a race between when the server will call recv again versus when the client manages to get scheduled to send more data and re-fill the receive buffer; if the server wins the race then it will switch tasks. With a smaller N -- like realistic programs use -- the receive buffer isn't fully emptied on each call, and the server becomes totally livelocked servicing a single client indefinitely. See below.] But if anything I'm actually more worried about more realistic situations: if the client and server are on different machines, then that potentially lets the client be scheduled more smoothly. Ditto if the machines in question are more powerful, with cores to spare. And if the server is more complicated than a simple echo server, so there are longer delays between reads and writes while it does actual work, then that also makes the race condition easier to hit.

In the HTTP/1 server case, it would look like: (1) client sends a bunch of requests. These are short, and you can stack up quite a few of them in the server's receive buffer. (2) server takes them out one at a time, and for each one thinks about it a bit and then sends a longish response. (3) client reads each response as it comes, continuously draining the server's send buffer. So basically, as long as individual responses are small enough to fit in the send buffer (which is not hard -- Linux will auto-tune this buffer up to multiple megabytes on fast connections), and so long as the client can read and ack each response before the server has finished generating the next one, then this client will starve out other clients, potentially completely and indefinitely and deterministically.

glyph commented 7 years ago

When testing things like this, it does help to have a datacenter machine with a 10gigE card. You'd be surprised how fast those send buffers empty, especially without PyPy to speed your hot read/write loops along :-).

dabeaz commented 7 years ago

I understand the theory of it and agree that it might be a problem, but I'm not yet convinced that this would actually be a common situation--it seems that it would require the task in question to be pegged at 100% CPU load continuously and to never block ever. Given all the variances on a typical network, it seems like this would be pretty hard to achieve except in carefully controlled tests. And even if that did happen, it seems like you'd have a much more serious problem on your hands related to the scaling of whatever thing you were running.

I could be wrong though. I just don't know. How are other frameworks dealing with this? How do programs that use threads deal with this?

All things equal, I don't want to completely discount the performance overhead of constantly doing everything on the event loop by doing away with eager read/write. I could possibly envision some kind of hybrid approach that allows a certain small number of eager reads to take place first before blocking on the event loop. That might allow more normal workloads to take advantage of eager read/write while forcing the extreme case to more properly schedule itself.

dabeaz commented 7 years ago

As a followup, I'm looking at the code for gevent and unless I'm reading it wrong, it looks to me like it's doing eager recv/send in basically the exact same manner as curio. For example (in gevent/_socket3.py):

def recv(self, *args):
    while True:
        try:
            return _socket.socket.recv(self._sock, *args)
        except error as ex:
            if ex.args[0] != EWOULDBLOCK or self.timeout == 0.0:
                raise
        self._wait(self._read_event)

def send(self, data, flags=0, timeout=timeout_default):
    if timeout is timeout_default:
        timeout = self.timeout
    try:
        return _socket.socket.send(self._sock, data, flags)
    except error as ex:
        if ex.args[0] != EWOULDBLOCK or timeout == 0.0:
            raise
        self._wait(self._write_event)
        try:
            return _socket.socket.send(self._sock, data, flags)
        except error as ex2:
            if ex2.args[0] == EWOULDBLOCK:
                return 0
            raise

Maybe it suffers the exact same problem. However, gevent has certainly seen much more real-world use than curio. So, if there are known starvation problems there, that would be an interesting thing to know about.

Edit: Update: I did find this: http://stackoverflow.com/questions/13904114/why-is-gevent-sleep0-1-necessary-in-this-example-to-prevent-the-app-from-block

njsmith commented 7 years ago

To make the echo server example a little more realistic, I added a 1 ms busy loop to the send/receive loop, to simulate it doing a bit of on-CPU processing to each message (code). With this change, the output becomes:

~/src/curio$ python instrumented-echo-server-with-delay-loop.py
Connection from ('127.0.0.1', 58964)
Connection from ('127.0.0.1', 59144)
Switched after 8851 loops (12426.1 ms)
Switched after 1888 loops (2599.1 ms)
Switched after 5412 loops (7557.6 ms)
Switched after 25424 loops (35197.3 ms)
Switched after 175 loops (255.5 ms)
Switched after 35181 loops (49042.2 ms)
Switched after 15333 loops (20871.3 ms)
Switched after 25979 loops (35952.4 ms)
Switched after 8936 loops (12574.0 ms)
...
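(For reference, the 1 ms delay is just a plain CPU-burning spin along these lines; this is a reconstruction rather than the linked code:)

    import time

    def busy_wait(duration=0.001):
        # Burn CPU for `duration` seconds without ever yielding or blocking.
        deadline = time.perf_counter() + duration
        while time.perf_counter() < deadline:
            pass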

How are other frameworks dealing with this? How do programs that use threads deal with this?

With threads this can't happen because the scheduler interrupt will fire, the scheduler will notice that the eager thread has outlived its allocated time-slice, and it'll get pre-empted and bumped to the back of the line. Then the next time the eager thread is running, if some I/O suddenly arrives for another thread that's blocked on I/O, the eager thread will get pre-empted to let the I/O-bound thread run, because the I/O-bound thread hasn't been using up its timeslices so it gets preferred access to the CPU. I guess you already know all this, but maybe it's useful for those reading the thread later :-).

I'm not yet convinced that this would actually be a common situation

I don't actually find this very reassuring, because even if it's rare/probabilistic... if you give me a choice between a server that usually handles requests in 1 ms but sometimes has a 100 ms latency spike, versus one that always handles requests in 5 ms, then I think almost everyone will prefer the 5 ms server. To me it doesn't make sense to talk about average case latencies until after at least putting a bound on tail latencies.

All that's required to hit this is a program that is occasionally CPU-bound and occasionally has fast peers. That seems pretty realistic to me :-/

I don't want to completely discount the performance overhead of constantly doing everything on the event loop by doing away with eager read/write.

I don't either, but I guess my inclination would be to focus on getting the semantics right first. Premature optimization is the root of all evil... or specifically, spending energy on optimizing now seems like a poor investment compared to working on correctness, because correctness is what lets us start writing apps on top of curio, and apps on top of curio are what let us start nailing down the semantics, and are also what give us real benchmarks to replace this echo server stuff. And it's much easier to optimize a fixed set of semantics with real benchmarks than to optimize an unstable set of semantics without them.

From a quick check it looks like fixing this in the crudest way possible (adding a curio.sleep(0) next to every eager read/write) only costs about 15% in the echo server microbenchmark? We'll make that up easily later if/when we get serious about optimizing things (like, writing a libuv backend, or switching to pypy3, or...).

dabeaz commented 7 years ago

Maybe a different kind of question, but if you see a call like sock.recv() or sock.send() in a program, what expectations are you making about the underlying scheduling? Certainly if the call blocks, we know that the caller has to wait. If it doesn't block, what then? I'm not aware of any overall policy that mandates "fairness" of I/O operations in the operating system (that is, a requirement that says that it has to context switch on each recv() call). If anything, the host operating system would probably prefer to keep the running process scheduled until it actually blocks or it's preempted due to the expiration of its time slice.

I'm not suggesting that curio has to follow a similar approach. Maybe it doesn't. I don't know.

dabeaz commented 7 years ago

I still want to know what people are doing about this in other frameworks. Curio is not the only implementation that uses eager read/write. Truth be told, I wasn't even really thinking about optimization when that code was written--the implementation uses a pretty "standard" approach to non-blocking I/O that I've seen documented elsewhere. Namely, you try the operation first and if it causes an exception due to blocking, you suspend.

What happens if you run this echo benchmark on gevent?

njsmith commented 7 years ago

Maybe a different kind of question, but if you see a call like sock.recv() or sock.send() in a program, what expectations are you making about the underlying scheduling? Certainly if the call blocks, we know that the caller has to wait. If it doesn't block, what then? I'm not aware of any overall policy that mandates "fairness" of I/O operations in the operating system (that is, a requirement that says that it has to context switch on each recv() call).

I want to separate this into two pieces, conceptually: scheduling policy and pre-emption points. Definition: a pre-emption point is a place where the scheduler is invoked; the scheduling policy then decides whether the current task gets pre-empted, and if so then which task runs next.

For common operating systems, every instruction is effectively a pre-emption point, so send and recv aren't special in that regard. And their scheduler policies very much do enforce fairness -- they'll actually monitor how long different threads hold the CPU, and prioritize/deprioritize them accordingly.

For curio, we have a cooperative scheduler, so pre-emption points only occur at explicitly marked places. Users need to be aware of these, because it's their responsibility to ensure that a pre-emption point is executed on a fairly regular basis to avoid starving the event loop, and because they need to be aware of when they can be pre-empted for the usual concurrency management reasons.

So in curio, my expectation is that sock.send and sock.recv might not actually pre-empt, but that they are a pre-emption point, i.e. if the scheduling policy says that it's time to switch to someone else then that will actually happen. Right now, though, they're only probabilistically pre-emption points (say that 5 times fast!), which makes it more difficult to write correct programs -- you have to account for the possibility that you'll get rescheduled (so you need locking and all that), and everything will seem to work in simple cases, but you can't count on it actually happening, so then you have to remember to add more reliable pre-emption points on top of that.

[Edit to add: Right now in curio the distinction between pre-emption points and the scheduling policy is somewhat obscured, because the current scheduling policy is "every pre-emption point triggers a pre-emption and puts the task back at the end of the FIFO scheduling queue". But this could change, and the choice of scheduling policy is conceptually orthogonal to the eager read/write optimization. The problem with eager read/write is that it means that the scheduler never runs at all, so even a fancier scheduling policy would still be helpless.]

(It's funny that in the other thread right now I'm arguing that Event.set shouldn't be a pre-emption point while you're arguing that it should be, while here I'm arguing that sock.recv should be a pre-emption point and you're arguing that it shouldn't necessarily be. I guess the underlying disagreement is that I think it should be clear and predictable which calls are pre-emption points, and you aren't convinced this is necessary.)

njsmith commented 7 years ago

It occurs to me that curio could potentially get much fancier about its scheduler: e.g. it could spawn a worker thread to do the actual watching for I/O and calculate which task ought to be running at each time (taking into account an up-to-date view of I/O readiness, statistics on different tasks, etc.). And set a flag whenever a task has overstayed its welcome (i.e., when it would have been pre-empted in a non-cooperative system). And then when we hit a pre-emption point like sock.recv, all the main thread would have to do is quickly check this flag, and only yield if it's set.
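A rough sketch of that idea (none of these names exist in curio; the monitor thread and flag are purely illustrative):

    import threading, time

    class FairnessMonitor:
        """Background thread that flags tasks which have held the loop too long."""
        def __init__(self, max_slice=0.010):
            self.max_slice = max_slice
            self.must_yield = False      # checked cheaply at pre-emption points
            self._slice_start = time.monotonic()
            threading.Thread(target=self._watch, daemon=True).start()

        def task_switched(self):
            # Would be called whenever the kernel actually reschedules.
            self._slice_start = time.monotonic()
            self.must_yield = False

        def _watch(self):
            while True:
                time.sleep(self.max_slice / 2)
                if time.monotonic() - self._slice_start > self.max_slice:
                    self.must_yield = True

    # At a pre-emption point like sock.recv, the fast path would then be just:
    #     if monitor.must_yield:
    #         await _sleep(0, False)   # yield to the scheduler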

njsmith commented 7 years ago

Here's the gevent version of the instrumented echo server, with the 1 ms busy loop: https://gist.github.com/njsmith/6b0db29a4f5046b795396e78c763a651

Consistent with that stackoverflow link you found, it shows the same starvation behavior as curio, but even worse -- I've left it running for a while now with two clients, and the second client still hasn't received a single byte of data back. Currently it's at ~500,000 loops over ~9 minutes, and ~55 GiB of data transferred to the first client, without once yielding to the scheduler.

I guess what people do about this in the gevent case is that they post frustrated questions on stackoverflow... I feel like if your networking library can't be used without getting help from @minrk then maybe your networking library is too hard to use :-).

njsmith commented 7 years ago

On further investigation, it looks like the reason gevent seemed to perform worse than curio is that the gevent echo server was doing socket.recv(100000) while the curio echo server was doing client.recv(1000000) (i.e., gevent was reading 100 KB at a time, while curio was reading 1 MB at a time -- this is my fault for blindly copying them from examples/bench/ without looking at the details). If we drop the curio echo server down to using the more realistic 100 KB receive buffer, then it shows the same behavior as we saw with gevent, where one client hogs the event loop forever and never ever reschedules.

tl;dr: the curio results pasted above are unrealistically optimistic.

ZhukovAlexander commented 7 years ago

I could possibly envision some kind of hybrid approach that allows a certain small number of eager reads to take place first before blocking on the event loop. That might allow more normal workloads to take advantage of eager read/write while forcing the extreme case to more properly schedule itself.

First, I would agree that, given the low probability of this phenomenon occurring, the hybrid approach should work.

Second, I think this issue could be resolved at the application level. Let the user decide whether to use the eager r/w optimization or to guard against such a DoS. To do this, the user can explicitly tell the socket to receive at most some number of bytes, and then voluntarily call await sleep(0). We can even introduce a special sleep(0) trap alias, like await switch(). Similarly, when sending a big amount of data, the user can break it into reasonably sized pieces:

async def send(sock, data, chunk_size=1000):
    # Send `data` in bounded chunks, yielding to the kernel between chunks.
    totalsent = 0
    while totalsent < len(data):
        sent = await sock.send(data[totalsent:totalsent + chunk_size])
        totalsent = totalsent + sent
        await switch()

The only thing we should definitely do is stress the eager r/w behavior in the documentation, maybe even with an example solution to this corner case.
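For what it's worth, the switch() used above could be a trivial alias, assuming curio's existing sleep:

    from curio import sleep

    async def switch():
        """Voluntarily yield to the curio kernel (just sleep(0) by another name)."""
        await sleep(0)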

imrn commented 7 years ago

On 11/14/16, David Beazley notifications@github.com wrote:

I still want to know what people are doing about this in other frameworks. Curio is not the only implementation that uses eager read/write.

I doubt this is something Curio should solve. At the very least, it should never be eager for read().

If a length is supplied (i.e., read(10000)), Curio 'might' be eager if it would not block -- at least its eagerness is bounded then. But I'm not sure.

So I'm on the side of non-eagerness. A sufficiently 'eager' user might sidestep to a pure blocking sync thread, which is also possible with Curio.

If performance/throughput is the top priority, then you cannot beat blocking sync code anyway.

So please keep it stupid. This way we can be sure about what to expect.

dabeaz commented 7 years ago

After thinking about this, I'm going to keep the curio recv()/send() behavior as it is. A couple of reasons:

  1. The semantics are well defined. If I/O can be performed immediately, it is. If not, you block. That's it. There are no other hidden kinds of magic going on.
  2. These are the same semantics as in an operating system. A kernel is not going to make a context switch on a recv()/send() call if it can be done immediately right then. There's no reason to switch--if the process is running, the OS has already decided it has the highest priority. Nothing about making the I/O call itself changes the priority. It's not going to just switch to something else unless the I/O blocks. If a context switch does occur, it's going to be in response to an unrelated interrupt of some kind (timer, incoming packet, etc.).
  3. I don't think curio can solve all possible scheduling problems with all possible scenarios of I/O. Yes, being DoS'd on the loopback interface might be a theoretical issue. However, I do a lot with size-prefixed messages that involves code like this:

    header_data = await sock.recv(headersize)
    header = decode_header(header_data)
    payload = await sock.recv(header.msgsize)
    # process message
    await sock.sendall(response)

    In this code, I don't really want to pay a scheduling penalty on the 2nd recv() call. It's almost always going to run immediately because the data is there. I would much rather just have curio do the I/O in the most basic way. Leave scheduling out of it.

  4. The problem can already be solved at an application level using sleep(0) as already noted.

I want to make one other comment. The code that curio is using here is NOT some weird aggressive optimization. This is the basic non-blocking I/O pattern that appears in just about every tutorial on non-blocking I/O. You try the I/O operation first. If it fails with a blocking-error exception, you then fall back. That's how you do it.

I'm also not convinced that this is a real problem. I tried searching around for "gevent starvation" and didn't find much. Yes, there are occasional bizarre corner cases where people have seen that. However, a lot of eyeballs have looked at that project and it's doing things exactly the same way as curio. I'm just not seeing much of an issue surrounding this.

Note: I'd probably be in favor of a special switch() function or something to take the place of sleep(0)

imrn commented 7 years ago

The problem with sleep(0)/switch() is that it is like a gentlemen's agreement that has to be honored everywhere! Ugh.

But if nobody is greedy, then there is no need for an agreement.

dabeaz commented 7 years ago

If you're putting sleep(0) calls all over your code, you're probably doing something wrong. The test case and problem being discussed here is an extreme corner case--an extremely aggressive client that's hammering a server on the loopback interface and doing little else. That is simply not a common scenario. I'm not discounting a high-traffic client completely as a potential problem, but I'd like to see it discussed with a more real-world application (e.g., HTTP) before doing much more with it.

Yes, there are certain situations where inserting a sleep(0) call might be beneficial, but a lot (if not most) network applications aren't going to need it.

There are also other ways that sleep(0) calls could be hidden. For example, you could subclass the curio socket class and insert it somewhere in there.
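For example, something along these lines (a sketch only; FairSocket is a made-up name, and this assumes the curio.io.Socket wrapper class and curio.sleep behave as their names suggest):

    from curio import sleep
    from curio.io import Socket

    class FairSocket(Socket):
        # A socket wrapper that hides the sleep(0): yield to the kernel
        # around each successful I/O call so a fast peer can't monopolize it.
        async def recv(self, maxsize, flags=0):
            data = await super().recv(maxsize, flags)
            await sleep(0)          # force a trip through the scheduler
            return data

        async def send(self, data, flags=0):
            nsent = await super().send(data, flags)
            await sleep(0)
            return nsent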

In any case, I want to see curio used in more real-world settings before considering any possible change to I/O handling at the moment.

imrn commented 7 years ago

On 11/15/16, David Beazley notifications@github.com wrote:

If you're putting sleep(0) calls all over your code, you're probably doing something wrong.

Sure it is wrong. There are two cases I can name:

1) During computationally intensive code among async code. (This is understandable)

2) If some kind of 'eagerness' is brought into Curio, then you'll need to use it frequently. And we would not want that.

In any case, using sleep(0) can be forgotten and cannot be enforced. This results in bad style and bad outcomes... It's much better not to depend on it.

In any case, I want to see curio used in more real-world settings before considering any possible change to I/O handling at the moment.

I'm very content with it in my custom HTTP caching/blocking proxy which is under moderate to high load. I see no reason for a change.

njsmith commented 7 years ago

The test case and problem being discussed here is an extreme corner case--an extremely aggressive client that's hammering a server on the loopback interface and doing little else. That is simply not a common scenario. I'm not discounting a high-traffic client completely as a potential problem, but I'd like to see it discussed with a more real-world application (e.g., HTTP) before doing much more with it.

I can respect waiting to gather more information before making a decision, but I do strongly disagree with the characterization of this as some weird irrelevant corner case. Rather, I'd say that it's a nasty latent bug that will lurk in a large proportion of curio applications, and then when the moons align it'll cause surprising and hard to debug failures (including partial failures / weird performance issues). Not all curio applications, but even figuring out whether a particular piece of code is potentially susceptible to this issue is really difficult -- probably impossible in any moderately complex system. The whole reason I'm interested in curio is that (I think) it makes it much easier to write correct applications than asyncio does; the idea that we should accept some weird flakiness in order to get a speed increase that might not even be measurable in real life is pretty worrisome to me.

If this is really the behavior that you intend to standardize on, then I'm debating with myself whether I need to go back to my blog post about asyncio and curio -- specifically the part where I say "in asyncio the code looks right, but in fact you need to add this weird drain call to solve a subtle bug you probably won't notice until it's too late" -- and instead of saying "the curio code is correct" say "in curio the code looks right, but in fact you need to add this weird sleep(0) call to solve a subtle bug you probably won't notice until it's too late".

These are the same semantics as in an operating system. A kernel is not going to make a context switch on a recv()/send() call if it can done immediately right then.

Curio doesn't have to context switch on recv/send either, even if we do trap to the scheduler -- just write a better scheduler ;-).

Anyway, the difference from an operating system is: when writing curio code, users have to keep track of where the scheduler is being invoked, because otherwise their code will have concurrency bugs, and cancellation bugs, and fairness bugs. This is a problem that just doesn't exist when working with pre-emptive threads. So an important part of curio's public API is deciding which operations invoke the scheduler, and under what circumstances, and communicating that to users.

As you say, recv/send don't have to invoke the scheduler. But we have to have some rule about when the scheduler is invoked that we can teach users, and that they can remember, and easily apply when writing code. And so a really natural choice for that rule would be: (potentially) blocking operations always invoke the scheduler. It's not natural because it's an immutable law of nature; it's natural because users already expect that the scheduler might be invoked here, and have to write their code to prepare for that possibility, and because these operations generally are scattered around the code often enough to make a good set of scheduling points. So this way our rule is taking advantage of some nice pre-existing structure -- it "feels natural", which makes it easy to remember and apply.

dabeaz commented 7 years ago

I feel like there is a certain amount of catastrophizing going on in this bug report. The particular test here (hammering a server as fast as possible over loopback without even doing anything with the data) is simply not representative of reality. I think it's reasonable to wait for more evidence before taking action.

For example: running this same benchmark using a network connection between two machines. Or modifying the client to do some kind of work on receive. For example, what happens if the client does a Unicode decode() on all of its received data?

As an aside, when I was studying the GIL 7 years ago, I saw a lot of weird scheduling behavior simply due to the fact that multiple cores were being used. I contend that there could be all sorts of crazy things happening in this specific test because a) client/server are on the same machine, b) multiple cores, c) no real work other than I/O is being performed.

imrn commented 7 years ago

On 11/15/16, Nathaniel J. Smith notifications@github.com wrote:

"the curio code is correct" say "in curio the code looks right, but in fact you need to add this weird sleep(0) call to solve a subtle bug you probably won't notice until it's too late".

Never encountered such a case.

In fact, I'm very concerned that the 'solution' you've been demanding for a non-existent problem will make sleep(0) mandatory.

drain() on asyncio is nasty. I would not favor another reincarnation here.

dabeaz commented 7 years ago

As a note: I'd never want to see direct usage of sleep(0) as the "solution" to this. If I were solving this for real, it would disappear into the underlying socket class. Or maybe a subclass of socket.

ncoghlan commented 7 years ago

I don't have a strong opinion on curio's behaviour here, but just noting a GIL-related example that I think is analogous to @njsmith's concern: CPython consistently releases the GIL around potentially blocking IO operations, so Pythonistas are accustomed to sock.send()/recv() being reasonably good guarantees that a given thread isn't going to starve other threads of access to the GIL. This is done unconditionally, before making the potentially blocking OS-level call. The naive interpreter-level scheduler can still cause latency spikes when CPU-bound threads are mixed with IO-bound ones, but IO-bound threads will always play nice with each other.

However, I think your main real-world defence against the specific problem being raised here isn't anything to do with the typical relative speeds of clients and servers, but everything to do with typical round trip latencies for TCP packet acknowledgements, and that's where having both the hostile client and the server under attack being on the same system is misleading: instead of there being a ping delay between "data sent by server" and "server receives notification that the data was received by the client", those two events are nigh simultaneous. If I'm right, then putting Vaurien between the client and server with a stable 10+ ms delay in each direction may be enough to eliminate the starvation: http://vaurien.readthedocs.io/en/1.8/#using-vaurien-from-the-command-line

A UDP echo server wouldn't have that defence, but UDP servers aren't connection oriented either, so "client starvation" isn't really a thing that can happen the way it can with TCP.

minrk commented 7 years ago

Infinitely greedy send/recv has been a problem in some async uses of pyzmq, though the cases I'm aware of are generally described as "always publish x as fast as I can." This is because it's close to impossible for send to block in most cases. Starvation on read has been harder to accomplish in reality, and I don't believe I've ever had a report of it.

What I've been interested in for async zmq is a starvation limit - i.e. greedily send/recv up to a certain amount on a socket in a loop, but cap starvation. This could be a tuning parameter - high limit for maximum throughput on a given channel, but possible starvation of other channels during load. Low limit for maximum 'fairness', but tradeoff in maximum burst throughput. The measure could be either time or data.
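Applied to the echo handler being discussed, the idea would look roughly like this (a sketch with an arbitrary 10 ms budget, not pyzmq's actual code):

    import time
    from curio import sleep

    async def echo_handler(client, addr, max_greedy=0.010):
        # Be greedy: keep doing eager recv/send, but once the budget is spent,
        # yield to the kernel so other tasks get a turn.
        streak_start = time.monotonic()
        async with client:
            while True:
                data = await client.recv(100000)
                if not data:
                    break
                await client.sendall(data)
                if time.monotonic() - streak_start > max_greedy:
                    await sleep(0)                  # cap the starvation window
                    streak_start = time.monotonic()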

minrk commented 7 years ago

I took a stab at bounded greedy handling in pyzmq here. I set a default of 10 ms maximum starvation. In my case on pyzmq, I got roughly these performance penalties in a pyzmq async throughput benchmark:

For comparison, the introduction of greedy handling here provided a ~5x improvement in the same benchmark.

(note that these are benchmarks intended to stress the overhead of the Python implementation, not actual network behavior)

defnull commented 7 years ago

Aren't time-based deadlines quite expensive? Syscall and all? How about a simple counter? Be greedy for at most N times in a row, for example.

minrk commented 7 years ago

Aren't time based deadlines quite expensive? Syscall and all?

Yup, I think so! I plan to also take a stab at size-based limits, to compare in reality. Size-based limits are also probably simpler for plain sockets than they are for pyzmq.

njsmith commented 7 years ago

@dabeaz:

catastrophizing

The benchmark where you get complete starvation for minutes on end isn't really the case I'm worried about; it's just that I tried the most obvious simple benchmark and that's what happened :-).

What I'm really worried about is things like, a proxy server where a high bandwidth client happens by chance to send a large burst of data at the right time to starve out all other connections for 10s or 100s of milliseconds, creating a "bursty" latency profile that creates a low perceived quality-of-service. Or the HTTP/1 server I described above, where a high-performance pipelining client could starve out other connections. (Usually we... like deploying high-performance programs?) Or a gRPC server: gRPC uses HTTP/2 to multiplex multiple RPC calls over a single connection, which can create conditions similar to the pipelining case. I'm imagining after-action reports like:

We've now completed our analysis of the outage this weekend. In our system, front-end servers receive requests, and then issue a series of RPC calls to a layer of back-end servers, which in turn issue queries against the database layer. Two weeks ago, we deployed new caching logic in the back-end servers; when the cache hits (~90% of the time) this allows the back-end to respond to RPCs without accessing the database. This improved average response latency, but there was an unexpected interaction with the "eager read/write optimization" in the back-end's networking library: previously, between each RPC request and RPC response, the synchronous database access forced the RPC handler to be rescheduled, which in turn enforced fairness across different RPC connections. Now, in cases where we hit the cache, the response can be computed entirely on-CPU without rescheduling; if multiple such requests come in multiplexed on the same connection, then we may find ourselves ignoring requests on other connections until we've completed sending all of the cache hit responses. Normally, the only effect this has is to create a somewhat spikier latency distribution. But early Sunday morning, the following conditions lined up: a front-end server issued 64 parallel RPC requests; by chance, our sharding algorithm assigned 23 of these to the same back-end server; by chance, 21 in a row of those 23 requests were cache hits. The back-end server processed all 21 of those requests without yielding to the event loop, making it unresponsive to other requests for long enough to trip our watchdog timer and remove that back-end from the service pool. This increased the load on the other back-end servers, which had the effect of somewhat increasing the average latency across the board. This wouldn't have been a problem -- we maintain some excess capacity exactly to handle situations like this -- except that the higher baseline latency combined with our newly-spiky latency distribution increased the probability that one of the remaining servers would have an "unlucky enough" series of requests to trigger their watchdog timer. At T+42 minutes, this occurred on a second server, thus further increasing the load on the remaining servers, which in turn created a further increase in baseline latency, and we spiraled down in a cascading failure; complete service outage occurred at T+63 minutes.

Does that seem like unrealistic catastrophizing? It sounds pretty typical to me...

I think it's reasonable to wait for more evidence before taking action.

I don't disagree with this. I'm just a little baffled that so far the evidence in favor of the eager read/write optimization is a 15% speedup on one probably-irrelevant microbenchmark (a gap we could reduce even more by optimizing the scheduling), yet everyone else seems to think that this is probably worth it. I guess that when it comes to distributed systems prone to emergent complexity, I'm just a "better safe than sorry" kind of guy :-). All I'm asking is that eventually, the public API should guarantee that a simple send/recv loop has some kind of bound on its worst-case unfairness.

@ncoghlan

However, I think your main real-world defence against the specific problem being raised here isn't anything to do with the typical relative speeds of clients and servers, but everything to do with typical round trip latencies for TCP packet acknowledgements, and that's where having both the hostile client and the server under attack being on the same system is misleading: instead of there being a ping delay between "data sent by server" and "server receives notification that the data was received by the client", those two events are nigh simultaneous

(1) There are lots of network apps running in data centers where the typical RTT is sub-millisecond. (2) The RTT is actually almost irrelevant to all of the examples I gave above. If you have a send/recv loop that's implementing a classic request/reply protocol where I send something, and then there's nothing for me to recv until after you've received my message and sent a response, then almost any RTT is enough to avoid these issues. That's why I'm talking about situations like proxies and pipelining and multiplexing, where send and recv can happen independently :-). You suggest that even in this case the slowness of TCP packet acknowledgements might save us -- but the only way this will help is if the TCP packet acknowledgements are so slow in coming that they make the send buffer become completely full, or the receive buffer completely empty. But if this happened it would be a wasteful stall! The whole goal of TCP's flow control mechanisms is to try to prevent this from ever happening -- and in particular, all modern TCP stacks will estimate RTT and try to increase the sizes of these buffers to make sure that ACKs have enough time to arrive before this happens.

Note in particular that in the echo server example, RTT is basically irrelevant. What is relevant is that a client is able to push enough data to saturate the server's CPU. If you have a well-functioning network, then a client should be able to push that from another machine, with or without a 10 ms RTT, etc. Of course normally you wouldn't want to run a server at 100% CPU usage in its steady state -- but intermittent/temporary CPU saturation seems normal and even desirable. (If you have work to do, you want to get it done as fast as possible!)

A UDP echo server wouldn't have that defence, but UDP servers aren't connection oriented either, so "client starvation" isn't really a thing that can happen the way it can with TCP.

But a realistic UDP server probably does have other background tasks running on the same event loop that could get starved: e.g. DNS servers serve UDP and TCP simultaneously, or you might have a background task that does some housekeeping on a timer tick.

@defnull

Aren't time based deadlines quite expensive? Syscall and all?

Please don't jump to conclusions about relative performance like this... it tends to really confuse matters.

FWIW on my laptop, it looks like incrementing a Python-level global integer costs 60 ns, and calling time.time() costs 90 ns. If I stash time.time in a local variable, t = time.time, then calling t costs 55 ns:

In [4]: i = 0

In [5]: %timeit global i; i += 1
10000000 loops, best of 3: 59.7 ns per loop

In [6]: %timeit time.time()
10000000 loops, best of 3: 88.7 ns per loop

In [7]: t = time.time

In [8]: %timeit t()
10000000 loops, best of 3: 54.6 ns per loop

So apparently time based deadlines are cheaper than counters! Maybe. To really know you'd have to implement and optimize both. And then you'd need to weigh the absolute costs against other considerations like maintainability, reliability, etc. Personally I find it weird to be making any kind of decision at all about public API semantics based on comparing little 10-20% speedups/slowdowns.

ncoghlan commented 7 years ago

@njsmith Thanks for the clarification - I was thinking in terms of the "frontend server talking to client devices" case rather than the "backend server talking to frontend servers" case, and I agree that in the latter case loopback testing is more likely to be representative of real network conditions.

imrn commented 7 years ago

I think I'll go back to the very basics again. This is my perspective, and I hope it helps decide where to draw the line.

await recv/read(), await recv/read(1000), and await send/write()

should not do anything more than their blocking counterparts. And they should return ONLY AFTER REALLY COMPLETING the operation.

And the kernel should only do just enough to satisfy these calls. There shouldn't be any "if there is some data, let's get it" attitude.

dabeaz commented 7 years ago
So for Curio the moral of the story is

await recv/read(), await recv/read(1000), and await send/write()

should not do anything more than their blocking counterparts. And they should return ONLY AFTER REALLY COMPLETING the operation.

The current version of Curio already provides these semantics.

How does a recv() call work with threads? If data is available, it is returned immediately. If not, the calling thread is blocked by the OS kernel and a new thread is scheduled.

How does a recv() call work with Curio? If data is available, it is returned immediately. If not, the calling task is blocked by the Curio kernel and a new task is scheduled.

The only difference between the two concerns who is responsible for blocking the caller when no data is available. Otherwise, the semantics are the same--as currently implemented in Curio. I don't get this "ONLY AFTER REALLY COMPLETING" business. If curio tries a recv() call and it runs without an exception, then the operation has, in fact, completed. It's not an "optimization." It's not a "hack." There is nothing to see here.

The issue at hand in this report is whether or not Curio should additionally invoke the scheduler in all I/O operations. I don't think it should. And neither should anyone else who wants the semantics to be exactly the same as blocking I/O in synchronous code.

dabeaz commented 7 years ago

As the Curio BDFL, I've given a lot of thought to this issue. I do not intend to change the current behavior of curio with regard to this. There are a few reasons:

  1. I think low-level I/O functions should be solely focused on I/O and nothing else. Yes, these operations might complete immediately or they might block. That's fine. However, putting extra scheduling in there ends up being a side-effect. I don't like functions with complex side-effects that can't be controlled.
  2. I don't think there is a "one size fits all" solution to task scheduling. What works well for one use case might not work well for another. I don't think Curio needs to solve all of these problems. Instead, it should focus more on making tools available to solve these problems if dictated by the problem at hand.
  3. It is easier to add complex functionality to simple functions than it is to take away functionality from complex functions. Curio already provides a mechanism to invoke the scheduler (sleep(0)). That mechanism can be incorporated in all sorts of ways. Sure, code could call it directly, but you could subclass various Curio objects and add scheduling calls in there if you wanted. This is not a difficult concept. You could also make custom objects that do exactly what you want. Most task-level code in Curio is pretty simple, to be honest.
  4. Curio is an expert tool for expert developers solving a hard problem (concurrency, network programming, etc.). I don't want to put a bunch of non-removable training wheels on it. Yes, you can shoot yourself in the foot. Be careful then.
  5. Last, but not least, premature caution may be as bad as premature optimization. Yes, there are some bad things that can apparently happen in theory. I'd much prefer to hear about bad things happening in actual production code, though. More evidence is needed. As an aside, one of the reasons I asked about gevent is that I consider it to be most like Curio in terms of its internal operation. If there are known problems there, they'd likely be problems in Curio too.

imrn commented 7 years ago

On 11/16/16, David Beazley notifications@github.com wrote:

The current version of Curio already provides these semantics. I don't get this "ONLY AFTER REALLY COMPLETING" business.

It's for write() as described for asyncio. You already mentioned it is the case for Curio. No problems.

The issue at hand in this report is whether or not Curio should additionally invoke the scheduler in all I/O operations. I don't think it should. And neither should anyone else who wants the semantics to be exactly the same as blocking I/O in synchronous code.

Agreed.

glyph commented 7 years ago

How does a recv() call work with threads? If data is available, it is returned immediately. If not, the calling thread is blocked by the OS kernel and a new thread is scheduled.

How does a recv() call work with Curio? If data is available, it is returned immediately. If not, the calling task is blocked by the Curio kernel and a new task is scheduled.

This comparison doesn't really hold up.

First, that's not how network I/O scheduling works on real kernels. Once your thread has context switched into the kernel, it might get scheduled out if too many other threads are ready to run, whether a result is available from your particular syscall or not. This can happen as a result of a call to recv or send or select, but it might also happen as a result of a timer interrupt. The kernel scheduler is invoked very regularly.

By contrast, in a cooperatively-scheduled framework like Curio's, user-space scheduling is all that we have, so being more careful about scheduling decisions and taking the opportunity of a recv or send call seems reasonable, since it may be all you get.

I should point out that interactive performance problems will happen anyway - see for example this Stack Overflow question about interactive performance on Twisted's SSH server becoming untenable when one user is hogging all the bandwidth - http://stackoverflow.com/questions/40601119/python-ssh-server-twisted-conch-takes-up-high-cpu-usage-when-a-large-number-of/40627446 - the ability to do scheduling at all is only the first step to doing some kind of fair scheduling.

imrn commented 7 years ago

On 11/16/16, Glyph notifications@github.com wrote:

taking the opportunity of a recv or send call seems reasonable, since it may be all you get.

I guess you mean that opportunity should be used to protect the server from a hogging client (in fact, providing a fair share for other clients). Correct?

Then are we talking about a hardened scheduler? Or a hardened server app calling sleep(0) at appropriate points?

glyph commented 7 years ago

I guess you mean that opportunity should be used to protect the server from a hogging client (in fact, providing a fair share for other clients). Correct?

Or a client from a hogging server, of course :).

Then are we talking about a hardened scheduler? Or a hardened server app calling sleep(0) at appropriate points?

A "hardened" app can't control what the libraries that it's integrating does. This kind of tuning needs to be done in the event loop core, not sprinkled haphazardly through application code.

That said, it just needs to be tunable in the core somehow; although I believe it should be the default, it doesn't need to be the default, as long as you can swap in another eagerness-optimization provider.

imrn commented 7 years ago

On 11/16/16, Glyph notifications@github.com wrote:

I guess you mean that opportunity should be used to protect the server from a hogging client.

Or a client from a hogging server, of course :).

A "hardened" app can't control what the libraries that it's integrating does. This kind of tuning needs to be done in the event loop core, not sprinkled haphazardly through application code.

Then we are talking about a pluggable scheduler scheme. I'm fine with the current non-hardened one, which Dave sees as a sufficient default.

Others may choose a custom one that fits their use case. As for the principle, is this the end of this thread?

Any further closing remarks?

dabeaz commented 7 years ago

I was looking at some of the code and got to thinking. Consider the current implementation of recv():

async def recv(self, maxsize, flags=0):
    while True:
        try:
            return self._socket_recv(maxsize, flags)
        except WantRead:
            await _read_wait(self._fileno)
        except WantWrite:
            await _write_wait(self._fileno)

The code is "eager", but if you modified it to wait for reading first (i.e., calling _read_wait()), I'm not sure it would work at all. Protocols such as SSL sometimes require sending data before performing a receive. That's why the WantWrite exception and _write_wait() call are there. If you did a _read_wait() on the socket when it actually needed to write, the whole thing might deadlock. The only way to know that it wants to write is to try the recv() call first.

njsmith commented 7 years ago

@dabeaz

Curio is an expert tool for expert developers solving a hard problem (concurrency, network programming, etc.).

I think this might be at the core of why we're at such cross-purposes here... from conversations so far, I get the impression that the experts you have in mind are folks (like yourself?) whose main interest is in messing about with the guts of event loops and experimenting with all the little details?

OTOH, when I think of expert developers, the image in my mind -- the heroes I try to emulate -- are people with brains the size of planets who spend that energy on building complex systems out of simple, straightforward code that's obviously correct, and who appreciate APIs that make that easier to do.

Right now, the only way to write an obviously correct I/O loop in curio is to put an await curio.sleep(0) in it. The emphasis here is on the word "obvious": a loop that doesn't have sleep(0) might be correct, but the only way to know is to do a detailed analysis of the surrounding code, client, expected deployment conditions, etc. If I were writing a code review checklist for a project using curio, one of the entries I'd have to put in is: "For each I/O loop: is there either (a) a call to curio.sleep(0) or (b) an analysis of why this is unnecessary?" Not the kind of thing you need when you're trying to build something complex and curio is just the boring networking layer.
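Concretely, the "obviously correct" loop shape I mean is something like this (a sketch):

    import curio

    async def echo_loop(client):
        while True:
            data = await client.recv(100000)
            if not data:
                break
            await client.sendall(data)
            await curio.sleep(0)   # explicit, guaranteed pre-emption point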

Also, it should hopefully go without saying, but just in case: these are both totally valid approaches! Just... it's not clear how compatible they are. I guess we will probably continue to push-and-pull on this going forward.

@glyph

the ability to do scheduling at all is only the first step to doing some kind of fair scheduling.

I actually went down the rabbit hole reading about fair scheduling last night, and it looks like it'd actually be super easy to implement in curio :-). I'm trying to resist though, since there's really no use until we have more complex apps to play with... but yeah, even a smart scheduler can't do anything unless the tasks have schedule points in them.

@dabeaz

The code is "eager", but if you modified it to wait for reading first (i.e., calling _read_wait()), I'm not sure it would work at all.

Heh, good point! As a technical matter, though, it's easily solvable -- my suggestion is that recv should guarantee that it acts as a schedule point, not that it necessarily calls _read_wait(). So e.g. this would work fine:

async def recv(self, maxsize, flags=0):
    while True:
        try:
            result = self._socket_recv(maxsize, flags)
            await _sleep(0, False)
            return result
        except WantRead:
            await _read_wait(self._fileno)
        except WantWrite:
            await _write_wait(self._fileno)

One could (should) make that prettier and smarter, but it accomplishes the basic requirement.

imrn commented 7 years ago

On 11/17/16, Nathaniel J. Smith notifications@github.com wrote:

    async def recv(self, maxsize, flags=0):
        while True:
            try:
                result = self._socket_recv(maxsize, flags)
                await _sleep(0, False)
                return result
            except WantRead:
                await _read_wait(self._fileno)
            except WantWrite:
                await _write_wait(self._fileno)

One could (should) make that prettier and smarter, but it accomplishes the basic requirement.

What's the performance cost of this on your hogging setup? We may not want to pay for what we do not intend to protect against.

njsmith commented 7 years ago

What's performance cost of this

On the echo server microbenchmark in examples/bench/curioecho.py, adding that _sleep to both recv and sendall gives about a 15% slowdown in overall requests/second. That's for the crudest version like above; obviously one could (should) optimize this further.

yarko commented 7 years ago

Just a quick observation, as I've started to read (and then scanned) through this thread at lunch.

1) async/await is (to my mind) cooperative multitasking.

As such, I read this thread as if cooperative and preemptive multitasking are being conflated.

2) Nick's example of I/O blocking behavior calls to mind the "context" lib of golang (i.e., group timeout - read that as cooperative - behavior, a tool for the convenience of the intentional programmer).

3) Nathaniel's concerns seem to be around externally controlling a misbehaving cooperative task... I suggest clearly separating the two (cooperative and preemptive) and discussing and addressing them separately, distinctly.

4) Because of this perspective, I find myself siding squarely with Dave (keep it as is) and Imran: curio should not solve this; it should rightfully be left in the hands of a competent cooperative-multitasking developer to handle as appropriate.

5) Because of Nick's notes, I think notes for managing inadvertently triggering what Nathaniel is concerned with should be handled by appropriate documentation ("this is cooperative multitasking, and here's what that implies...") and perhaps by adding convenience libraries/classes for programmers (or letting others add them), which is to say I fall squarely with Dave and Imran (and, through some cognitive translation, I imagine, Nick).

Caveat - I spent no more than lunchtime to scan this (but my impression throughout has been consistent, so that's why I'm sharing).

Cheers, Yarko

imrn commented 7 years ago

On 11/17/16, yarko notifications@github.com wrote:

5) Because of Nick's notes, I think notes for managing inadvertently triggering what Nathaniel is concerned with should be handled by appropriate documentation ("this is cooperative multitasking, and here's what that implies...") and perhaps by adding convenience libraries/classes for programmers (or letting others add them), which is to say I fall squarely with Dave and Imran (and, through some cognitive translation, I imagine, Nick).

  • Being flooded by a hogging - possibly hostile - adversary is a concern. (We don't like being listed in security bulletins. Do we?)
  • But imposing the cost of prevention on all Curio users is another concern.

As you point out, properly documenting the tradeoff and providing hooks for a solution for those interested may be the best option I can think of.

Depending on the use case, possible levels of solutions are:

ncoghlan commented 7 years ago

I'm actually slightly on @njsmith's side, as the downsides here are asymmetric (a constant small performance cost which may be tuned as low as you like via a "yield to the event loop at least once every N IO operations based on a thread-local counter" mechanism vs "occasionally monopolise the CPU").
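A minimal sketch of that counter mechanism (the names and the choice of N are illustrative, and a module-level counter stands in for the thread-local one):

    from curio import sleep

    _MAX_EAGER = 32     # N: how many eager operations before a forced yield
    _eager_count = 0

    async def _maybe_yield():
        # Would be called after each eager (non-blocking) I/O completion.
        global _eager_count
        _eager_count += 1
        if _eager_count >= _MAX_EAGER:
            _eager_count = 0
            await sleep(0)      # hand control back to the event loop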

I see the status quo in curio as somewhat similar to the old bytecode-counting based switching model for CPython's GIL - it was mostly OK, but would sometimes go horribly wrong if you were calling a lot of computationally intensive C code that held on to the GIL without calling back in to Python code. People's solution to that was sticking time.sleep(1e-9) at various points in their code as explicit yield points, but switching out the bytecode-counting based pre-emption for time based pre-emption was still a good idea.

imrn commented 7 years ago

On 11/18/16, Nick Coghlan notifications@github.com wrote:

I see the status quo in curio as somewhat similar to the old bytecode-counting based switching model for CPython's GIL - it was mostly OK, but would sometimes go horribly wrong if you were calling a lot of computationally intensive C code that held on to the GIL without calling back in to Python code. People's solution to that was sticking time.sleep(0) at various points in their code as explicit yield points, but switching out the bytecode-counting based pre-emption for time based pre-emption was still a good idea.

It's a solution from the same school as "Garbage Collection". There is enough well-understood Computer Science literature about "Garbage Collection", so I'll not start more of it here.

However, you may guess what my stance will be on a Python-specific solution, being a person always eyeing a compiled language with enough batteries.

On the other hand, I'm impressed by this thread turning into a discussion with many elegant contributors touching many aspects of Computer Science, with such possible headings:

The good news for Curio is that a problem touching so many aspects of Computer Science has a solution close to a "one-liner". All the noise we are making here is about "where to put it?".

Anyway, the info in this thread should definitely crawl into the documentation under "Advanced Topics", augmenting the "Documentation/Code" ratio, which is one of the best even today.

njsmith commented 7 years ago

It's probably worth pointing out that this is part of a larger discussion about what sorts of guarantees different curio operations should make or not make around scheduling and cancellation and similar. See #101 for more of that broader discussion.

dabeaz commented 7 years ago

I'm closing this for now. Currently Curio runs all tasks that are ready after an I/O poll. After that, it goes back to polling regardless of whether or not any of those tasks have gone back on the ready queue. This largely seems to address the issue.
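A toy, runnable illustration of why that scheduling shape bounds starvation (this is not curio's actual kernel code, just a sketch of the idea): each pass runs only the tasks that were ready when the pass started, so a task that immediately re-queues itself still waits its turn.

    from collections import deque

    def run_one_pass(ready):
        # Snapshot the queue length: tasks that re-queue themselves during
        # this pass are not run again until the next pass (i.e., after the
        # next I/O poll in the real kernel).
        for _ in range(len(ready)):
            task = ready.popleft()
            if task():              # task returns True if it wants to run again
                ready.append(task)

    def make_task(name, counts):
        def task():
            counts[name] = counts.get(name, 0) + 1
            return True             # always immediately "ready" again
        return task

    counts = {}
    ready = deque([make_task('a', counts), make_task('b', counts)])
    for _ in range(5):
        run_one_pass(ready)
    print(counts)                   # {'a': 5, 'b': 5}: neither task starves the other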