Pylons / waitress

Waitress - A WSGI server for Python 3
https://docs.pylonsproject.org/projects/waitress/en/latest/

Notify client close #310

Closed viktordick closed 4 years ago

viktordick commented 4 years ago

Implementation of the idea discussed in #308. The tests don't pass yet and I should also implement additional tests for this, but hopefully this helps with the discussion.

viktordick commented 4 years ago

@bertjwregeer Is there a chance that this PR will be accepted if I fix the tests and add new ones for the new feature? Or do you think our request is too exotic and the added complexity should be avoided?

viktordick commented 4 years ago

@bertjwregeer Thanks for the review.

Regarding 3: I guess there are three possibilities:

A) The application inserts a callable into the environment that waitress then calls (in the main thread, not in the worker thread!).
B) Waitress inserts a flag into the environment once the connection is closed and the application checks for it.
C) Waitress inserts a callable into the environment that the application can use to check for client disconnection.

I was thinking that A) would be easiest to deal with on the application side, especially in a situation where we have spawned another process or are waiting for an SQL query to complete. When triggered, we could then terminate the spawned process or the database worker process.

However, this has the disadvantage mentioned above that the handler interrupts waitress's main thread. And maybe it also adds unnecessary complexity (maybe the additional lock was only necessary because of this approach and is not required with C, but I am not sure yet). I will think about this a bit more.

Regarding 1: I think the additional lock is necessary because handle_read runs in the main thread while service runs in the task thread; before this change they were never executed simultaneously for the same channel (while a request was being processed, readable always returned False). I will think about this a bit more - maybe switching to strategy C makes it unnecessary. But I do not think it depends on the strategy: we simply have one thread that may create new requests (the main thread) and one that processes them, and they may now run simultaneously, which they did not before.

Regarding 2: I was not sure how to implement a flag that can be set via PasteDeploy, so in this draft I used the global variable max_requests_per_channel, which should restore the default behaviour when set to 0 (the name should probably be changed to something like max_lookahead_requests). If I understand correctly, I can expose this as one of the parameters to waitress.serve; with a default value of 0 there, the default behaviour is preserved. I will try it.
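For illustration, a rough sketch of what passing such a setting to waitress.serve could look like (using the adjustment name channel_request_lookahead that shows up later in this thread; in this draft it is still called max_requests_per_channel, and 0 keeps the original behaviour):

```python
# Hedged sketch: enabling the request lookahead when starting waitress.
# The parameter name channel_request_lookahead is the one that appears later
# in this thread; the default of 0 preserves the original behaviour.
from waitress import serve

def app(environ, start_response):
    # trivial WSGI application used only to make the example runnable
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

# A value > 0 lets waitress keep reading from the socket while a request is
# being processed, which is what makes client-disconnect detection possible.
serve(app, host="127.0.0.1", port=8080, channel_request_lookahead=5)
```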

viktordick commented 4 years ago

Addendum: Regarding the difference between strategies B and C, I guess with C we do not need to make sure we catch the environment of the currently running task as well as any queued tasks. However, we probably do not want to start any more tasks once the client has disconnected anyway, do we? In any case, C seems superior to B. (C was also the strategy that @mmerickel suggested.)
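To make strategy C concrete from the application side, here is a rough sketch (the environ key name and the polling interval are assumptions, not part of this PR's code) of how a handler could stop work it has spawned once the client is gone - the use case mentioned above:

```python
# Sketch of strategy C as seen by the application: poll a callable from the
# environ and terminate a spawned process if the client has disconnected.
# The key name "waitress.client_disconnected" is an assumption here.
import subprocess
import time

def app(environ, start_response):
    disconnected = environ.get("waitress.client_disconnected", lambda: False)
    proc = subprocess.Popen(["sleep", "60"])  # stand-in for the real work
    try:
        while proc.poll() is None:
            if disconnected():
                proc.terminate()   # client is gone, stop the work early
                proc.wait()
                break
            time.sleep(0.5)        # poll: it is a check, not a callback
    finally:
        if proc.poll() is None:
            proc.kill()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done\n"]
```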

viktordick commented 4 years ago

@bertjwregeer I tried to implement the requested changes. There is still a lot to be done regarding tests and documentation, but is the general idea now something that might be accepted?

Now that I looked over it again, the additional lock I had in the original attempt was due to race conditions with self.current_task, which was needed in order to implement strategy A as mentioned above. It is no longer needed with C (the remaining race condition is between received() adding requests to be processed and service() removing them, but self.requests.extend(requests) is already protected by the GIL).
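The GIL argument can be illustrated with a toy example (not waitress code): one thread extends a plain list while another drains it, and in CPython each individual list.extend / list.pop call runs as a single C-level operation, so these two operations by themselves need no extra lock:

```python
# Toy producer/consumer on a plain list, relying on CPython's GIL making each
# list.extend(<list>) and list.pop(0) call atomic. Not waitress code.
import threading

requests = []                # shared between the two threads
done = threading.Event()

def main_thread():           # plays the role of received(): adds requests
    for batch in ([1, 2], [3], [4, 5, 6]):
        requests.extend(batch)      # one atomic list operation per batch
    done.set()

def task_thread(out):        # plays the role of service(): removes requests
    while not done.is_set() or requests:
        try:
            out.append(requests.pop(0))   # atomic pop of the oldest entry
        except IndexError:
            pass                          # nothing queued at the moment

processed = []
producer = threading.Thread(target=main_thread)
consumer = threading.Thread(target=task_thread, args=(processed,))
producer.start()
consumer.start()
producer.join()
consumer.join()
assert processed == [1, 2, 3, 4, 5, 6]
```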

viktordick commented 4 years ago

@bertjwregeer I changed the implementation so service() only processes one request and re-inserts itself using add_task if there are more requests. This requires a lock to cleanly communicate whether the main thread or the task thread should call add_task, and to prevent race conditions. However, I am pretty sure I can re-use the task_lock for this. It is no longer necessary to guard the complete service method with this lock: by re-inserting the next task only at the very end of the function, it is ensured that no two requests of the same channel are processed in the wrong order or in parallel.

However, this makes the diff against master rather large, since most of the body of service() changed indentation level.
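Schematically, the pattern looks like this (a sketch, not waitress's actual code; the dispatcher object and the execute method are placeholders):

```python
# Sketch of the "process one request, then re-queue at the very end" pattern.
# Names such as dispatcher and execute are placeholders, not waitress's API.
import threading

class ChannelSketch:
    def __init__(self, dispatcher):
        self.task_lock = threading.Lock()
        self.requests = []            # filled by the main thread
        self.dispatcher = dispatcher  # provides add_task(channel)

    def received(self, new_requests):
        # Main thread: schedule the channel only if no task is queued/running.
        with self.task_lock:
            had_work = bool(self.requests)
            self.requests.extend(new_requests)
            if not had_work and self.requests:
                self.dispatcher.add_task(self)

    def service(self):
        # Task thread: handle exactly one request ...
        request = self.requests[0]
        self.execute(request)         # placeholder: run the app, flush output
        # ... and only at the very end decide, under the lock, whether this
        # thread re-queues the channel or the main thread will do it later.
        with self.task_lock:
            self.requests.pop(0)
            if self.requests:
                self.dispatcher.add_task(self)

    def execute(self, request):
        pass                          # placeholder for the actual work
```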

One change I am not sure about is the handling of the output buffers. Due to the change, they are flushed after each request separately. Before, if the client sent multiple short requests in a pipeline such that they would be picked up with a single read(), the output buffers might only be flushed once all of them were processed (I think).

I made some obvious adjustments to the tests. I also dropped test_service_no_requests since it no longer works (service() assumes that there is a request to be processed), but it does not reflect a real-life situation anyway: add_task is only called if there is a request to be processed.

There are two more failing tests that I will check next. And, of course, I need to add tests for the new functionality.

digitalresistor commented 4 years ago

@viktordick thanks for updating it. I haven't taken a look yet, but I saw that @mmerickel provided some feedback. I've been busy moving; hopefully once things calm down a little bit I'll get back to this and help figure out the best path forward.

viktordick commented 4 years ago

@bertjwregeer, @mmerickel So I tried to read through the code in order to get a feeling of when each variable is used and came up with the following:

- task.close_on_finish
- close_when_flushed
- will_close
- connected

In summary: connected is the signal that everything is closed and finished and nothing should be done on the channel until the dispatcher reuses it for a new connection. Using connected directly from the main thread on client disconnection might fail to clean up the output buffers and leave that to the garbage collector, which might miss buffers that have overflowed into files. Worse, it might contaminate the next connection that reuses the same Channel instance, although I have not dug deep enough into the code to verify this.

Even though the tests were successful once I merged client_disconnected and not connected, I guess it is better to restore the separation? Maybe under another name and with additional comments explaining the difference.

Curiously, the tests passed after I pushed 41155fc, but after I changed one comment that was not yet up to date, a test failed. There seems to be a race condition that I have so far been unable to reproduce. I am not sure whether it was already there before or whether the problem is related to the handling of connected.

Could any of you please confirm my conclusion or show me where I am mistaken?

mmerickel commented 4 years ago

Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused so there is no concern about the buffer cleanup or connected getting reset to True.

From what I'm seeing, connected is the right hook to use here to indicate that the connection is dead and the client is gone. This is what your current code is using. It sounds like this is a concern for you, but I'm not clear why, or what you think another variable would capture better. Is there a lifecycle event that you want to catch before it is closed?

I noticed that there's a potential race condition in channel.received where we are using _flush_some to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.

viktordick commented 4 years ago

Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused so there is no concern about the buffer cleanup or connected getting reset to True.

But why do we then need close_when_flushed and will_close? ClientDisconnected is raised if the task tries to write part of the response into the output buffers and the client has disconnected. But why does service then go through all the steps instead of simply setting connected to False, emptying self.requests and returning? Even if handle_close should be called from the main thread instead of the task thread (which I guess is the case), that only explains the existence of will_close, not of close_when_flushed - I think.

I could imagine some sort of graceful shutdown which might use close_when_flushed to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.

I noticed that there's a potential race condition in channel.received where we are using _flush_some to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.

Thanks, I will check this out.

mmerickel commented 4 years ago

But why do we then need close_when_flushed and will_close

So there's a bunch of reasons to close that are triggered via client, wsgi app, or process:

I could imagine some sort of graceful shutdown which might use close_when_flushed to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.

This is sort of supported right now via the Channel.cancel method which is invoked from the threadpool (TaskDispatcher). However it might not be as graceful as you'd expect. Right now it stops ASAP versus after draining the pending requests. I think as we get further into supporting a real graceful shutdown that behavior would change.

mmerickel commented 4 years ago

Also yes, will_close is being used to ensure we are invoking handle_close from within the event loop and not from the channel's thread. Like I said above I think that its meaning would change if we added a better graceful shutdown. Likely to something closer to "close after flushing the current list of requests plus one we might be reading right now, and do not start reading any new ones".

viktordick commented 4 years ago

I guarded the writes into the outbufs in received() with the outbuf_lock, which seems to work well.

I'd say this PR is ready for a re-review. One check seems to be stuck - I had this problem with some other PR on github once and it seemed to be necessary to close and re-open the PR to get the tests to run again. Should I do this here also?

digitalresistor commented 4 years ago

Maintainers have the ability to hit the re-run all jobs button. So no need to close/open this PR.

mmerickel commented 4 years ago

I had only a couple items to note in my review - this is looking great and getting really close.

viktordick commented 4 years ago

I finished implementing a send_continue that is fired either while reading a request (but only if there are no requests currently queued or running) or later (once the last queued request is finished and there is an incomplete request that expects a continue). That way, we can keep the socket readable and are able to detect client disconnects even if we are in the middle of an expect-continue.

This takes advantage of the fact that a server may omit the 100-continue response if it has already read some of the body of the corresponding request (I read this in the specs).

I am not sure why send_continue marks the request as finished - it already did that before, but I do not think that this is correct.

The tests ran fine on my system, but it seems one test failed on github. I guess I will have to take another look at that.

digitalresistor commented 4 years ago

This takes advantage of the fact that a server may omit the 100-continue response if it has already read some of the body of the corresponding request (I read this in the specs).

Where did you read this?

The client hasn't sent a body at this point, and is waiting for the server to respond. Clients implement a fallback that will send the body anyway (sometimes) if no response comes back, but it is absolutely expected that a server sends a 100 continue.
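For reference, this is roughly what the client side of the handshake looks like, including the fallback of sending the body anyway if no interim reply arrives (a raw-socket sketch against an assumed local test server, not code from this PR):

```python
# Client-side sketch of Expect: 100-continue: send headers, wait briefly for
# the interim "100 Continue", then send the body either way (the fallback).
import socket

HOST, PORT = "127.0.0.1", 8080          # assumed local test server
body = b"some request body"

headers = (
    "POST /upload HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    f"Content-Length: {len(body)}\r\n"
    "Expect: 100-continue\r\n"
    "\r\n"
).encode("ascii")

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(headers)
    sock.settimeout(1.0)                # how long to wait for the interim reply
    try:
        interim = sock.recv(1024)       # ideally b"HTTP/1.1 100 Continue\r\n\r\n"
    except socket.timeout:
        interim = b""                   # server stayed silent: fall back
    sock.sendall(body)                  # send the body in either case
    sock.settimeout(None)
    final = sock.recv(65536)            # start of the final response
    print(interim, final)
```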

viktordick commented 4 years ago

Where did you read this?

I read it here:

An origin server MAY omit a 100 (Continue) response if it has already received some or all of the request body for the corresponding request.

Not sure though if this also holds for HTTP/1.0

viktordick commented 4 years ago

Alright, I was confused by the line request.completed = False - somehow I misread it as request.completed = True. It was somewhat late... Anyhow, I added a test that a request using 100-continue actually receives the request body. I also read here that HTTP/1.0 does not support 100-continue (actually, none of the 1xx status codes are defined in 1.0), so I think this is now ready for review.

@bertjwregeer I hope I have read the specification correctly regarding omitting the 100-continue response if a body was already transmitted. If you know of a problem with this implementation in any browser, proxy or other client, we should take that into account, but without a concrete problem I guess we are safe relying on the specs.

Anyway, there seems to be one flaky test, namely test_request_body_too_large_with_no_cl_http10 in test_functional.py under pypy3: sometimes the ConnectionClosed exception is not raised. I am not sure whether that is a new problem; I think I also saw it once on the master branch.

digitalresistor commented 4 years ago

Okay, looking at https://tools.ietf.org/html/rfc7231#section-5.1.1 it does look like the server can skip sending the 100 Continue if a request body has already been sent by the client, but the standard does say that a client will wait before sending a request body if it has sent an Expect: 100-continue request in the first place.

So this may work for the second or third request in a pipelined setup, but we should endeavour to send a reply back as soon as possible.

viktordick commented 4 years ago

So this may work for the second or third request in a pipelined setup, but we should endeavour to send a reply back as soon as possible.

Yes, that is what the code does now. If there are no other requests currently being executed or queued for execution, the 100-continue is sent as soon as the request that expects it has finished submitting its headers (as before). Only if the request is pipelined after other requests that are still being executed is the 100-continue delayed until all tasks before it have finished, which is necessary to keep the responses in the correct order.

So the 100-continue is always written as soon as it is possible to write it without messing up the order of the responses.

mmerickel commented 4 years ago

@viktordick this is awesome, in my mind the only thing missing is a changelog entry!

viktordick commented 4 years ago

@mmerickel Thanks! I wrote a changelog entry and squashed the commits.

@bertjwregeer Are you also happy with the result? There is still a change request from you recorded on the PR, but I think everything mentioned there has been addressed in the meantime.

digitalresistor commented 4 years ago

Thanks for being patient and working through this. I want to read this critically one last time and make sure that I've convinced myself that there are no locking issues/race conditions.

Thanks for all your hard work!

digitalresistor commented 4 years ago

Thanks so much @viktordick for your patience and working through this. This looks great, and I have adequately convinced myself that this won't break!

viktordick commented 4 years ago

Thanks a lot for the help and patience!

Is a release already planned sometime soon?

digitalresistor commented 3 years ago

@viktordick https://pypi.org/project/waitress/2.0.0b0/

I need testers to validate this doesn't break anything, but there is now a released version of waitress you can try in your environment.

viktordick commented 3 years ago

Thanks. We are actually already using it - we simply use

git+https://github.com/perfact/waitress@2.0.0dev0.perfact.2

in our requirements.txt, which is essentially the current state of upstream master plus a commit that adds a custom tag. The number of systems that actually use it is still small, though (I think no production systems yet).

wondie commented 3 years ago

I couldn't use it with waitress serve in version 2.0. I get a stack trace with ValueError: Unknown adjustment 'client_disconnected' when calling serve(MyFlask.app, host="111.111.111", port=os.environ["PORT"], channel_request_lookahead=5, client_disconnected=run_on_exit). Is it implemented yet?

mmerickel commented 3 years ago

That's not how this feature works. You can't inject your own callable into the environ.

wondie commented 3 years ago

Then how is it used? I couldn't find any example on it.

mmerickel commented 3 years ago

When using waitress, the WSGI environ provides an extra key with a callable that can be invoked to see whether the client is still connected. It is not a callback; you have to poll it in your request-handling code.
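For example, a minimal sketch of that polling pattern (the environ key waitress.client_disconnected and the requirement to run with channel_request_lookahead > 0 are taken from the waitress 2.0 documentation, not from this thread):

```python
# Minimal polling sketch: the environ exposes a callable that reports whether
# the client has disconnected; the application has to check it itself.
import time
from waitress import serve

def app(environ, start_response):
    disconnected = environ.get("waitress.client_disconnected")
    start_response("200 OK", [("Content-Type", "text/plain")])
    for _ in range(30):                     # stand-in for long-running work
        if disconnected is not None and disconnected():
            break                           # client is gone, stop working
        time.sleep(1)
    return [b"finished\n"]

if __name__ == "__main__":
    # The callable only reports disconnects if waitress reads ahead.
    serve(app, host="127.0.0.1", port=8080, channel_request_lookahead=5)
```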