@bertjwregeer Is there a chance that this PR will be accepted if I fix the tests and add new ones for the new feature? Or do you think our request is too exotic and the added complexity should be avoided?
@bertjwregeer Thanks for the review.
Regarding 3: I guess there are three possibilities:

A) The application inserts a callable into the environment that waitress then calls (in the main thread, not in the worker thread!).
B) Waitress inserts a flag into the environment once the connection is closed and the application checks for it.
C) Waitress inserts a callable into the environment that the application can use to check for client disconnection.
I was thinking that A) would be easiest to deal with on the application side, especially if we are in a situation where we spawned another process or are currently waiting for an SQL database query to be performed. When triggered, we could then terminate the spawned process or the database worker process.
However, this has the mentioned disadvantage that the handler interrupts the main thread of waitress. And maybe it also adds unnecessary complexity (maybe the additional lock was necessary due to this approach and is not required with C, but I am not sure yet). I will try to think about this a bit more.
Regarding 1:
I think the additional lock is necessary because `handle_read` is running in the main thread while `service` is running in the task thread, and before this change they were never executed simultaneously for the same channel (while there was a request being processed, `readable` always returned `False`). I will try to think a little bit more about this - maybe by switching the strategy to C it is not necessary. But I do not think it is related to the strategy; it is simply because we have one thread that might create new requests (the main thread) and one that is processing requests, and they now may run simultaneously, which they did not do before.
Regarding 2:
I was not sure how to implement a flag that may be set using PasteDeploy before, so in this draft I used the global variable `max_requests_per_channel`, which should restore the default functionality if set to `0` (I guess the name should be changed to something like `max_lookahead_requests`). If I understand correctly, I can set this as one of the parameters to `waitress.serve`. By giving it a default value of `0` there, this should restore the default functionality. I will try it.
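For illustration, here is a minimal sketch of how such an option is passed through `waitress.serve`. I use the name `channel_request_lookahead` that the setting eventually received, so treat the exact keyword (and the requirement of a waitress version that knows it) as an assumption at this point in the discussion:

```python
# Hypothetical usage sketch: the look-ahead option passed as a keyword to
# waitress.serve; a value of 0 (the default) keeps the old behaviour.
from waitress import serve

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

if __name__ == "__main__":
    # channel_request_lookahead is the name the option ended up with.
    serve(app, listen="127.0.0.1:8080", channel_request_lookahead=5)
```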
Addendum: Regarding the difference between strategies B and C, I guess with C we do not need to make sure to catch the environment of the currently running task as well as any possibly queued tasks. However, I guess we anyhow do not want to start any more tasks if the client disconnected, do we? In any case, C seems to be superior to B. (C was also the strategy that @mmerickel suggested).
@bertjwregeer I tried to implement the requested changes. There is still a lot to be done regarding tests and documentation, but is the general idea now something that might be accepted?
Now that I looked over it again, the additional lock I had in the original attempt was due to race conditions with `self.current_task`, which was needed in order to implement strategy A as mentioned above. It is no longer needed with C (the remaining race condition is between `received()` adding requests to be processed and `service()` removing them, but `self.requests.extend(requests)` is already protected by the GIL).
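To make that pattern concrete, here is a small stand-alone sketch (not waitress code) of the two operations that now run concurrently; the names only mirror the discussion above:

```python
# Stand-alone illustration: received() extending the shared request list in
# the main thread and service() taking requests off it in the task thread.
# A single list.extend or list.pop is atomic under CPython's GIL, which is
# why this particular pair needs no extra lock.
import threading

requests = []

def received(parsed):            # main thread: new pipelined requests arrive
    requests.extend(parsed)

def service():                   # task thread: process queued requests
    while requests:
        request = requests.pop(0)
        print("processing", request)

received(["GET /a HTTP/1.1", "GET /b HTTP/1.1"])
worker = threading.Thread(target=service)
worker.start()
worker.join()
```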
@bertjwregeer I changed the implementation so `service()` only processes one request and re-inserts itself using `add_task` if there are more requests. This requires a lock to cleanly communicate whether the main thread should call `add_task` or the task thread should do it, and to prevent race conditions. However, I am pretty sure that I could re-use the `task_lock` for this. It is no longer necessary to guard the complete `service` method with this lock, since by only re-inserting the next task at the very end of the function it is ensured that no two requests of the same channel are processed in the wrong order or in parallel.

However, this makes the diff against `master` rather large, since most of the content in `service()` changed its indentation level.
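A condensed, hypothetical sketch of that hand-off - the names (`task_lock`, `add_task`, `requests`) mirror the discussion, not the actual waitress source - just to show who reschedules the next request:

```python
# Hypothetical sketch of "process one request, then reschedule".
import threading

class DispatcherStub:
    """Toy stand-in for the thread pool: queue channels, drain them later."""
    def __init__(self):
        self.queue = []

    def add_task(self, channel):
        self.queue.append(channel)

    def run_all(self):
        while self.queue:
            self.queue.pop(0).service()

class ChannelSketch:
    def __init__(self, dispatcher):
        self.dispatcher = dispatcher
        self.requests = []
        self.task_lock = threading.Lock()

    def received(self, new_requests):
        # Main thread: add freshly parsed requests; only schedule a task if
        # the channel was idle, otherwise the running task reschedules itself.
        with self.task_lock:
            was_idle = not self.requests
            self.requests.extend(new_requests)
            if was_idle:
                self.dispatcher.add_task(self)

    def service(self):
        # Task thread: handle exactly one request ...
        self.process(self.requests[0])
        # ... and only at the very end decide whether to reschedule, so no
        # two requests of the same channel run out of order or in parallel.
        with self.task_lock:
            del self.requests[0]
            if self.requests:
                self.dispatcher.add_task(self)

    def process(self, request):
        print("processed", request)

dispatcher = DispatcherStub()
channel = ChannelSketch(dispatcher)
channel.received(["GET /a HTTP/1.1", "GET /b HTTP/1.1"])
dispatcher.run_all()
```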
One change I am not sure about is the handling of the output buffers. Due to the change, they are flushed after each request separately. Before, if the client sent multiple short requests in a pipeline such that they were picked up with a single `read()`, the output buffers might only be flushed once all of them were processed (I think).
I did some obvious adjustments to the tests. I also dropped `test_service_no_requests`, since it no longer works (`service()` assumes that there is a request to be processed). It did not reflect a real-life situation anyhow: `add_task` is only called if there is a request to be processed.
There are two more failing tests that I will check next. And, of course, I need to add tests for the new functionality.
@viktordick thanks for updating it, I haven't taken a look yet but I saw that @mmerickel provided some feedback. I've been busy moving; hopefully once things calm down a little bit I'll get back to this and help figure out the best path forward.
@bertjwregeer, @mmerickel So I tried to read through the code in order to get a feeling of when each variable is used and came up with the following:
- `task.close_on_finish` is set if the task raises `ClientDisconnected` while trying to write part of the response, i.e. if the socket is closed or if the task fails to write a header.
- `close_when_flushed` is then set by `service()`, which also closes all pending requests after processing the current task.
- `close_when_flushed` is checked in `handle_write`; if there are no more buffers to be flushed or if there is an error, it sets `will_close`.
- `will_close` triggers in `handle_write` a call to `handle_close`, which cleans up buffers (these might even be on disk).
- `handle_close` sets `connected = False` after everything is cleaned up.
- `connected` is checked in `handle_write`; if `False` inside functions that might be called from within the task thread, it raises `ClientDisconnected`. It is only set to `True` inside the channel in `wasyncore.dispatcher`, when the channel is recycled for a new connection.

In summary: `connected` is the signal that everything is closed and finished and nothing should be done on the channel until the dispatcher reuses it for a new connection. Using `connected` directly from the main thread on client disconnection might fail to clean up output buffers and depend on the garbage collector to do this, which might miss buffers that have overflown into files. Worse, it might contaminate the next connection that reuses the same `Channel` instance, although I have not dug this deep into the code to verify this.
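To summarize the ordering in one place, here is a compact stand-alone sketch of the flag hand-off as I understand it (again not the real source, and the method bodies are drastically simplified):

```python
# Simplified sketch of the close sequence summarized above: the task thread
# only ever sets close_when_flushed; the main thread's write handler escalates
# that to will_close, handle_close and finally connected = False.
class ClientDisconnected(Exception):
    pass

class ChannelCloseSketch:
    def __init__(self):
        self.connected = True
        self.close_when_flushed = False
        self.will_close = False
        self.outbufs = [b"partial response"]

    def write(self, data):
        # Called from the task thread while building the response.
        if not self.connected:
            raise ClientDisconnected()
        self.outbufs.append(data)

    def service(self):
        # Task thread: a failed write surfaces as ClientDisconnected and only
        # flips close_when_flushed; nothing is torn down here.
        try:
            self.write(b"rest of response")
        except ClientDisconnected:
            self.close_when_flushed = True

    def handle_write(self):
        # Main thread (event loop): flush, then walk the flags in order.
        self.outbufs.clear()              # pretend everything was flushed
        if self.close_when_flushed and not self.outbufs:
            self.will_close = True
        if self.will_close:
            self.handle_close()

    def handle_close(self):
        self.outbufs.clear()              # buffers may even live on disk
        self.connected = False            # the very last step

channel = ChannelCloseSketch()
channel.connected = False                 # simulate the client going away
channel.service()
channel.handle_write()
assert channel.will_close and not channel.connected
```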
Even though the tests were successful once I merged `client_disconnected` and `not connected`, I guess it is better to restore the separation? Maybe under another name and with additional comments explaining the difference.
Curiously, the tests were successful after I pushed 41155fc, but after I changed one comment that was not yet up-to-date, a test failed. There seems to be a race condition that I have been unable to reproduce until now. Not sure if it was already there before or if the problem is related to the handling of `connected`.
Could any of you please confirm my conclusion or show me where I am mistaken?
Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused, so there is no concern about the buffer cleanup or `connected` getting reset to `True`.
From what I'm seeing, `connected` is the right hook to use here to indicate that the connection is dead and the client is gone. This is what your current code is using. It sounds like this is a concern for you, but I'm not clear why or what you think another variable will help capture better. Is there a lifecycle that you want to try to catch before it is closed?
I noticed that there's a potential race condition in `channel.received` where we are using `_flush_some` to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.
> Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused so there is no concern about the buffer cleanup or `connected` getting reset to `True`.

But why do we then need `close_when_flushed` and `will_close`? `ClientDisconnected` is raised if the task tries to write part of the response into the output buffers and the client has disconnected. But why is `service` then going through all the steps instead of simply setting `connected` to `False`, emptying `self.requests` and returning? Even if `handle_close` should be called from the main thread instead of the task thread (which is the case, I guess), this only explains the existence of `will_close`, not of `close_when_flushed` - I think.
I could imagine some sort of graceful shutdown which might use `close_when_flushed` to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.
> I noticed that there's a potential race condition in `channel.received` where we are using `_flush_some` to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.

Thanks, I will check this out.
> But why do we then need `close_when_flushed` and `will_close`?
So there's a bunch of reasons to close that are triggered via client, wsgi app, or process:
- the wsgi app requests it (`close_when_flushed = True`)
- the client requests it (`close_when_flushed = True`)
- the client disconnects (`connected = False`)
- the process cancels the channel (`Channel.cancel` / `will_close = True`)
> I could imagine some sort of graceful shutdown which might use close_when_flushed to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.

This is sort of supported right now via the `Channel.cancel` method, which is invoked from the threadpool (TaskDispatcher). However it might not be as graceful as you'd expect. Right now it stops ASAP versus after draining the pending requests. I think as we get further into supporting a real graceful shutdown that behavior would change.
Also yes, `will_close` is being used to ensure we are invoking `handle_close` from within the event loop and not from the channel's thread. Like I said above, I think that its meaning would change if we added a better graceful shutdown - likely to something closer to "close after flushing the current list of requests plus one we might be reading right now, and do not start reading any new ones".
I secured the writing into the `outbufs` in `received` with the `outbuf_lock`, which seems to work well.
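For reference, the pattern boils down to something like this tiny sketch (illustrative names, not waitress source): both the main thread and the task thread touch the output buffers only while holding the same lock.

```python
# Tiny sketch of the fix: the main thread (flushing in received()/_flush_some)
# and the task thread (appending response data) share one lock around the
# output buffers.
import threading

outbuf_lock = threading.Lock()
outbufs = [bytearray()]

def append_response_data(data):     # task thread
    with outbuf_lock:
        outbufs[-1] += data

def flush_some():                   # main thread / event loop
    with outbuf_lock:
        data = bytes(outbufs[-1])
        outbufs[-1].clear()
        return data
```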
I'd say this PR is ready for a re-review. One check seems to be stuck - I had this problem with some other PR on github once and it seemed to be necessary to close and re-open the PR to get the tests to run again. Should I do this here also?
Maintainers have the ability to hit the re-run all jobs button. So no need to close/open this PR.
I had only a couple items to note in my review - this is looking great and getting really close.
I finished implementing a `send_continue` that is fired either while reading a request (but only if there are no requests currently queued or running) or later (if the last queued request is finished and there is an incomplete request that expects a continue). That way, we can continue to keep the socket readable and are able to detect client disconnects even if we are in the middle of an expect-continue.
This takes advantage of the fact that a server may omit the 100-continue response if it has already read some of the body of the corresponding request (I read this in the specs).
I am not sure why `send_continue` marks the request as finished - it already did that before, but I do not think that this is correct.
The tests ran fine on my system, but it seems one test failed on github. I guess I will have to take another look at that.
> This takes advantage of the fact that a server may omit the 100-continue response if it has already read some of the body of the corresponding request (I read this in the specs).
Where did you read this?
The client hasn't sent a body at this point, and is waiting for the server to respond. Clients implement a fallback that will send the body anyway (sometimes) if no response comes back, but it is absolutely expected that a server sends a 100 continue.
> Where did you read this?

I read it here:

> An origin server MAY omit a 100 (Continue) response if it has already received some or all of the request body for the corresponding request.
Not sure though if this also holds for HTTP/1.0
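For anyone following along, a minimal client-side sketch of the handshake being discussed (the address, port, and path are hypothetical; the interim `100 Continue` may be skipped by the server once it has started receiving the body):

```python
# Minimal client-side sketch of Expect: 100-continue, assuming some HTTP/1.1
# server is listening on localhost:8080.
import socket

head = (
    b"POST /upload HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Content-Length: 5\r\n"
    b"Expect: 100-continue\r\n"
    b"\r\n"
)

with socket.create_connection(("localhost", 8080)) as sock:
    sock.sendall(head)
    # Normally the server answers with an interim response before the body:
    print(sock.recv(1024))        # e.g. b"HTTP/1.1 100 Continue\r\n\r\n"
    sock.sendall(b"hello")        # now send the body
    print(sock.recv(65536))       # the final response follows
```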
Alright, I was confused regarding the line `request.completed = False` - somehow I mistook it for saying `request.completed = True`. It was somewhat late...
Anyhow, I added a test that a request that uses 100-continue actually receives the body of the request.
I also read here that HTTP/1.0 does not support 100-continue (actually, none of the 1xx status codes are defined in 1.0), so I think this is now ready for a review.
@bertjwregeer I hope I have read the specification correctly regarding omitting the 100-continue response when a body was already transmitted. If you know of a problem with this implementation in any browser, proxy, or other client, we should take that into account, but without a concrete problem I guess we should be safe relying on the specs.
Anyway, there seems to be one flaky test, namely `test_request_body_too_large_with_no_cl_http10` in `test_functional.py` under pypy3. Sometimes the exception `ConnectionClosed` is not raised. But I am not sure if that is a new problem - I think I also saw it once on the master branch.
Okay, looking at https://tools.ietf.org/html/rfc7231#section-5.1.1 it does look like the server can ignore sending the 100 Continue if a request body has already been sent by the client, but the standard does say that a client will wait before sending a request body if it has sent an Expect 100 continue request in the first place.
So this may work for the second or third request in a pipelined setup, but we should endeavour to send a reply back as soon as possible.
> So this may work for the second or third request in a pipelined setup, but we should endeavour to send a reply back as soon as possible.
Yes, that is what the code does now. If there are no other requests currently being executed or queued to be executed, the 100-continue is sent as soon as the request that expects it has finished submitting the headers (as before). Only if the request is pipelined after other requests that are still being executed is the 100-continue delayed until all tasks before it have finished, which is necessary to keep the order of responses correct.
So the 100-continue is always written as soon as it is possible to write it without messing up the order of the responses.
@viktordick this is awesome, in my mind the only thing missing is a changelog entry!
@mmerickel Thanks! I wrote a changelog entry and squashed the commits.
@bertjwregeer Are you also happy with the result? There is still a change request from you recorded on the PR, but I think everything mentioned there has been addressed in the meantime.
Thanks for being patient and working through this. I want to read this critically one last time and make sure that I've convinced myself that there are no locking issues/race conditions.
Thanks for all your hard work!
Thanks so much @viktordick for your patience and working through this. This looks great, and I have adequately convinced myself that this won't break!
Thanks a lot for the help and patience!
Is a release already planned sometime soon?
@viktordick https://pypi.org/project/waitress/2.0.0b0/
I need testers to validate this doesn't break anything, but there is now a released version of waitress you can try in your environment.
Thanks. We are actually already using it - we simply use `git+https://github.com/perfact/waitress@2.0.0dev0.perfact.2` in our requirements.txt, which is essentially the current state of upstream master plus a commit that gives a custom tag. The number of systems that actually use it is still small, though (I think no production system yet).
I couldn't use it with waitress serve in version 2.0. I get a stack trace `ValueError: Unknown adjustment 'client_disconnected'` when using it as `serve(MyFlask.app, host="111.111.111", port=os.environ["PORT"], channel_request_lookahead=5, client_disconnected=run_on_exit)`. Is it implemented yet?
That's not how this feature works. You can't inject your own callable into the environ.
Then how is it used? I couldn't find any example on it.
When using waitress, the wsgi environ provides an extra key with a callable that can be invoked to see if the client is still connected. It is not a callback; you have to poll it in your request-handling code.
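A minimal sketch of that polling pattern, assuming the environ key is named `waitress.client_disconnected` (it is only populated when the server runs with `channel_request_lookahead` greater than 0):

```python
# Sketch of an app polling the client-disconnected callable during slow work.
import time

def app(environ, start_response):
    client_disconnected = environ.get(
        "waitress.client_disconnected", lambda: False
    )
    for _ in range(100):                  # stand-in for slow work
        if client_disconnected():
            # Client went away: stop the expensive work early.
            break
        time.sleep(0.1)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done\n"]
```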
Implementation of the idea discussed in #308. The tests don't pass yet and I should also implement additional tests for this, but hopefully this helps with the discussion.