Closed oakkitten closed 2 years ago
@oakkitten could you try this patch: https://github.com/Pylons/waitress/pull/377 and see if it stops the main thread from dying? I have been unable to come up with a good way to test/validate this fixes the issue.
It's not the best solution, but short of rewriting asyncore
it hopefully provides relief as we just try again.
I'm not familiar with the codebase, but from what I've seen while trying to catch the issue—
Perhaps, instead of calling del_channel
, which mutates map
, and then closing the socket in a worker thread, it would be more simple to just flag the channel for closing? Like, you set a flag on a channel, such as must_be_closed_in_the_main_loop
, or perhaps have a separate list of objects that must be closed (lists are said to be thread-safe), and then the main loop checks if anything should be closed before doing select()
. This way the code would be a bit easier to reason about I think.
Also, you are doing
if fds != map.keys():
fds = map.keys()
dict.keys()
returns not a copy of keys, but a special dict_keys
object that is a view of dictionary keys that changes along with the dictionary. So if read this right, fds == map.keys()
will always be true.
Also, there's the issue with the modification of map
itself. While it never blew up in my face during tests, it probably can, since when you modify a dictionary while iterating you get this:
>>> d = {x: x for x in range(5)}
>>> for x in d.keys():
... print(x)
... if x == 3:
... del d[x]
...
0
1
2
3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
I suppose even doing list(map.keys())
is dangerous.
Good point on the dict.keys()
I forgot about that change in Py3.
We never iterate the map and delete from it though in the code change I made.
I am trying to avoid changing too much of asyncore as that was wholesale lifted from Py 3.5.
If you were to use poll()
instead of select()
I would imagine the problem would be masked because poll()
will simply return that there is an error on the socket, whereas select()
fails and raises OSError
. So I was attempting to make select()
act more like poll()
in this situation.
This only really happens if the remote closes the connection before having read the full response, which is likely why it is not often seen in the wild, as most reverse proxies will have already read the full response.
The problem with attempting to add yet another list, is that there is no single object, the map
is passed down from the serve
function, and is passed around, but to add another list that would mean adding another variable that is passed around to the various functions as the loop()
function does not currently have that.
Trying to set a flag inside the thread and trying to make sure that can be read throughout the asyncore
functions so that it doesn't close the sockets was something I looked at, but it is difficult as there are various places where asyncore
expects to be able to close the socket.
It's going to require more intrusive changes, which makes sense as asyncore
was never really meant to be thread safe. This is a consequence of trying to send data faster without waking the main thread up.
Using list(map.keys())
should be fine. Since the only reason select()
can fail is if it has a file descriptor that is no longer valid (and thus no longer in the map). Comparing the original map we used for select()
against the new map.keys()
should show that there is a difference, at that point we just start the select()
call again with the new list of file descriptors.
Eventually a call to select()
will succeed.
I am going to see if I can write a test that causes the race.
You mentioned you are seeing this in tests, are you running waitress in a thread or a separate process?
You don't iterate the map and delete from it in the code change you made, but I think you might be doing that in the already existing code?
list(map.items())
. close()
calls del_channel()
, and del_channel()
says del map[fd]
. This can (and will) happen during the above list()
call.So the question is, is list()
doing or calling anything that touches GIL in some kind of a guaranteed way. Well, I don't know, really. I asked on #python if it will call some magic thread-safe method for converting keys to a list and no-one was aware of such a method. And I don't see the documentation mention anything like that. Even if this is safe and indeed not an implementation detail, I know I wouldn't want to rely on this behavior, anyway. This is too complicated!
I am testing an addon for a Qt app that uses waitress internally. It launches it in a thread. This is a patch I made to fix the issue in a dumb way, which should be ok for tests. If you are willing, you can try grabbing the commit before that and running the tests (see tox.ini for instructions; the tests are slow and require a lot of dependencies so it won't be fun). The problem is rare, but you can force it by inserting some kind of a delay before the select
call, then it manifests right away.
I updated https://github.com/Pylons/waitress/pull/377#issuecomment-1136651592 with the new changes. Give this a shot please.
It removes any races, only the main thread can close the socket...
I pulled 4f6789b035610e0552738cdc4b35ca809a592d48 and 4800 tests and 108 minutes later there were no crash. Thanks! 🎉
I spot that the description for CVE-2022-31015 mentions that this affects "versions 2.1.0 and 2.1.1". However, a quick glance at the code suggests that this might be because it affects the wasyncore
module which was vendored in version 1.2. Does this issue, then, affect these older versions as well? Thanks. :)
No. It’s not a bug in wasyncore but rather waitress began trying to invoke wasyncore methods like close() from other threads that caused the issue.
Getcha. However, the fix essentially requires the vendored version, no? Otherwise the do_close
thing can't be passed around? :)
It enables a performance optimization where waitress can write to the socket safely from a thread.
@lamby the vendored version is always used, even on Python versions that have asyncore
support built-in. Waitress does not attempt to use the non-vendored version and then fall back to the vendored version.
That makes sense. However, someone using a very very old version of waitress (prior to the module being vendored in, that is...) would be vulnerable to this issue?
Again no because the bug was only due to a change in how we USED asyncore in the specified versions of waitress. Waitress used it differently and safely before then. And it does again after the cve fix. We documented the affected versions correctly in the cve.
Thanks, really appreciate it. :)
I am facing this as well:
import waitress
from flask import Flask
app = Flask(__name__)
wsgi = waitress.create_server(app, host='0.0.0.0', port='1234')
wsgi.run()
While the main loop is running, I am calling wsgi.close()
from elsewhere, which yields:
Traceback (most recent call last):
File "/myPath/./main.py", line 1368, in main
self.wsgi.run()
File "/myPath/lib/python3.11/site-packages/waitress/server.py", line 322, in run
self.asyncore.loop(
File "/myPath/lib/python3.11/site-packages/waitress/wasyncore.py", line 245, in loop
poll_fun(timeout, map)
File "/myPath/lib/python3.11/site-packages/waitress/wasyncore.py", line 172, in poll
r, w, e = select.select(r, w, e, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 9] Bad file descriptor
Is there a recommended way of preventing this or handling it in a more clean way? Thanks! I'm using
python3.11
waitress2.1.2
I just might be wrong because if this indeed a race condition it should be breaking more things. Anyway, I got this exception (line numbers might be wrong due to debug statements):
This error was extremely rare but since I was getting it while running tests I could just run a lot of them until one failed, which I did, and I think the problem is a follows.
First, thread
Thread-1
that the app I'm testing is launching, one that runs waitress server, assembles the descriptor lists forselect
: https://github.com/Pylons/waitress/blob/603d2c12ce09fcd1be0b7a5734b4fca339820286/src/waitress/wasyncore.py#L154-L166Then, thread
waitress-0
deletes one of the channels, in my case it was<waitress.channel.HTTPChannel 127.0.0.1:54044 at 0x7f10ec052400>
, and immediately closes the socket: https://github.com/Pylons/waitress/blob/603d2c12ce09fcd1be0b7a5734b4fca339820286/src/waitress/wasyncore.py#L460-L470Stack of
waitiress-0
at the moment:Then, thread
Thread-1
is trying to see if the file descriptor of the socked closed above is writable, which leads to the the exception above: https://github.com/Pylons/waitress/blob/603d2c12ce09fcd1be0b7a5734b4fca339820286/src/waitress/wasyncore.py#L171-L177Python 3.8.10, waitress 2.1.1, Ubuntu 20.04 LTS focal @ WSL2