Closed by dsuch 10 years ago.
I'm not sure if this is the same situation, but on Amazon EC2, I run into SSL handshake issues with their Health Check system:
Traceback (most recent call last):
File ".../gunicorn/workers/sync.py", line 92, in handle
req = six.next(parser)
File ".../gunicorn/http/parser.py", line 39, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
File ".../gunicorn/http/message.py", line 152, in __init__
super(Request, self).__init__(cfg, unreader)
File ".../gunicorn/http/message.py", line 49, in __init__
unused = self.parse(self.unreader)
File ".../gunicorn/http/message.py", line 164, in parse
self.get_data(unreader, buf, stop=True)
File ".../gunicorn/http/message.py", line 155, in get_data
data = unreader.read()
File ".../gunicorn/http/unreader.py", line 38, in read
d = self.chunk()
File ".../gunicorn/http/unreader.py", line 65, in chunk
return self.sock.recv(self.mxchunk)
File "/usr/lib/python2.7/ssl.py", line 241, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 160, in read
return self._sslobj.read(len)
SSLError: [Errno 1] _ssl.c:1413: error:140780E5:SSL routines:SSL23_READ:ssl handshake failure
This seems to be mitigated when I set do_handshake_on_connect=False.
However, I still get these:
2013-10-21 12:33:00 [29112] [CRITICAL] WORKER TIMEOUT (pid:29212)
2013-10-21 12:33:00,197 gunicorn.error [CRITICAL] glogging.py 204 WORKER TIMEOUT (pid:29212)
Note that while I see these errors in the log files, in practice, the system runs fine – which is a little odd.
Hi @malthe - you say it seems to go away after setting 'do_handshake_on_connect=False' which is a default value.
I'm not sure if it means you first deleted this argument like I suggested, then noticed an issue so you reverted it to the original value?
In the sources, this value is set to False, but the default value is True. I removed the keyword argument from my local gunicorn source.
OK, I get it - so you removed the argument, the SSL handshake failure appeared and then you set it back to False, as gunicorn had it originally. Is that correct?
I'm wondering if you did any other changes perhaps, like requiring client certificates? (Probably not as you would've mentioned it)
No, it's the other way around :-o
If I leave the option set as gunicorn has it by default (which is False, for no handshake on connect), then I get the SSL handshake error.
But if I change it to True (which is the same as leaving the argument out), then the error goes away. In other words, the default behavior causes problems for me.
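For reference, the stdlib's default can be checked directly. A quick sketch (using the ssl.SSLContext API, since the module-level ssl.wrap_socket that gunicorn used here was removed in Python 3.12; the default was the same in both):

```python
import inspect
import ssl

# Inspect the default value of do_handshake_on_connect in the stdlib.
sig = inspect.signature(ssl.SSLContext.wrap_socket)
default = sig.parameters["do_handshake_on_connect"].default
print(default)  # True: by default the handshake runs as part of connect/accept
```

So passing do_handshake_on_connect=False is an explicit opt-out of the stdlib default, not the other way around.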
Whether this is related to the fact that gunicorn routinely kills idle workers even when a constant stream of requests (one per second) is coming in, I have not been able to determine. This seems like a bug to me. Either some of the workers are not being given any work and therefore become idle, or there is a problem with the way their idleness is checked:
if time.time() - worker.tmp.last_update() <= self.timeout:
...
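To make the quoted check concrete, here is a minimal sketch of the mechanism as I understand it (the names WorkerTmpStub and is_alive are my own, not gunicorn's): each worker periodically "notifies" by touching a temp file, and the arbiter treats a worker as dead when the last notification is older than the timeout.

```python
import time

class WorkerTmpStub:
    """Hypothetical stand-in for gunicorn's WorkerTmp: the real class
    touches a temp file on notify() and last_update() reads its ctime."""
    def __init__(self):
        self._last = time.time()

    def notify(self):
        # Called from the worker loop as it handles work / heartbeats.
        self._last = time.time()

    def last_update(self):
        return self._last

def is_alive(tmp, timeout):
    # Mirrors the arbiter check quoted above: a worker counts as alive
    # only if it has notified within the last `timeout` seconds.
    return time.time() - tmp.last_update() <= timeout
```

The consequence is that a worker whose loop stalls (for example, stuck waiting on a socket) stops notifying and is killed with WORKER TIMEOUT even though the process itself is healthy.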
I also see this entry in the logging output – and it seems to be delayed compared to the request/response flow:
2013-10-21 13:24:09 [30916] [DEBUG] ssl connection closed
It's almost as if the connection is being kept open for longer than it needs to; and this is what causes the worker to time out, too.
OK, @malthe so I think this isn't something that I originally opened this ticket for because I had client certificates in mind.
That said, '[DEBUG] ssl connection closed' is a gunicorn message which happens when Python's ssl module raises an SSL_ERROR_EOF/PY_SSL_ERROR_EOF error, which in turn has the textual representation 'EOF occurred in violation of protocol' - a fairly popular message on StackOverflow and elsewhere.
This is a generic 'something went wrong' message but maybe you can have a look at the suggestions on the web?
Definitely. I'll try and have a look at that. It's kind of odd though that the client doesn't complain; it gets the right response. It's gunicorn which is unhappy with what happens on the socket.
You mention an Amazon service, so I'm not sure if you have access to the client's logs?
What could be happening is that the client OS was forcibly closing connections after some inactivity. If you notice that it always happens say, after 180 seconds from the initial connection, this would probably confirm it.
I don't have access to the logs on that service (Elastic Load Balancer). It seems that the timeout happens 10 seconds after the GET request:
2013-10-21 14:45:58,776 gunicorn.access [INFO] __init__.py 72 10.0.1.214 - - [21/Oct/2013:14:45:58] "GET / HTTP/1.1" 200 26 "-" "ELB-HealthChecker/1.0" ""
2013-10-21 14:46:09,492 gunicorn.error [DEBUG] glogging.py 216 ssl connection closed
And for each request, I get these two about three seconds later:
2013-10-21 14:46:01 [30634] [CRITICAL] WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01,659 gunicorn.error [CRITICAL] glogging.py 204 WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01 [30634] [CRITICAL] WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01,694 gunicorn.error [CRITICAL] glogging.py 204 WORKER TIMEOUT (pid:31468)
It then boots a new worker.
But why is it a critical error that a worker timed out?
Should I open a new bug / issue for that handshake boolean parameter? I don't understand why it's necessary, but it seems that it fixes a part of the handshake process in this particular environment.
@malthe - I'm not the author of gunicorn so I don't know all its inner details but I picture a scenario like that
Now, why setting do_handshake_on_connect to True helps in your case is something I don't know. But I do know that if do_handshake_on_connect is False, then the code driving the non-blocking socket must perform the handshake itself. gunicorn doesn't do that by default, so it seems natural that you see the error you initially spotted.
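To illustrate what "must perform the handshake itself" means in practice, here is a minimal sketch (Python 3 exception names; complete_handshake is my own helper, not part of gunicorn):

```python
import select
import ssl

def complete_handshake(ssl_sock, timeout=5.0):
    """Drive the TLS handshake manually on a socket wrapped with
    do_handshake_on_connect=False, retrying until it completes."""
    while True:
        try:
            ssl_sock.do_handshake()
            return
        except ssl.SSLWantReadError:
            # Handshake needs more data from the peer; wait until readable.
            select.select([ssl_sock], [], [], timeout)
        except ssl.SSLWantWriteError:
            # Handshake needs to send data; wait until writable.
            select.select([], [ssl_sock], [], timeout)
```

If nothing ever calls do_handshake() on such a socket, the handshake only happens implicitly once the first read or write is attempted, which plausibly matches the failures seen in this thread.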
As for why the error is critical - we'd have to wait for someone from the gunicorn dev team to say. Ditto for opening a new ticket; I'd say there's no need because this one already exists, but I'm not the project's maintainer.
OK, I'm starting a project that will make use of what we discuss here. The project will also make use of the additional TLS options I added in my fork here:
https://github.com/dsuch/gunicorn/commit/16f9974
The go-live is within 2-3 months from now, but I need it already during development and tests, so I'll just start using it all without a full gunicorn release.
Sorry, got sidetracked. @dsuch can you make a PR? It will help to handle such changes. Also, if you have any way to test it, it would be appreciated a lot :)
Applied it myself to master in 5fb61cb841068881f65e8fa2f750596cbaf2a48f. Thanks for the patch.
Just as a datapoint, on AWS with Elastic Load Balancer, gunicorn's sync worker still does not work for us – even with --do-handshake-on-connect.
There are two issues:
SSLError: [Errno 1] _ssl.c:1413: error:140940E5:SSL routines:SSL3_READ_BYTES:ssl handshake failure
With --do-handshake-on-connect, (2) seems to go away, or at least becomes less prevalent. But (1) persists.
The fix for us was to simply use the eventlet worker. While we still get the exceptions in (2) when requests come from Amazon's ELB, they don't cause problems.
@malthe what is the ELB configuration?
@benoitc what part of it do you mean specifically? Also, note that we eventually decided on uwsgi (mostly because of this issue) and no longer use gunicorn, so I am not really able to follow up much on this issue.
@malthe I didn't get the notification on the closed issue, sorry for that (and a good reason to say in the contributing doc to open a new issue and only link to the closed one).
About waiting to read more on the SSL sockets - did you configure keepalive? Was connection draining enabled? Using the eventlet worker to solve the issue could indeed mean that it was a keepalive issue.
It might simply be a keep-alive timeout mismatch that causes gunicorn to slowly drain the pool of available connections such that eventually, no new connections can be made.
OK, thanks anyway and sorry for my very late answer again
Hi @dsuch ,
I am trying to access the client cert so that I can load it and verify a few things. I know that gunicorn can do basic handshake verification of the signing authority on the client cert, but I want to open the cert inside my Python app. My setup is a gunicorn server + Flask app.
Let me know if you have an idea.
Thanks, Pankit
Hi @pankit - I have not used this part of gunicorn in quite a while, but essentially, when you accept an HTTP request, there should be a WSGI key called gunicorn.socket or something similar (sorry, I really forget).
This key points to the TLS socket that accepted the connection, i.e. the object returned by ssl.wrap_socket, so you can call .getpeercert() on it to obtain the client certificate.
That's how it worked the last time I used it, which was some two years ago, around September 2013.
Are you pointing to gunicorn.sock?
@pankit - I honestly can't recall its name. I remember the principle but not any exact details, it was two years ago. Thanks.
Thanks! Not a problem. I'll try to find it. But ultimately you were able to make it work, right? Were you also using a Flask app?
Yes, I was able to make it work without problems. Ultimately, for unrelated reasons, I went a different route, but I even found a piece of code that read TLS certificates from client connections.
I wasn't using Flask but it doesn't matter. The certificate is part of the WSGI environment, it's just a dictionary, no matter the framework.
Just putting this here for anyone who has a similar doubt in the future. The way I got hold of the socket object is the same as pointed out by @dsuch. I was using Flask + gunicorn. Here is the code snippet:
@app.before_request
def checkssl():
    sock = request.environ['gunicorn.socket']
    cert = sock.getpeercert(binary_form=True)
    x509 = crypto.load_certificate(crypto.FILETYPE_ASN1, cert)
Now you can use the X509 object to validate whatever you need. :)
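As a small follow-up: if you call getpeercert() without binary_form=True you get a plain dictionary instead, which you can check without pyOpenSSL. A minimal sketch (summarize_peercert is my own helper, and the cert dict below is a made-up example of the stdlib's documented format, not a real certificate):

```python
import ssl
import time

def summarize_peercert(cert):
    """cert: the dict returned by SSLSocket.getpeercert() (binary_form=False).
    Returns (commonName, expired) for simple checks."""
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    expired = time.time() > not_after
    # 'subject' is a tuple of RDNs, each a tuple of (name, value) pairs.
    subject = dict(pair for rdn in cert["subject"] for pair in rdn)
    return subject.get("commonName"), expired

# Made-up example in the stdlib's documented format:
example = {
    "subject": ((("commonName", "client.example.com"),),),
    "notAfter": "Jun 1 12:00:00 2099 GMT",
}
print(summarize_peercert(example))  # ('client.example.com', False)
```

This is enough for expiry and subject checks; anything involving signatures or extensions still needs the binary form plus pyOpenSSL (or the cryptography package).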
Hello,
I'm working on a feature to make it possible to easily access client certificates in gunicorn-based servers but there is one thing I believe is a bug that was not unearthed before.
In GeventWorker.run, when creating ssl_args, there is an explicit do_handshake_on_connect=False assignment.
However, a couple of lines later the socket is wrapped with those arguments.
They can't really be used together: do_handshake_on_connect=False means that your socket is non-blocking and that you will call do_handshake on the socket later on.
http://docs.python.org/2.7/library/ssl.html#ssl.wrap_socket
But this is a blocking socket and do_handshake is never called.
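In pseudocode, the contradiction looks roughly like this (a paraphrase of my reading of the code, not the actual gunicorn source):

```
ssl_args includes do_handshake_on_connect=False
    # promise: "we will call do_handshake() ourselves later"
client = wrap_socket(sock, **ssl_args)
    # but sock is blocking and do_handshake() is never called,
    # so the handshake only happens implicitly on the first read/write,
    # and getpeercert() returns nothing until then
```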
So, when in the feature I'm working on I make a call to sock.getpeercert(True), the certificate is always None, as reported in the Python tracker:
http://bugs.python.org/issue19095
Hence I propose doing away with explicitly assigning do_handshake_on_connect=False - it was probably carried over from some other code originally and should never have been used in this place, if I'm not mistaken?
Many thanks!