Don't use do_handshake_on_connect=False for SSL sockets

dsuch commented 11 years ago

Hello,

I'm working on a feature to make it possible to easily access client certificates in gunicorn-based servers but there is one thing I believe is a bug that was not unearthed before.

In GeventWorker.run, when creating ssl_args, there is explicit

do_handshake_on_connect=False

However, a couple of lines later there is

for s in self.sockets:
    s.setblocking(1)

They can't really be used together. do_handshake_on_connect=False means that your socket is non-blocking and you will call do_handshake on a socket later on.

http://docs.python.org/2.7/library/ssl.html#ssl.wrap_socket

But this is a blocking socket and do_handshake is never called.

So, when in the feature I'm working on I make a call to sock.getpeercert(True), the certificate is always None, as reported in Python tracker

http://bugs.python.org/issue19095

Hence I propose doing anyway with explicitly assigning do_handshake_on_connect=False - it was probably part of some other code originally and should never be used in this place, if I'm not mistaken?

Many thanks!

malthe commented 11 years ago

I'm not sure if this is the same situation, but on Amazon EC2, I run into SSL handshake issues with their Health Check system:

Traceback (most recent call last):
  File ".../gunicorn/workers/sync.py", line 92, in handle
    req = six.next(parser)
  File ".../gunicorn/http/parser.py", line 39, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
  File ".../gunicorn/http/message.py", line 152, in __init__
    super(Request, self).__init__(cfg, unreader)
  File ".../gunicorn/http/message.py", line 49, in __init__
    unused = self.parse(self.unreader)
  File ".../gunicorn/http/message.py", line 164, in parse
    self.get_data(unreader, buf, stop=True)
  File ".../gunicorn/http/message.py", line 155, in get_data
    data = unreader.read()
  File ".../gunicorn/http/unreader.py", line 38, in read
    d = self.chunk()
  File ".../gunicorn/http/unreader.py", line 65, in chunk
    return self.sock.recv(self.mxchunk)
  File "/usr/lib/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
SSLError: [Errno 1] _ssl.c:1413: error:140780E5:SSL routines:SSL23_READ:ssl handshake failure

This seems to be mitigated when I set do_handshake_on_connect=False.

However, I still get these:

2013-10-21 12:33:00 [29112] [CRITICAL] WORKER TIMEOUT (pid:29212)
2013-10-21 12:33:00,197 gunicorn.error [CRITICAL]  glogging.py 204 WORKER TIMEOUT (pid:29212)

Note that while I see these errors in the log files, in practice, the system runs fine – which is a little odd.

dsuch commented 11 years ago

Hi @malthe - you say it seems to go away after setting 'do_handshake_on_connect=False' which is a default value.

I'm not sure if it means you first deleted this argument like I suggested, then noticed an issue so you reverted it to the original value?

malthe commented 11 years ago

In the sources, this value is set to False, but the default value is True. I removed the keyword argument from my local gunicorn source.

dsuch commented 11 years ago

I removed the keyword argument from my local gunicorn source.

OK, I get it - so you removed the argument, the SSL handshake failure appeared and then you set it back to False, as gunicorn had it originally. Is that correct?

I'm wondering if you did any other changes perhaps, like requiring client certificates? (Probably not as you would've mentioned it)

malthe commented 11 years ago

No, it's the other way around :-o

If I leave the option set as gunicorn has it by default (which is False for no handshake on connect), then I get the SSL handshake error.

But If I change it to True (which is the same as leaving the argument out), then the error goes away. In other words, the default behavior causes problems for me.

Whether this is related to the fact that gunicorn routinely kills idle workers even when a constant stream of requests (one per second) is coming in, I have not been able to determine. This seems like a bug to me. Either some of the workers are not being given any work and therefore become idle, or there is a problem with the way their idleness is checked:

if time.time() - worker.tmp.last_update() <= self.timeout:
    ...

I also see this entry in the logging output – and it seems to be delayed compared to the request/response flow:

2013-10-21 13:24:09 [30916] [DEBUG] ssl connection closed

It's almost as if the connection is being kept open for longer than it needs to; and this is what causes the worker to time out, too.

dsuch commented 11 years ago

OK, @malthe so I think this isn't something that I originally opened this ticket for because I had client certificates in mind.

That said, '[DEBUG] ssl connection closed' is a gunicorn message which happens when Python's ssl module raises SSL_ERROR_EOF/PY_SSL_ERROR_EOF error and that in turn has a textual representation of 'EOF occurred in violation of protocol' which seems a fairly popular message on StackOverflow and elsewhere.

This is a generic 'something went wrong' message but maybe you can have a look at the suggestions on the web?

malthe commented 11 years ago

Definitely. I'll try and have a look at that. It's kind of odd though that the client doesn't complain; it gets the right response. It's gunicorn which is unhappy with what happens on the socket.

dsuch commented 11 years ago

It's kind of odd though that the client doesn't complain; it gets the right response. It's gunicorn which is unhappy with what happens on the socket.

You mention an Amazon's service so I'm not sure if you have access to client's logs?

What could be happening is that the client OS was forcibly closing connections after some inactivity. If you notice that it always happens say, after 180 seconds from the initial connection, this would probably confirm it.

malthe commented 11 years ago

I don't have access to the logs on that service (elastic load balancer). It seems that the timeout happens 10 seconds after the GET request:

2013-10-21 14:45:58,776 gunicorn.access [INFO]  __init__.py 72 10.0.1.214 - - [21/Oct/2013:14:45:58] "GET / HTTP/1.1" 200 26 "-" "ELB-HealthChecker/1.0" ""
2013-10-21 14:46:09,492 gunicorn.error [DEBUG]  glogging.py 216 ssl connection closed

And for each request, I get these two about three seconds later:

2013-10-21 14:46:01 [30634] [CRITICAL] WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01,659 gunicorn.error [CRITICAL]  glogging.py 204 WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01 [30634] [CRITICAL] WORKER TIMEOUT (pid:31468)
2013-10-21 14:46:01,694 gunicorn.error [CRITICAL]  glogging.py 204 WORKER TIMEOUT (pid:31468)

It then boots a new worker.

But why is it a critical error that a worker timed out?

Should I open a new bug / issue for that handshake boolean parameter? I don't understand why it's necessary, but it seems that it fixes a part of the handshake process in this particular environment.

dsuch commented 11 years ago

@malthe - I'm not the author of gunicorn so I don't know all its inner details but I picture a scenario like that

An LB's health-check is performed. This is by its nature a short-lived transaction, I don't know EC2 but I imagine it's a very simple ping
EC2 closes the session incorrectly in violation of the TLS protocol. How and why is something we don't know.
On Python end this is SSL_ERROR_EOF
At this point a gunicorn worker is stuck because /perhaps/ it doesn't realize the other end is gone so it's waiting to receive more data from the socket and it's spinning in vein instead of accepting new connections
A few moments later the arbiter notices a worker is stuck so the worker is restarted

Now, why would setting do_handshake_on_connect to True help in your case is something I don't know but I do know that if do_handshake_on_connect is False then a non-blocking socket itself should perform the handshake. gunicorn doesn't do it by default so it seems natural that there's this error you initially spotted.

As for why the error is critical - we'd have to wait for someone from gunicorn dev team to say. Ditto for opening a new ticket, I'd say there's no need because this one already exists but I'm not a project's maintainer.

dsuch commented 10 years ago

OK, I'm starting a project that will make use of what we discuss here. The project will also make use of the additional options regarding TLS I added in the fork over here

https://github.com/dsuch/gunicorn/commit/16f9974

The go-live is within 2-3 months from now but I need it already during development and tests so I'll just starting using it all without a full Gunicorn release.

benoitc commented 10 years ago

Sorry got side tracked. @dsuch can you make a PR, it will help to handle such changes. Also if you have any way to test it it would be appreciated a lot :)

benoitc commented 10 years ago

applied myself to the master in 5fb61cb841068881f65e8fa2f750596cbaf2a48f . Thanks for the patch.

malthe commented 10 years ago

Just as a datapoint, on AWS with Elastic Load Balancer, Gunicorn's sync worker still does not work us – even with --do-handshake-on-connect.

There are two issues:

Stability; connections seem to hang, waiting indefinitely to read more data from the SSL socket.
Exceptions raised every so often:

SSLError: [Errno 1] _ssl.c:1413: error:140940E5:SSL routines:SSL3_READ_BYTES:ssl handshake failure

With --do-handshake-on-connect, (2) seems to go away, or at least become less prevalent. But (1) persists.

The fix for us was to simply use the eventlet worker. While we get the exceptions in (2) when requests come from Aamzon's ELB, they don't cause problems.

benoitc commented 9 years ago

@malthe what is the ELB configuration?

malthe commented 9 years ago

@benoitc what part of it do you mean specifically? Also, note that we eventually decided on uwsgi (because of this issue mostly) and no longer use gunicorn so I am not really able to follow up much on this issue.

benoitc commented 9 years ago

@malthe I didn't get the notification on the closed issue, sorry for that. (and a good reason to tell in the contributing doc to only link to closed issue but opening a new one).

About waiting to read more on the SSL sockets, did you configured keepalive? was connection draining enabled?Using the eventlet client to solve the issue could mean indeed that was a keepalive issue.

malthe commented 9 years ago

It might simply be a keep-alive timeout mismatch that causes gunicorn to slowly drain on available connections such that eventually, no new connections can be made.

benoitc commented 9 years ago

OK, thanks anyway and sorry for my very late answer again

On Wed, May 20, 2015 at 2:31 PM Malthe Borch notifications@github.com wrote:

It might simply be a keep-alive timeout mismatch that causes gunicorn to slowly drain on available connections such that eventually, no new connections can be made.

— Reply to this email directly or view it on GitHub https://github.com/benoitc/gunicorn/issues/617#issuecomment-103864370.

pankit commented 9 years ago

Hi @dsuch ,

I am trying to access the client cert so that I can load it and verify a few things. I know that gunicorn can do basic handshake stuff by verifying the signing authority on the client cert but I want to open it inside my python app. My set up is gunicorn server + flask app

Let me know if you have an idea.

Thanks, Pankit

dsuch commented 9 years ago

Hi @pankit - I have not used this part of gunicorn in quite a while but essentially, when you accept an HTTP request, there should be a WSGI key called gunicorn.socket or in a similar manner (sorry, I really forget).

This key points to the TLS socket that accepted the connection, i.e. this is the object that is returned by ssl.wrap_socket so you can call .getpeercert() on it to obtain the client certificate.

That's how it worked like the last time I used it which was some two years ago around September 2013.

pankit commented 9 years ago

Are you pointing to gunicorn.sock?

dsuch commented 9 years ago

@pankit - I honestly can't recall its name. I remember the principle but not any exact details, it was two years ago. Thanks.

pankit commented 9 years ago

Thanks! Not a problem. I l try to find it. But ultimately you were able to make it work. right? Were you also using flask app?

dsuch commented 9 years ago

Yes, I was able to make it work without problems. Ultimately for reasons unrelated I went a different route but I even found a piece of code that read TLS certificates from client connections.

I wasn't using Flask but it doesn't matter. The certificate is part of the WSGI environment, it's just a dictionary, no matter the framework.

https://github.com/zatosource/zato/blob/3a804fecaeb331f93c1393db2024122cfa3eaafe/code/zato-server/src/zato/server/base/parallel.py#L126

pankit commented 9 years ago

Just putting for someone who has similar doubt in future. The way I got hold of the socket object is the same as pointed out by @dsuch I was using flask + gunicorn Here is code snippet @app.before_request def checkssl(): sock = request.environ['gunicorn.socket'] cert = sock.getpeercert(binary_form=True) x509 = crypto.load_certificate(crypto.FILETYPE_ASN1, cert)

Now you can use X509 object to validate whatever you need. :)

benoitc / gunicorn

Don't use do_handshake_on_connect=False for SSL sockets #617