celery / py-amqp

amqplib fork
Other
308 stars 194 forks source link

SSL protocol is shutdown #325

Open yuvadm opened 4 years ago

yuvadm commented 4 years ago

I'm running a celery worker which uses a RabbitMQ broker via an SSL connection. From time to time we're seeing an error in our logs

SSLError(1, '[SSL: PROTOCOL_IS_SHUTDOWN] protocol is shutdown (_ssl.c:2162)')

Which is coming from this line:

https://github.com/celery/py-amqp/blob/master/amqp/transport.py#L430

My best understanding is that amqp is attempting to write to a socket that has previously been closed, is it possible to know which side terminate the socket? Can there be any better error handling in this case? Or any possibility to recover from the error?

thedrow commented 4 years ago

Can you please attach the RabbitMQ logs from the time when the exception happened?

yuvadm commented 4 years ago

Here are some RabbitMQ logs surrounding the exact time we got an SSLError:

2020-06-28 13:04:29.152 [warning] <0.37.186> closing AMQP connection <0.37.186> (10.154.0.40:44566 -> 10.54.3.7:5671, vhost: 'our-pod-name', user: 'our-pod-name'):
client unexpectedly closed TCP connection
2020-06-28 13:04:29.188 [warning] <0.742.186> closing AMQP connection <0.742.186> (10.154.0.42:44640 -> 10.54.3.7:5671, vhost: 'our-pod-name', user: 'our-pod-name'):
client unexpectedly closed TCP connection

####### SSLError came up on Celery logs at 13:04:45 #########

2020-06-28 13:05:32.387 [info] <0.10410.186> accepting AMQP connection <0.10410.186> (10.154.0.87:43848 -> 10.54.3.7:5671)
2020-06-28 13:05:32.407 [info] <0.10410.186> connection <0.10410.186> (10.154.0.87:43848 -> 10.54.3.7:5671): user 'our-pod-name' authenticated and granted access to vhost 'our-pod-name'
2020-06-28 13:05:35.533 [info] <0.10423.186> accepting AMQP connection <0.10423.186> (10.154.0.87:43850 -> 10.54.3.7:5671)
2020-06-28 13:05:35.552 [info] <0.10423.186> connection <0.10423.186> (10.154.0.87:43850 -> 10.54.3.7:5671): user 'our-pod-name' authenticated and granted access to vhost 'our-pod-name'
2020-06-28 13:05:35.660 [info] <0.10456.186> accepting AMQP connection <0.10456.186> (10.154.0.40:43852 -> 10.54.3.7:5671)
2020-06-28 13:05:35.680 [info] <0.10456.186> connection <0.10456.186> (10.154.0.40:43852 -> 10.54.3.7:5671): user 'our-pod-name' authenticated and granted access to vhost 'our-pod-name'
2020-06-28 13:05:35.771 [info] <0.10465.186> accepting AMQP connection <0.10465.186> (10.154.0.56:58578 -> 10.54.3.7:5671)
yuvadm commented 4 years ago

Some more details on our setup, the Celery workers are connecting to Rabbit on a separate k8s cluster, with a NAT instance sitting in the middle.

Our current working theory is that the NAT instance is doing something to some of the connections, and Celery/Kombu/pyamqp seems to be angry about it.

We'll continue to debug the exact root cause, but curious if we can get the Celery stack to be more graceful about these dropped connections.

thedrow commented 4 years ago

Since I don't really know what this error is and I can't find a description of it anywhere, I'm not sure how to help.

thedrow commented 4 years ago

This kind of error is raised when the SSL wrapper tries to write to a closed TCP connection. It is usually related to connection problems, firewalls or SSL interception.

I asked on StackOverflow. That's the answer I got.

yuvadm commented 4 years ago

@thedrow Thanks for helping us debug this error!

We're still trying to debug where our connections are being dropped and fix the root cause.

Eventually it would be nice to fix this behavior to handle drops more gracefully, and I'd be happy to help propose a patch for that.

Kristin0 commented 2 years ago

@yuvadm did you find something ?

yuvadm commented 2 years ago

@Kristin0 Didn't really continue working on this issue, sorry

thedrow commented 2 years ago

@Kristin0 If you have this problem. Please share all the details as well.