[SB] `The connection was inactive for more than the allowed 240000 milliseconds`

macieyng commented 1 year ago

Package Name: azure-servicebus
Package Version: 7.11.1
Operating System: macOS/ubuntu
Python Version: 3.8, 3.9, 3.10

Describe the bug After resolving #29961 we've updated servicebus to 7.11.1 and now we're not seeing END frame errors like we used to, but instead we're seeing a lot of Connection closed with error: [b'amqp:connection:forced', b"The connection was inactive for more than the allowed 240000 milliseconds and is closed by container 'de9c0655b1684a31b24a94d4a733ed5e_G4S2'.", None] errors. That definitely is coming from https://github.com/Azure/azure-sdk-for-python/blob/ee85ddd394052efaaa35352e2567b39b6d2a215e/sdk/servicebus/azure-servicebus/azure/servicebus/_pyamqp/_connection.py#L478-L485.

I'm not sure whether this should be considered an error message, but I don't really understand how this amqp protocol works.

I want to be clear about one thing that may have something to do with it: we create an instance of ServiceBusClient and use it outside of context manager, and the same for topic or queue senders. The reason for this approach is performance - we prefer connection being closed due to inactivity over manual connection closing (like with calling close or context manager). This seems reasonable to us, because other solution would be to initialize service bus worker thread that would monitor connection time and send messages asynchronously.

Most errors coming from 3rd party libraries I treat as something I should probably take care of or there is an issue I should resolve somehow. In this case, I don't think I could do anything more, so on one hand I would silence this logger, but other error may occur that I wouldn't know about.

I would be very grateful for your support 🙏

To Reproduce Steps to reproduce the behavior:

Probably leaving a connection hanging with no traffic for more than 240000 milliseconds

Expected behavior Dunno, maybe this shouldn't be considered error? Just a warning that service was disconnected would be sufficient.

Screenshots If applicable, add screenshots to help explain your problem.

Additional context We talked with @kashifkhan in #29961. https://github.com/Azure/azure-sdk-for-python/issues/29961#issuecomment-1638427973 https://github.com/Azure/azure-sdk-for-python/issues/29961#issuecomment-1638588649 https://github.com/Azure/azure-sdk-for-python/issues/29961#issuecomment-1640953328 https://github.com/Azure/azure-sdk-for-python/issues/29961#issuecomment-1642070107

kashifkhan commented 1 year ago

Thanks for opening a new issue @macieyng . Ill check with the team and update.

In the meantime is it normal in your set up to not receive messages from the queue ( or subscription ) for 30 seconds ? Largely this error can be ignored as the client should retry connecting to the service and resume properly. If its happening a lot and there is supposed to be a steady stream of messages, it can indicate some network issues.

macieyng commented 1 year ago

Yes. This is this normal. We process orders from one time zone, so during night time there's little to no traffic.

Right now, this is something we observe on dev and test environments, where traffic is generated only by our QA engineers, but I expect this to appear on our production once this is released. On production I would assume that there might be no traffic for 15 to 30 minutes on certain queues during night time.

kashifkhan commented 1 year ago

Thank you for the update @macieyng . We will be working on this issue during this sprint and will update this issue with the resolution

macieyng commented 1 year ago

I think that we won't be able to provide more feedback, because we observed a 2x slowdown with pyamqp and we decided to use uamqp for as long as it will be possible on all three of our environments.

Also I would like to add that recommended way of using context manager for creating service bus client and then new context for service bus sender also adds latency big time (up to 4x).

kashifkhan commented 1 year ago

@macieyng We will be updating the error messages to show up only during DEBUG and will be releasing it in the upcoming release.

I'm very interested in resolving the performance issues you are experiencing as we run performance tests nightly to make sure pyamqp is running well. Infact for service bus ( and event hub before it) the goal was to have pyamqp be at par if not faster than uamqp, and overall pyamqp was much faster.

Our goal is pyamqp for Event Hubs and Service Bus and usage of uaqmp only as an emergency mitigation. If you are willing, I can share my email address and you can send me details about your environment, set up etc. so we can try to reproduce the issue on our end.

ghost commented 10 months ago

I want to add to the comment above:

Also I would like to add that recommended way of using context manager for creating service bus client and then new context for service bus sender also adds latency big time (up to 4x).

We run async Eventhub and ServiceBus clients with pyamqp and opening connections for each publish message is vastly slower than keeping those connections open permanently (until socket closed by server or app restarts). In our tracing spans, we can see that opening connections to ServiceBus queues or Eventhubs regularly takes 600ms-1s! As a result, we are forced to do the same as stated above.

We have discussed building our own connection pooling libraries for these connections so that we can set ttl on each connection and pre-empt the socket close from the server, but the documentation from Azure is indefinite on the idle time allowed.

One alternative is to send a heartbeat dummy message every so often, but that's probably wasteful for apps with longer periods of inactivity.

kashifkhan commented 10 months ago

Hi @erewokMF, this particular issue is for changing the logging level of that message. I do agree that we have to update the samples a bit to show that the EventHub and ServiceBus clients can be used without the context managers and require the user to manually close the connection.

With the services in general, they have a time limit after which they drop open connections - usually when no new links or transfers have happened over a period of time. The connections over AMQP are a highly stateful, so we cant hot swap to another Connection as you would have do re-do some authentication, sessions, links etc.

The preferred method is the keep alive option and we are working to add in keep alive for the senders, where there are scenarios of inactivity.

ghost commented 10 months ago

Got it. I landed here because we would like to see the loglevel changed for this message too (failed to mention that).

swathipil commented 8 months ago

Hi @macieyng - Sorry about the delay here and thanks for your patience! The error level log has since been updated to warning in this PR: https://github.com/Azure/azure-sdk-for-python/pull/34504

The fix was released in azure-servicebus v7.12.0.

github-actions[bot] commented 8 months ago

Hi @macieyng. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

github-actions[bot] commented 8 months ago

Hi @macieyng, since you haven’t asked that we /unresolve the issue, we’ll close this out. If you believe further discussion is needed, please add a comment /unresolve to reopen the issue.

Azure / azure-sdk-for-python

[SB] `The connection was inactive for more than the allowed 240000 milliseconds` #31229