Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

ServiceBusConnectionError after running idle #36376

Open bosegeorge1212 opened 1 week ago

bosegeorge1212 commented 1 week ago

Describe the bug
We are getting the following error while receiving messages from Service Bus: azure.servicebus.exceptions.ServiceBusConnectionError: Cannot read frame due to exception: [Errno 104] Connection reset by peer. Error condition: amqp:socket-error.

We observe this behavior when there are no messages in the subscription and the app is left running.

To Reproduce
There are no exact steps to reproduce this issue; it happens inconsistently. We observe it when there are no messages in the subscription and the app has been running for a long duration.

Expected behavior
The SDK should retry with some delay.

Logs

2024-07-02 17:57:17,088 - azure.servicebus._pyamqp.aio._transport_async - DEBUG - Transport read failed: ConnectionResetError(104, 'Connection reset by peer')
2024-07-02 17:57:17,088 - azure.servicebus._pyamqp.aio._cbs_async - DEBUG - CBS status check: state == <CbsAuthState.OK: 0>, expired == False, refresh required == False
2024-07-02 17:57:17,088 - azure.servicebus.aio._base_handler_async - INFO - AMQP error occurred: (AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')), condition: (<ErrorCondition.SocketError: b'amqp:socket-error'>), description: ('Can not read frame due to exception: [Errno 104] Connection reset by peer').
2024-07-02 17:57:17,088 - azure.servicebus._pyamqp.aio._link_async - DEBUG - -> DetachFrame(handle=3, closed=True, error=None)
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - An error occurred when detaching the link: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - DEBUG - -> DetachFrame(handle=2, closed=True, error=None)
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - An error occurred when detaching the link: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._management_link_async - INFO - Management link receiver state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - DEBUG - -> DetachFrame(handle=1, closed=True, error=None)
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - An error occurred when detaching the link: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._link_async - INFO - Link state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._management_link_async - INFO - Management link sender state changed: <LinkState.ATTACHED: 3> -> <LinkState.DETACHED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._session_async - DEBUG - -> EndFrame(error=None)
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._session_async - INFO - An error occurred when ending the session: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._session_async - INFO - Session state changed: <SessionState.MAPPED: 3> -> <SessionState.UNMAPPED: 0>
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._connection_async - DEBUG - -> CloseFrame(error=None)
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._connection_async - INFO - An error occurred when closing the connection: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')
2024-07-02 17:57:17,089 - azure.servicebus._pyamqp.aio._connection_async - INFO - Connection state changed: <ConnectionState.OPENED: 9> -> <ConnectionState.END: 13>
2024-07-02 17:57:17,883 - azure.servicebus.aio._base_handler_async - INFO - 'servicebus.pysdk-22e5f6be' operation has exhausted retry. Last exception: ServiceBusConnectionError('Can not read frame due to exception: [Errno 104] Connection reset by peer Error condition: amqp:socket-error.').

This is the code snippet inside our package:

from typing import Callable, Dict

from azure.servicebus.aio import ServiceBusClient


class SubscriptionListener:
    def __init__(self, connection_string: str, topic_name: str, subscription_name: str, max_concurrent_tasks: int = 5):
        self._connection_string = connection_string
        self._topic_name = topic_name
        self._subscription_name = subscription_name
        self._max_concurrent_tasks = max_concurrent_tasks
        self._message_handlers: Dict[type, Callable] = {}

    def add_message_handler(self, message_type: type, handler: Callable):
        self._message_handlers[message_type] = handler

    async def receive_messages(self):
        async with ServiceBusClient.from_connection_string(self._connection_string, logging_enable=True) as client:
            async with client.get_subscription_receiver(
                topic_name=self._topic_name, subscription_name=self._subscription_name
            ) as receiver:
                async for message in receiver:
                    # Dispatches to the registered handler and settles the
                    # message (implementation not shown in this snippet).
                    await self._process_and_complete_message(receiver, message)

And this is the code snippet inside our API that invokes the above:

service_bus_receiver = SubscriptionListener(conn_str, topic_name, subscription_name)
service_bus_receiver.add_message_handler(ExtractionMessage, extraction_message_handler.handle)
await service_bus_receiver.receive_messages()
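To get the retry-with-delay behavior the issue asks for today, one option is to wrap the receive loop in an outer reconnect loop so a fresh client and receiver are built after the SDK's internal retries are exhausted. This is a minimal sketch, not SDK behavior: receive_forever and reconnect_delay are hypothetical names, and the retryable exception types (e.g. ServiceBusConnectionError) are passed in so the sketch stands alone.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def receive_forever(listener, retryable_exceptions, reconnect_delay=5.0):
    """Keep the listener's receive loop alive across connection drops.

    `retryable_exceptions` would be (ServiceBusConnectionError,) in the
    scenario above; it is kept generic here so the sketch is self-contained.
    """
    while True:
        try:
            await listener.receive_messages()
            return  # receive_messages returned normally (e.g. shutdown)
        except retryable_exceptions as exc:
            # The SDK's internal retries were exhausted; back off briefly,
            # then loop to rebuild the client and receiver from scratch.
            logger.warning("connection lost (%s); reconnecting in %ss", exc, reconnect_delay)
            await asyncio.sleep(reconnect_delay)
```

Because SubscriptionListener.receive_messages creates the client inside the method, each loop iteration opens a brand-new AMQP connection rather than reusing the dropped one.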
github-actions[bot] commented 1 week ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

kashifkhan commented 1 week ago

Hi @bosegeorge1212 ,

I see this line in your logs, which appears after retries have been exhausted:

2024-07-02 17:57:17,883 - azure.servicebus.aio._base_handler_async - INFO - 'servicebus.pysdk-22e5f6be' operation has exhausted retry. Last exception: ServiceBusConnectionError('Can not read frame due to exception: [Errno 104] Connection reset by peer Error condition: amqp:socket-error.').

It seems like the connection is dropping. Could you provide debug logs with frame tracing turned on?

Passing logging_enable=True is also key here, so we can see the exchange of information between the client and the service:

import logging
import sys

handler = logging.StreamHandler(stream=sys.stdout)
logger = logging.getLogger('azure.servicebus')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

...

from azure.servicebus import ServiceBusClient

client = ServiceBusClient(..., logging_enable=True)
bosegeorge1212 commented 1 week ago

Hi @kashifkhan

I have already added it, and the above logs were captured with it. Thanks in advance.

logger.setLevel(logging.DEBUG)

ServiceBusClient.from_connection_string(conn_str=connection_string,  logging_enable=True)
swathipil commented 1 week ago

Hi @bosegeorge1212 - It looks like all retries have been exhausted as per:

2024-07-02 17:57:17,883 - azure.servicebus.aio._base_handler_async - INFO - 'servicebus.pysdk-22e5f6be' operation has exhausted retry. Last exception: ServiceBusConnectionError('Can not read frame due to exception: [Errno 104] Connection reset by peer Error condition: amqp:socket-error.').

The last retry raised an exception due to connection drop as @kashifkhan mentioned, which is expected behavior. By default, the operation is retried 3 times. The first couple of retries are not included in the log snippet above. To confirm that all retries failed due to connection dropping, would you be able to provide logs beginning from before this exception started occurring?

If you'd like to modify the default retry behavior, you can adjust the retry configuration options (retry_total, retry_backoff_factor, retry_backoff_max, retry_mode) when creating the client.
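For reference, those retry options are passed as keyword arguments when constructing the client. The values below are illustrative, not recommendations; check the current SDK documentation for exact defaults:

```python
from azure.servicebus.aio import ServiceBusClient

# Illustrative values only; defaults at time of writing are
# retry_total=3 with exponential backoff.
client = ServiceBusClient.from_connection_string(
    conn_str="<connection-string>",
    retry_total=5,              # retries after the initial attempt
    retry_backoff_factor=1.0,   # base delay (seconds) between retries
    retry_backoff_max=60,       # cap on the backoff delay (seconds)
    retry_mode="exponential",   # or "fixed"
    logging_enable=True,
)
```

Note that these settings only govern the SDK's internal retry of a single operation; once retries are exhausted, the exception surfaces to the caller as seen in the logs above.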

github-actions[bot] commented 1 week ago

Hi @bosegeorge1212. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

bosegeorge1212 commented 1 week ago

@swathipil

Is there any way to identify the root cause of the connection loss? It appears to be a recurring pattern where the connection is lost every 10-15 hours of no messages.

We are hosting the application in Azure Container Apps and utilizing a VNET as well.

swathipil commented 1 week ago

@bosegeorge1212 ServiceBusConnectionErrors are caused by transient network issues or service problems. However, this error is not returned by the service, as the service would include more information such as a tracking ID or a message to retry. This error is raised by the OS to indicate that the connection dropped. Given that the client was unable to reconnect on retry, it seems the connection was down for a while.

Is there other activity happening on your VNET (heavier than usual) that could be causing issues?

In order to provide more assistance, we would also need the full logs including all retries.


bosegeorge1212 commented 1 week ago

logs_3_7_2024.xlsx

bosegeorge1212 commented 1 week ago

@swathipil I have added the logs and highlighted the start and end of the error in yellow.

Is there any other activity happening on your VNET (heavier than usual) that could be causing issues? -> Nothing as far as we know.

swathipil commented 2 days ago

Thanks @bosegeorge1212. We'll investigate and get back to you asap.