Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

The connection was inactive for more than the allowed 240000 milliseconds and is closed by container #9917

Closed luann-trend closed 4 years ago

luann-trend commented 4 years ago

Describe the bug
Following the example for consuming events from an Event Hub asynchronously, with the consumer streaming client running against an Event Hub that receives few incoming events, the connection gets disconnected by the server without any exception or error thrown back to the client, so the client has no chance to reconnect.

To Reproduce
Steps to reproduce the behavior:

  1. Set up an Event Hub with low incoming event volume (it may be idle for more than 4 minutes with no events).
  2. Start the streaming client from the sample code "Consume events from an Event Hub" at https://github.com/Azure/azure-sdk-for-python/tree/master/sdk/eventhub/azure-eventhub#consume-events-from-an-event-hub (a minimal sketch of that consumer is shown below).
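
A minimal sketch of that asynchronous consumer, based on the linked sample (the connection string and hub name are placeholders, not real values):

import asyncio
from azure.eventhub.aio import EventHubConsumerClient

CONNECTION_STR = "<event-hub-namespace-connection-string>"  # placeholder
EVENTHUB_NAME = "<event-hub-name>"                          # placeholder

async def on_event(partition_context, event):
    # Process the event, then checkpoint so a restart resumes from here.
    print("Received event from partition:", partition_context.partition_id)
    await partition_context.update_checkpoint(event)

async def main():
    client = EventHubConsumerClient.from_connection_string(
        CONNECTION_STR,
        consumer_group="$Default",
        eventhub_name=EVENTHUB_NAME,
    )
    async with client:
        # Blocks and keeps receiving; leave it idle for > 4 minutes to reproduce.
        await client.receive(on_event=on_event, starting_position="-1")

if __name__ == "__main__":
    asyncio.run(main())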

Expected behavior
The connection stays alive for the streaming client.

Screenshots
Logs:

I 2020-02-19T21:08:30.073272323Z Connection: b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16'

I 2020-02-19T21:08:30.073291092Z [2020-02-19 21:08:30,073 receiver::on_state_changed INFO] Message receiver b'receiver-link-23dee092-1a1d-4483-bc04-d63cdd126ef5' state changed from <MessageReceiverState.Open: 2> to <MessageReceiverState.Error: 4> on connection: b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16'

I 2020-02-19T21:08:30.073609170Z [2020-02-19 21:08:30,073 connection::_state_changed INFO] Connection b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16' state changed from <ConnectionState.OPENED: 9> to <ConnectionState.CLOSE_RCVD: 10>

I 2020-02-19T21:08:30.075900631Z [2020-02-19 21:08:30,075 connection::_state_changed INFO] Connection b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16' state changed from <ConnectionState.CLOSE_RCVD: 10> to <ConnectionState.END: 13>

I 2020-02-19T21:08:30.075958832Z [2020-02-19 21:08:30,075 connection::_close_received INFO] Received Connection close event: b'amqp:connection:forced'

I 2020-02-19T21:08:30.075979621Z Connection: b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16'

I 2020-02-19T21:08:30.075986333Z Description: b"The connection was inactive for more than the allowed 240000 milliseconds and is closed by container '0e8f353073374ab9818f9759b9e33417_G16'."

I 2020-02-19T21:08:30.075992159Z Details: None

I 2020-02-19T21:08:30.076187828Z [2020-02-19 21:08:30,076 connection_async::work_async INFO] b'SSL channel closed in decode_ssl_received_bytes.' (b'/tmp/pip-install-m0c6yhjv/uamqp/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'decode_ssl_received_bytes':797)

I 2020-02-19T21:08:30.076227608Z [2020-02-19 21:08:30,076 connection_async::work_async INFO] b'Error callback received in unexpected state' (b'/tmp/pip-install-m0c6yhjv/uamqp/src/vendor/azure-uamqp-c/src/saslclientio.c':b'on_underlying_io_error':243)

I 2020-02-19T21:08:30.076360494Z [2020-02-19 21:08:30,076 connection_async::work_async INFO] b'Error in decode_ssl_received_bytes.' (b'/tmp/pip-install-m0c6yhjv/uamqp/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/adapters/tlsio_openssl.c':b'on_underlying_io_bytes_received':846)

I 2020-02-19T21:08:30.129672081Z [2020-02-19 21:08:30,129 receiver::get_state WARNING] LinkDetach("ErrorCodes.ConnectionCloseForced: The connection was inactive for more than the allowed 240000 milliseconds and is closed by container '0e8f353073374ab9818f9759b9e33417_G16'.")

I 2020-02-19T21:08:30.129744298Z [2020-02-19 21:08:30,129 client_async::close_async INFO] CBS session pending b'EHReceiver-e54defd1-79de-4906-b0a7-d14dd0ee7f19-partition16'.

Additional context

kaerm commented 4 years ago

@luann-trend thanks for letting us know, someone from our team will take a look at this //cc: @mayurid @yunhaoling @YijunXieMS

yunhaoling commented 4 years ago

Hi @luann-trend , thanks for bringing this up. We're investigating the issue.

Just out of curiosity, does this behavior affect your functionality? The ConsumerClient should be able to automatically reconnect to the service and resume receiving. Also, how are you dealing with the error that is thrown back to the client?

luann-trend commented 4 years ago

Yes, this behavior affects our functionality, since our application expects the streaming connection to never get disconnected. If the connection does get disconnected, consumerClient.receive should throw an exception so that we can re-establish the connection.

YijunXieMS commented 4 years ago

@luann-trend How often does this error happen to you? Our design is to keep the connection alive, and my environments don't hit this 240000 ms problem; I'll test with your environment. What container environment do you use? Besides, ConsumerClient.receive() is a high-level API: it calls the on_error callback and automatically reconnects when a connection problem happens, and it doesn't raise the connection error out.

luann-trend commented 4 years ago

I am using docker image python:3.7-alpine3.10

YijunXieMS commented 4 years ago

Is the OS Windows or Linux where you installed Docker? Or do you use AKS? I guess it's Linux; if so, what version of Linux?

yunhaoling commented 4 years ago

Hello @luann-trend ,

As @YijunXieMS mentioned, in our implementation the SDK is supposed to keep trying to reconnect internally whenever there's an error. From your observation, the SDK stops receiving after being inactive for 240 s and no error is raised out (assuming you have passed in an on_error callback, the callback is never called).

Is my statement correct?

luann-trend commented 4 years ago

@YijunXieMS Yes, it is Linux: Linux 4.19.76+ #1 SMP Tue Oct 8 23:17:06 PDT 2019 x86_64 Linux

@yunhaoling That is correct. The SDK stops sending logs to the consumer client, and there is no error or exception thrown. The consumer client passes in on_event, on_partition_initialize, on_partition_closed, and on_error callbacks.
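
(For reference, roughly how those callbacks are wired into the receive call; this assumes the v5 azure-eventhub API, where the keyword is on_partition_close rather than on_partition_closed.)

from azure.eventhub.aio import EventHubConsumerClient

async def on_event(partition_context, event):
    print("Event from partition:", partition_context.partition_id)
    await partition_context.update_checkpoint(event)

async def on_partition_initialize(partition_context):
    print("Partition initialized:", partition_context.partition_id)

async def on_partition_close(partition_context, reason):
    print("Partition closed:", partition_context.partition_id, reason)

async def on_error(partition_context, error):
    # partition_context is None when the error is not tied to a single partition.
    print("Error:", error)

async def run(client: EventHubConsumerClient):
    async with client:
        await client.receive(
            on_event=on_event,
            on_error=on_error,
            on_partition_initialize=on_partition_initialize,
            on_partition_close=on_partition_close,
            starting_position="-1",
        )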

yunhaoling commented 4 years ago

@luann-trend Thanks a lot for the extra information, it really helps. We'll keep you updated if we make any progress on the issue.

YijunXieMS commented 4 years ago

@luann-trend I don't have an Alpine Linux environment, but I just tested on Ubuntu with Docker (image python:3.7.6-buster). The 240000 ms connection-inactivity error didn't happen even though no events were received for a long time, which is by design. So the question is why it happened in your environment. One thing that can cause the problem is an operation that blocks the program so that there is no network activity between the SDK and the Event Hub, but this sample code shouldn't have that problem. I'm wondering whether it happens very often or only occasionally in your environment.

About on_error not being called: the SDK automatically rebuilds the connection when this exception happens and continues to receive events from where it left off, so on_error isn't called; instead, on_event is called when a new event is received. Some errors don't have this reconnect logic, and if retryable errors happen too many times, on_error is called, this client loses ownership of the partition, and after a while a client (this one or another) retakes the partition and starts receiving from it again.

So could you help run a test for at least a few hours? See how many times this error happens and whether new events are processed after it happens.
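
In case it helps with that test, here is a rough sketch of counters added to the two callbacks (the counter names and logging are illustrative, not part of the SDK):

import logging

events_received = 0
errors_seen = 0

async def on_event(partition_context, event):
    # If this keeps increasing after an error, receiving resumed as expected.
    global events_received
    events_received += 1
    logging.info("events=%d errors=%d", events_received, errors_seen)
    await partition_context.update_checkpoint(event)

async def on_error(partition_context, error):
    # Called when the SDK stops retrying (or for non-retryable errors);
    # count how often that actually happens.
    global errors_seen
    errors_seen += 1
    logging.warning(
        "on_error: partition=%s error=%r",
        partition_context.partition_id if partition_context else None,
        error,
    )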

luann-trend commented 4 years ago

We saw the problem a couple of times 2 weeks ago, but have not seen it again. I will post an update when I see it.

YijunXieMS commented 4 years ago

Thanks Luann, let us know if it happens again.

amitgadhavi7 commented 4 years ago

Hi Yijun,

I got the error below when I used the sync code to establish the connection using a client ID/secret. It ran fine, but after 1000 events I got this error:

LinkDetach("ErrorCodes.ConnectionCloseForced: The connection was inactive for more than the allowed 240000 milliseconds and is closed by container 'b86583903f7441ba8a60ee7041ab1b16_G6'.")

I thought the idea is to keep the connection open so it can receive whenever there is new data, so why am I getting this error? Can you please let us know?

Amit.

YijunXieMS commented 4 years ago
  1. Do you have a long operation that blocks the receiving for a long time?
  2. The server closes the connection if there are no requests to it. A long operation or other problems can cause this. But the client will reconnect to the server automatically even with this error and receive from where it left off. Do you find that it resumes receiving?
YijunXieMS commented 4 years ago

@amitgadhavi7 I'm not sure about your scenario. If you have I/O-bound operations after receiving the events, I recommend the async client instead of the sync one.

amitgadhavi7 commented 4 years ago

Hi Yijun,

We have approximately 7500 events in the Event Hub, but I only got 1145 events. The potential root cause could be how I write the events into JSON files. For each event, I create a JSON file and stamp the date/time in the file name together with the partition ID. However, the time precision is only one second, meaning I can keep at most 1 event/second/partition, which doesn't seem right. My guess is that I receive many events within one second and keep overwriting the same file again and again, which causes some events to go missing. How do I avoid this situation? What changes do I need to make?

YijunXieMS commented 4 years ago

@amitgadhavi7 could you verify by just counting the number of messages received? For instance, something like this:

received = 0

def on_event(partition_context, event):
    # Count every delivered event so the total can be compared with the
    # number of events expected from the Event Hub.
    global received
    received += 1
    print(received)

For the filename, you can add a serial number as part of the filename to avoid duplicate filenames; a sketch follows.
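
For example, a minimal sketch (assuming the event body is valid JSON) that uses the event's per-partition sequence number instead of a second-precision timestamp, so two events arriving in the same second can't collide:

import json

def on_event(partition_context, event):
    # sequence_number is unique and increasing within a partition, so
    # partition id + sequence number gives every event its own file.
    filename = "event_{}_{}.json".format(
        partition_context.partition_id, event.sequence_number
    )
    with open(filename, "w") as f:
        json.dump(event.body_as_json(), f)
    partition_context.update_checkpoint(event)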

YijunXieMS commented 4 years ago

@amitgadhavi7 Do you have more questions on this issue? If not, I'm going to close it.