ably / ably-java

Java, Android, Clojure and Scala client library SDK for Ably realtime messaging service
https://ably.com/download
Apache License 2.0
86 stars 39 forks source link

Presence.get(true) blocks execution for some time #694

Open KacperKluka opened 3 years ago

KacperKluka commented 3 years ago

Some time ago there was an issue that channel.presence.get(true) was blocking the program execution and it was fixed in this PR https://github.com/ably/ably-java/pull/669

I've noticed that sometimes code execution still blocks on the same channel.presence.get(true) but for exactly 2 minutes. Maybe the code waits for something and then when a timeout is thrown it just continues?

Here's a log from Ably SDK after calling the troublesome method:

14:27:24.726 io.ably.lib.realtime.Channel: attach(); channel = tracking:abc123qwe
14:29:23.874 io.ably.lib.transport.WebSocketTransport: onClose(): wsCode = 1006; wsReason = The connection was closed because the other endpoint did not respond with a pong in time. For more information check: https://github.com/TooTallNate/Java-WebSocket/wiki/Lost-connection-detection; remote = false
14:29:23.876 io.ably.lib.http.HttpCore: HTTP request: https://internet-up.ably-realtime.com/is-the-internet-up.txt GET
14:29:23.877 io.ably.lib.http.HttpCore:   Accept: application/json
14:29:23.877 io.ably.lib.http.HttpCore:   X-Ably-Lib: android-1.2.6
14:29:23.877 io.ably.lib.http.HttpCore:   X-Ably-Version: 1.2
14:29:24.373 io.ably.lib.http.HttpCore: HTTP response:
14:29:24.373 io.ably.lib.http.HttpCore: Accept-Ranges: bytes
14:29:24.373 io.ably.lib.http.HttpCore: CF-Cache-Status: DYNAMIC
14:29:24.374 io.ably.lib.http.HttpCore: CF-RAY: 67a007e448894e80-FRA
14:29:24.374 io.ably.lib.http.HttpCore: Connection: keep-alive
14:29:24.374 io.ably.lib.http.HttpCore: Content-Length: 4
14:29:24.374 io.ably.lib.http.HttpCore: Content-Type: text/plain
14:29:24.374 io.ably.lib.http.HttpCore: Date: Thu, 05 Aug 2021 12:29:24 GMT
14:29:24.374 io.ably.lib.http.HttpCore: ETag: "eb4585ad9fe0426781ed7c49252f8225"
14:29:24.374 io.ably.lib.http.HttpCore: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
14:29:24.374 io.ably.lib.http.HttpCore: Last-Modified: Wed, 14 Jan 2015 16:12:25 GMT
14:29:24.374 io.ably.lib.http.HttpCore: NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
14:29:24.374 io.ably.lib.http.HttpCore: Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=oPCqV%2FDx5sHAJ7L%2FkklQvhqk6CqwFycL4A0IYiDvgMjGylcljH%2Bz4BGXOK8RYcXpFl3PN3pMy412eikcWA5EFZ0mhdFD8E1oNyj3uXOhnD82mnXCZxiJARLNOc%2B9evcX41mrpb5TEAfVvNtN3Bxk"}],"group":"cf-nel","max_age":604800}
14:29:24.374 io.ably.lib.http.HttpCore: Server: cloudflare
14:29:24.374 io.ably.lib.http.HttpCore: x-amz-id-2: 8/Sk5SAUjf+GaO/4/BDvW8xo1+sXeHFD4EtPegOsgIz8CtUm9vimUwMEH0UdQw3MAfhWc/eXRCM=
14:29:24.375 io.ably.lib.http.HttpCore: x-amz-request-id: 0NKE009MBNZE49J6
14:29:24.375 io.ably.lib.http.HttpCore: x-amz-version-id: null
14:29:24.375 io.ably.lib.http.HttpCore: X-Android-Received-Millis: 1628166564373
14:29:24.375 io.ably.lib.http.HttpCore: X-Android-Response-Source: NETWORK 200
14:29:24.375 io.ably.lib.http.HttpCore: X-Android-Selected-Protocol: http/1.1
14:29:24.375 io.ably.lib.http.HttpCore: X-Android-Sent-Millis: 1628166564064
14:29:24.375 io.ably.lib.http.HttpCore: yes
14:29:24.375 io.ably.lib.transport.ConnectionManager: checkFallback: fallback to D.ably-realtime.com
14:29:24.376 io.ably.lib.transport.ConnectionManager: requestState(): requesting connecting; id = BjauyyPF-Q
14:29:24.376 io.ably.lib.transport.ConnectionManager: Wait ended by action: io.ably.lib.transport.ConnectionManager$AsynchronousStateChangeAction@49abaad
14:29:24.376 io.ably.lib.transport.ConnectionManager: setState(): setting connecting; reason null
14:29:24.378 io.ably.lib.transport.WebSocketTransport: close()
14:29:24.378 Auth.assertValidToken():
14:29:24.378 Auth.assertValidToken(): using cached token; expires = 1628170043986
14:29:24.378 TokenAuth.getTokenDetails():
14:29:24.379 io.ably.lib.transport.ITransport: getConnectParams: params = [accessToken:j8Cjlw.IB1_vb0_vlzIYNi21B8T6_mOV8dc0Gx8fcdoE4G-yOD0axGx-8eoJZS8KXhqieAu_YxpGU-CAuDP0OYFsBsaRwfaKAMW12bEEDUJaZWg2mCXz2zW8Iiw8TJejyoD9FQX_ggjIEdxlV6B5s0OQLACYJEpupLXZ3p4vuDkkXb-IqNBsTbg42cziG5Z_BQ7_VxkjXDcLlmTsYU1llWCsNWMItr0iCiMmmLguj-sWK6l--UQ, v:1.2, format:msgpack, resume:e9196qPXQB19Gm!BjauyyPF-QtYJD3s-51f69e9196qPXQB19Gm, connectionSerial:-1, clientId:<INSERT_CLIENT_ID_HERE>, heartbeats:false, lib:android-1.2.6]
14:29:24.380 io.ably.lib.transport.WebSocketTransport: connect(); wsUri = wss://D.ably-realtime.com:443/?accessToken=j8Cjlw.IB1_vb0_vlzIYNi21B8T6_mOV8dc0Gx8fcdoE4G-yOD0axGx-8eoJZS8KXhqieAu_YxpGU-CAuDP0OYFsBsaRwfaKAMW12bEEDUJaZWg2mCXz2zW8Iiw8TJejyoD9FQX_ggjIEdxlV6B5s0OQLACYJEpupLXZ3p4vuDkkXb-IqNBsTbg42cziG5Z_BQ7_VxkjXDcLlmTsYU1llWCsNWMItr0iCiMmmLguj-sWK6l--UQ&v=1.2&format=msgpack&resume=e9196qPXQB19Gm%21BjauyyPF-QtYJD3s-51f69e9196qPXQB19Gm&connectionSerial=-1&clientId=%3CINSERT_CLIENT_ID_HERE%3E&heartbeats=false&lib=android-1.2.6
14:29:24.650 io.ably.lib.transport.WebSocketTransport: onOpen()
14:29:24.651 io.ably.lib.transport.WebSocketTransport: checkActivity: ok
14:29:24.653 io.ably.lib.transport.WebSocketTransport: onMessage(): msg (binary) = io.ably.lib.types.ProtocolMessage@5c7d548
14:29:24.654 io.ably.lib.transport.ConnectionManager: onMessage() (transport = io.ably.lib.transport.WebSocketTransport [wss://D.ably-realtime.com:443/?accessToken=j8Cjlw.IB1_vb0_vlzIYNi21B8T6_mOV8dc0Gx8fcdoE4G-yOD0axGx-8eoJZS8KXhqieAu_YxpGU-CAuDP0OYFsBsaRwfaKAMW12bEEDUJaZWg2mCXz2zW8Iiw8TJejyoD9FQX_ggjIEdxlV6B5s0OQLACYJEpupLXZ3p4vuDkkXb-IqNBsTbg42cziG5Z_BQ7_VxkjXDcLlmTsYU1llWCsNWMItr0iCiMmmLguj-sWK6l--UQ&v=1.2&format=msgpack&resume=e9196qPXQB19Gm%21BjauyyPF-QtYJD3s-51f69e9196qPXQB19Gm&connectionSerial=-1&clientId=%3CINSERT_CLIENT_ID_HERE%3E&heartbeats=false&lib=android-1.2.6]): connected: {"action":4,"connectionDetails":{"clientId":"\u003cINSERT_CLIENT_ID_HERE\u003e","connectionKey":"e9196qPXQB19Gm!BjauyyPF-QtYJD3s-520afe9196qPXQB19Gm","connectionStateTtl":120000,"maxFrameSize":262144,"maxIdleInterval":15000,"maxInboundRate":50,"maxMessageSize":65536,"maxOutboundRate":100,"serverId":"frontend.7f86.2.eu-central-1-A.i-067dde2ad4d954e4e.e9196qPXQB19Gm"},"connectionId":"BjauyyPF-Q","connectionSerial":-1,"count":0,"flags":0,"timestamp":0}
14:29:24.654 io.ably.lib.transport.ConnectionManager: requestState(): requesting connected; id = BjauyyPF-Q
14:29:24.654 io.ably.lib.transport.ConnectionManager: Wait ended by action: io.ably.lib.transport.ConnectionManager$AsynchronousStateChangeAction@6c1efe1
14:29:24.654 io.ably.lib.transport.ConnectionManager: setState(): setting connected; reason null
14:29:24.655 io.ably.lib.transport.WebSocketTransport: onMessage(): msg (binary) = io.ably.lib.types.ProtocolMessage@afe106
14:29:24.655 io.ably.lib.realtime.Channel: sync(); channel = tracking:abc123qwe
14:29:24.655 io.ably.lib.transport.WebSocketTransport: send(); action = sync
14:29:24.655 io.ably.lib.transport.ConnectionManager: onMessage() (transport = io.ably.lib.transport.WebSocketTransport [wss://D.ably-realtime.com:443/?accessToken=j8Cjlw.IB1_vb0_vlzIYNi21B8T6_mOV8dc0Gx8fcdoE4G-yOD0axGx-8eoJZS8KXhqieAu_YxpGU-CAuDP0OYFsBsaRwfaKAMW12bEEDUJaZWg2mCXz2zW8Iiw8TJejyoD9FQX_ggjIEdxlV6B5s0OQLACYJEpupLXZ3p4vuDkkXb-IqNBsTbg42cziG5Z_BQ7_VxkjXDcLlmTsYU1llWCsNWMItr0iCiMmmLguj-sWK6l--UQ&v=1.2&format=msgpack&resume=e9196qPXQB19Gm%21BjauyyPF-QtYJD3s-51f69e9196qPXQB19Gm&connectionSerial=-1&clientId=%3CINSERT_CLIENT_ID_HERE%3E&heartbeats=false&lib=android-1.2.6]): attached: {"action":11,"channel":"tracking:abc123qwe","channelSerial":"e91YVD2-QB17Y919488119:-1","count":0,"flags":983045,"timestamp":0}
14:29:24.655 io.ably.lib.realtime.Channel: RealtimeChannel(); channel = tracking:abc123qwe
14:29:24.655 io.ably.lib.realtime.Channel: setAttached(); channel = tracking:abc123qwe, resumed = true
14:29:24.655 io.ably.lib.realtime.Channel: Server initiated attach for channel tracking:abc123qwe
14:29:24.656 io.ably.lib.transport.WebSocketTransport: send(): sync: {"action":16,"channel":"tracking:abc123qwe","count":0,"flags":0,"timestamp":0}
14:29:24.656 io.ably.lib.transport.WebSocketTransport: onMessage(): msg (binary) = io.ably.lib.types.ProtocolMessage@216d3c7
14:29:24.656 io.ably.lib.transport.ConnectionManager: onMessage() (transport = io.ably.lib.transport.WebSocketTransport [wss://D.ably-realtime.com:443/?accessToken=j8Cjlw.IB1_vb0_vlzIYNi21B8T6_mOV8dc0Gx8fcdoE4G-yOD0axGx-8eoJZS8KXhqieAu_YxpGU-CAuDP0OYFsBsaRwfaKAMW12bEEDUJaZWg2mCXz2zW8Iiw8TJejyoD9FQX_ggjIEdxlV6B5s0OQLACYJEpupLXZ3p4vuDkkXb-IqNBsTbg42cziG5Z_BQ7_VxkjXDcLlmTsYU1llWCsNWMItr0iCiMmmLguj-sWK6l--UQ&v=1.2&format=msgpack&resume=e9196qPXQB19Gm%21BjauyyPF-QtYJD3s-51f69e9196qPXQB19Gm&connectionSerial=-1&clientId=%3CINSERT_CLIENT_ID_HERE%3E&heartbeats=false&lib=android-1.2.6]): sync: {"action":16,"channel":"tracking:abc123qwe","channelSerial":"e9196qPXQB19Gm70224726:","count":0,"flags":0,"presence":[],"timestamp":0}
14:29:24.656 io.ably.lib.realtime.Channel: RealtimeChannel(); channel = tracking:abc123qwe
14:29:24.657 io.ably.lib.realtime.Channel: onSync(); channel = tracking:abc123qwe
14:29:24.657 io.ably.lib.realtime.Channel: onPresence(); channel = tracking:abc123qwe; syncChannelSerial = e9196qPXQB19Gm70224726:
14:29:24.657 io.ably.lib.realtime.Channel: setPresence(); channel = tracking:abc123qwe; broadcast = true; syncChannelSerial = e9196qPXQB19Gm70224726:
14:29:24.657 io.ably.lib.realtime.Channel: endSync(); channel = tracking:abc123qwe; syncInProgress = true

┆Issue is synchronized with this Jira Task by Unito

paddybyers commented 3 years ago

cc @SimonWoolf: the above log contains the messages exchanged. Is this a library issue or a realtime issue?

SimonWoolf commented 3 years ago

Log shows attach() being called at 14:27:24, then absolutely nothing for two minutes, then at 14:29:23 the library notices it's not actually connected (websocket lib noticing lack of pong response), and connects, successfully resumes, and does a presence sync all within a second.

So I guess the connection had silently died without properly closing the tcp connection? That can happen sometimes, but -- question is why did the library take two minutes to notice that? Lack of activity should have triggered the maxIdleInterval activity check trigger after 25s

paddybyers commented 3 years ago

But this is completely reproducible. It doesn't sound like a connection randomly dropping

KacperKluka commented 3 years ago

Well, this happens from time to time, but not always :thinking: I've tried to find a pattern in which this occurs but with no luck.

KacperKluka commented 2 years ago

I think that I've discovered why this is happening. The issue is not with the presence.get(true) itself but with the place from which we call it. In my case, this method was called directly from a callback of the presence.enter() method. This means that the synchronous wait will happen on the WebSocketConnectReadThread which is the same thread that calls the enter()'s callback.

My guess is that we have a race condition that sometimes leads to a deadlock. The get(true) blocks the thread and waits for data that is supposed to arrive on the thread that is blocked. When I call the presence.get(true) from another thread then everything works fine.

Is it documented somewhere that we shouldn't call this method from the callbacks of other ably-java methods?

paddybyers commented 2 years ago

Good spot. It's not documented I think.

AndyTWF commented 1 year ago

Just to tie in the "why 2 minutes" conversation, a separate bug was found at https://github.com/ably/ably-java/issues/932 where a silently dead connection would never time out due to lack of activity (only when the underlying transport itself timed out)