Open KacperKluka opened 3 years ago
cc @SimonWoolf: the above log contains the messages exchanged. Is this a library issue or a realtime issue?
Log shows attach() being called at 14:27:24, then absolutely nothing for two minutes, then at 14:29:23 the library notices it's not actually connected (websocket lib noticing lack of pong response), and connects, successfully resumes, and does a presence sync all within a second.
So I guess the connection had silently died without properly closing the TCP connection? That can happen sometimes, but -- the question is why the library took two minutes to notice. Lack of activity should have triggered the maxIdleInterval activity check after 25s.
But this is completely reproducible. It doesn't sound like a connection randomly dropping
Well, this happens from time to time, but not always :thinking: I've tried to find a pattern in which this occurs but with no luck.
I think that I've discovered why this is happening. The issue is not with the presence.get(true) itself but with the place from which we call it. In my case, this method was called directly from a callback of the presence.enter() method. This means that the synchronous wait will happen on the WebSocketConnectReadThread, which is the same thread that calls the enter() callback.
My guess is that we have a race condition that sometimes leads to a deadlock. The get(true) blocks the thread and waits for data that is supposed to arrive on the thread that is blocked. When I call the presence.get(true) from another thread then everything works fine.
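To illustrate the suspected mechanism, here is a minimal sketch that uses only java.util.concurrent and no ably-java code. Everything in it is a hypothetical stand-in: readThread models the single thread that both delivers incoming data and invokes callbacks, syncComplete models the presence sync that get(true) waits for, and worker models "another thread". It is a simplified model of the behaviour described above, not the library's actual implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class CallbackDeadlockSketch {

    /**
     * Returns true if the simulated blocking get saw its data before the timeout.
     * offload == false: the blocking wait runs inside the callback, on the read thread.
     * offload == true:  the callback hands the blocking wait to a separate worker thread.
     */
    static boolean blockingGetFromCallback(boolean offload, long timeoutMs) {
        // Hypothetical stand-in for the thread that delivers data AND runs callbacks.
        ExecutorService readThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch syncComplete = new CountDownLatch(1);
        CompletableFuture<Boolean> outcome = new CompletableFuture<>();

        // Stand-in for a blocking get: the awaited data can only be delivered
        // by a task that runs on the read thread.
        Runnable blockingGet = () -> {
            readThread.execute(syncComplete::countDown);
            try {
                outcome.complete(syncComplete.await(timeoutMs, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                outcome.complete(false);
            }
        };

        try {
            // Stand-in for the enter() callback, which runs on the read thread.
            readThread.execute(() -> {
                if (offload) {
                    worker.execute(blockingGet); // callback returns immediately
                } else {
                    blockingGet.run();           // blocks the read thread itself
                }
            });
            return outcome.get(timeoutMs + 2000, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            readThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("wait inside callback succeeded: " + blockingGetFromCallback(false, 200));
        System.out.println("wait on worker thread succeeded: " + blockingGetFromCallback(true, 2000));
    }
}
```

In this model the first call only returns once the timeout expires, because the delivery task is queued behind the callback that is waiting for it. The second call completes promptly because the read thread is free to deliver the data. With a real SDK, the equivalent workaround would be to hand the blocking call to your own executor from inside the callback.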
Is it documented somewhere that we shouldn't call this method from the callbacks of other ably-java methods?
Good spot. It's not documented I think.
Just to tie in the "why 2 minutes" conversation, a separate bug was found at https://github.com/ably/ably-java/issues/932 where a silently dead connection would never time out due to lack of activity (only when the underlying transport itself timed out)
Some time ago there was an issue where channel.presence.get(true) was blocking the program execution, and it was fixed in this PR: https://github.com/ably/ably-java/pull/669
I've noticed that sometimes code execution still blocks on the same channel.presence.get(true), but for exactly 2 minutes. Maybe the code waits for something and then, when a timeout is thrown, it just continues?
Here's a log from the Ably SDK after calling the troublesome method: