Clients not able to connect with error code 3502

planninpoker commented 1 year ago

Describe the bug

I'm seeing some users fail to connect, and the error code is always 3502. I've read its meaning in the documentation: "DisconnectStale issued to close connection that did not become authenticated in configured interval after dialing." But I don't understand what could be causing this. Any help would be amazing.

Versions

Centrifuge client version 5.0.1
Server version v5.1

Additional context

I have a React application which sometimes fails to connect. Maybe 5 in 1000 users fail.

  useEffect(() => {
    const client = new Centrifuge(centTransports, {
      getToken: handleConnectCent,
      emulationEndpoint: `https://${process.env.NEXT_PUBLIC_CENT_URL}/emulation`,
    });
    const sub = client.newSubscription(`retro-${formatUUID(retroId)}`);
    setClient(client);
    setSubscription(sub);
  }, [retroId]);

  useEffect(() => {
    if (client && subscription) {
      client.connect();
      subscription.subscribe();
    }
    return () => {
      if (client && subscription) {
        client.disconnect();
        subscription.unsubscribe();
        setClient(null);
        setSubscription(null);
      }
    };
  }, [client, subscription]);

FZambia commented 1 year ago

Hello, does it reproduce with a local Centrifugo server? Can you provide steps to reproduce, including the server configuration? Which transport are you using?

In general, this means that the server did not receive the first frame from the client over WebSocket (or another transport) within 10 seconds. I'm not sure which conditions could lead to this in a normal situation, hence the questions above.
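
To make the condition above concrete, here is a small TypeScript model of the check being described: a connection is closed with 3502 if no first frame arrives within the stale delay after dialing. This is only an illustration, not Centrifugo's actual implementation. `ConnectionTimeline`, `staleDisconnectCode`, and the constants are hypothetical names, and the 10000 ms value simply mirrors the "10 secs" mentioned above.

```typescript
// Illustrative model only; NOT Centrifugo's implementation.
// All names here are hypothetical. The delay mirrors the "10 secs" from the thread.
const STALE_DELAY_MS = 10000;
const DISCONNECT_STALE_CODE = 3502;

interface ConnectionTimeline {
  dialedAtMs: number;            // when the transport connection was established
  firstFrameAtMs: number | null; // when the first client frame arrived, or null if it never did
}

// Returns 3502 if, at time `nowMs`, the connection would be treated as stale,
// otherwise null (either a frame arrived in time or the deadline has not passed yet).
function staleDisconnectCode(
  conn: ConnectionTimeline,
  nowMs: number,
  staleDelayMs: number = STALE_DELAY_MS,
): number | null {
  const deadline = conn.dialedAtMs + staleDelayMs;
  const gotFrameInTime =
    conn.firstFrameAtMs !== null && conn.firstFrameAtMs <= deadline;
  if (gotFrameInTime) {
    return null;
  }
  return nowMs >= deadline ? DISCONNECT_STALE_CODE : null;
}
```

In this model, a client whose first frame never reaches the server (for whatever reason) ends up with 3502 once the delay has elapsed, which matches the symptom reported in this issue.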

planninpoker commented 1 year ago

Hey @FZambia

I'm not able to reproduce it locally. It works for 99% of clients, but I've watched at least 2 sessions lately where it just won't connect, and these 3502 errors are being thrown.

{
  "token_hmac_secret_key": "",
  "admin": true,
  "admin_password": "",
  "admin_secret": "",
  "api_key": "",
  "allowed_origins": [""],
  "http_stream": true,
  "sse": true,
  "emulation": true,
  "presence": true,
  "join_leave": true,
  "force_push_join_leave": true,
  "allow_subscribe_for_client": true,
  "allow_presence_for_subscriber": true
}

export const centTransports: TransportEndpoint[] = [
  {
    transport: "websocket",
    endpoint: `wss://${process.env.NEXT_PUBLIC_CENT_URL}/connection/websocket`,
  },
  {
    transport: "http_stream",
    endpoint: `https://${process.env.NEXT_PUBLIC_CENT_URL}/connection/http_stream`,
  },
  {
    transport: "sse",
    endpoint: `https://${process.env.NEXT_PUBLIC_CENT_URL}/connection/sse`,
  },
];

So the server is not responding to the connect? I was wondering if this could happen when WebSockets are blocked, but I figured one of the fallbacks would take over.

planninpoker commented 1 year ago

I've just found these errors in the logs as well:

[
  {
    level: "error",
    error: "unexpected EOF",
    time: "2023-10-17T13:44:56Z",
    message: "error reading body",
  },
  {
    level: "error",
    error: "unexpected EOF",
    time: "2023-10-17T13:45:13Z",
    message: "error reading body",
  },
  {
    level: "error",
    error: "unexpected EOF",
    time: "2023-10-18T13:03:23Z",
    message: "error reading body",
  },
  {
    level: "error",
    error: "unexpected EOF",
    time: "2023-10-20T09:29:17Z",
    message: "error reading body",
  },
  {
    level: "info",
    error: "unexpected end of JSON input",
    req: {},
    time: "2023-10-20T13:23:18Z",
    message: "can't unmarshal emulation request",
  },
];

FZambia commented 1 year ago

"but I've watched at least 2 sessions lately where it just won't connect"

Can these specific clients never connect? Or can they sometimes connect and sometimes not?
How exactly do you "watch" these sessions?
Is it possible to enable debug logs for these clients to look at them? ("debug": true in centrifuge-js configuration)
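
Alongside `debug: true`, one way to capture detail from affected clients is to record the client's lifecycle events. The sketch below assumes the event names `connecting`, `connected`, `disconnected`, and `error`, which centrifuge-js is understood to emit (verify them against the version in use). `EventfulClient`, `DiagnosticEntry`, and `attachDiagnostics` are hypothetical names, and the interface is a local stand-in so the snippet runs without the library.

```typescript
// Local stand-in for the small part of a client used here (hypothetical type).
// A real centrifuge-js client may need a cast or thin adapter to fit it.
interface EventfulClient {
  on(event: string, listener: (ctx: Record<string, unknown>) => void): unknown;
}

interface DiagnosticEntry {
  event: string;
  at: string; // ISO timestamp of when the event was observed
  ctx: Record<string, unknown>; // event context, e.g. code and reason on disconnect
}

// Assumed event names; check them against the client version in use.
const CLIENT_EVENTS: string[] = ["connecting", "connected", "disconnected", "error"];

// Subscribes to each lifecycle event and appends one entry per event to `sink`.
function attachDiagnostics(client: EventfulClient, sink: DiagnosticEntry[]): void {
  CLIENT_EVENTS.forEach((event) => {
    client.on(event, (ctx) => {
      sink.push({ event, at: new Date().toISOString(), ctx });
    });
  });
}
```

The collected entries could then be sent to whatever logging or session-recording tool is already in place, so that a 3502 disconnect can be read together with the events that preceded it.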

"I've just found these errors in the logs as well"

Do they correspond to the connection issues, or are they just all the errors you found in the logs? "unexpected EOF" may happen when a client goes away before the request is read; I suppose Centrifugo should suppress those.

The last error, "can't unmarshal emulation request", is interesting: I need to add more logging for it to understand what was sent. Does it correspond to the connection issues, or is it only one log entry?

planninpoker commented 1 year ago

Hey @FZambia, I appreciate the help with this.

The clients can never reconnect. For context, my app is a free web app where users create a room, that they join, and vote on things. A room was created, which equates to a channel, and no one was able to connect.
I watch these sessions with https://clarity.microsoft.com. It allows you to watch user sessions.
Yeah I will turn on debugging now, to hopefully catch the next one. Is there something specifically worth looking for with debugging on?

I see. These were just the only errors that the server had logged, so I thought they might be related. I had a hunch it might be a corporate firewall or something, as it was likely a team using the site. I've just installed a firewall on my machine with squid, and the switch to ssr is working flawlessly.

FZambia commented 1 year ago

"The clients can never reconnect"

A bit misleading... Can't connect at all or can't reconnect? Reconnect is a process after connection loss.

"Is there something specifically worth looking for with debugging on?"

All debug logs would be helpful, as there is no reproducer. They will probably give some insight into where to dig further.

"I've just installed a firewall on my machine with squid, and the switch to ssr is working flawlessly"

Does it prevent WebSocket Upgrade requests? Or does it allow Upgrade requests but block WebSocket frames after the connection Upgrade?

planninpoker commented 1 year ago

Sorry, typo. They cannot connect at all. I've tried both (blocking WebSockets, and blocking the initial WebSocket upgrade request), and both work fine on my machine.

I've done some more digging into this one scenario, and it looks like none of the users from this channel were able to use WebSockets. 5 of the 8 were able to connect with http_stream, and the other 3 never connected successfully. I added tracking on the successful connection types a while ago, to justify switching from pure WebSockets to centrifugal.
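
The kind of per-transport tracking mentioned above can be sketched as a simple tally. This is a hedged illustration and not the tracking code actually used in the app: `ConnectionAttempt` and `summarizeTransports` are hypothetical names, the user ids are made up, and the sample data only reproduces the counts reported for this channel (8 users, 5 on http_stream, 3 never connected).

```typescript
type TransportName = "websocket" | "http_stream" | "sse";

// Hypothetical record of how one user ended up connecting.
interface ConnectionAttempt {
  userId: string;
  connectedVia: TransportName | null; // null if the user never connected
}

// Counts users per successful transport, plus those who never connected.
function summarizeTransports(attempts: ConnectionAttempt[]): Record<string, number> {
  const counts: Record<string, number> = {
    websocket: 0,
    http_stream: 0,
    sse: 0,
    never_connected: 0,
  };
  for (const attempt of attempts) {
    const key = attempt.connectedVia ?? "never_connected";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Sample data mirroring the channel described above (ids are made up).
const channelAttempts: ConnectionAttempt[] = [];
for (let i = 1; i <= 8; i++) {
  channelAttempts.push({
    userId: `user-${i}`,
    connectedVia: i <= 5 ? "http_stream" : null,
  });
}

const summary = summarizeTransports(channelAttempts);
// summary: { websocket: 0, http_stream: 5, sse: 0, never_connected: 3 }
```

A channel where `websocket` is zero for every user, as here, points at something shared by the whole group (for example a common network) rather than at individual browsers.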

FZambia commented 1 year ago

Got it. Well, I suppose this simplifies the task in general, because it would be much harder to find the root cause if the issue were only occasional for those clients.

One idea I have at this point is that those users are behind a proxy which allows the Upgrade but blocks the WebSocket frames sent after it. In that case we get a transport open event on the frontend, and centrifuge-js won't try the fallbacks afterwards because it thinks the WebSocket connection was successful. Though in this case I believe the client-side timeout (5 sec by default) should fire first and cause a disconnect from the client side before the server decides to close the connection with the DisconnectStale reason. Two questions:

Did you tune the centrifuge-js timeout option, or are you using the default?
Did you observe 3502 in the server logs or in the frontend client logs?

This is a bit of a shot in the dark for now; debug logs may show that I am thinking in the wrong direction.
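
The hypothesis above can be modeled in a few lines of TypeScript. This is a simplified sketch of the reasoning, not the real centrifuge-js transport logic: `TransportBehavior`, `simulateConnect`, and the outcome fields are hypothetical, and the defaults only mirror the figures quoted in this thread (5 sec client timeout, 10 sec server stale delay). It ignores retries and reconnects.

```typescript
// How a transport behaves on a given network (hypothetical categories).
type TransportBehavior = "open_fails" | "opens_but_frames_blocked" | "works";

interface TransportUnderTest {
  transport: string;
  behavior: TransportBehavior;
}

interface Outcome {
  chosen: string | null; // transport the client committed to, if any
  connected: boolean;
  closedBy: "client_timeout" | "server_stale" | null;
}

// Walks the transports in order. A failed open falls back to the next one;
// a successful open commits to that transport even if frames never get through.
function simulateConnect(
  transports: TransportUnderTest[],
  clientTimeoutMs: number = 5000,
  serverStaleDelayMs: number = 10000,
): Outcome {
  for (const t of transports) {
    if (t.behavior === "open_fails") {
      continue; // fallback to the next transport
    }
    if (t.behavior === "works") {
      return { chosen: t.transport, connected: true, closedBy: null };
    }
    // Opened, but frames are blocked: no fallback, whichever timer is shorter fires first.
    const closedBy: Outcome["closedBy"] =
      clientTimeoutMs < serverStaleDelayMs ? "client_timeout" : "server_stale";
    return { chosen: t.transport, connected: false, closedBy };
  }
  return { chosen: null, connected: false, closedBy: null };
}
```

With the defaults, a WebSocket that opens but carries no frames ends in `client_timeout` rather than `server_stale`, which is why the answers to the two questions above matter: a 3502 seen on the client does not fit this model cleanly.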

FZambia commented 1 year ago

@planninpoker hello, I'm still waiting for answers to the questions above and some logs from those clients. Did you have a chance to address this?

planninpoker commented 1 year ago

Hey @FZambia

I did not fine tune the timeout option, so I am using the default.
I observed all of the errors on the client, so client logs.

I haven't experienced any more issues with other clients since raising this, so I stopped investigating. I'm happy to mark this as closed, and assume it had something to do with the users.

FZambia commented 1 year ago

Got it, thx. Let's close for now then; more information is needed to understand what's going on and whether it was an issue with the SDK or not. Since you see 3502 on the client side, WS frames from server to client must be working fine, and it's hard to imagine a situation where a proxy allows only server-to-client WS frames (but who knows!).

centrifugal / centrifuge-js

Clients not able to connect with error code 3502 #263