enisdenjo / graphql-ws

Coherent, zero-dependency, lazy, simple, GraphQL over WebSocket Protocol compliant server and client.
https://the-guild.dev/graphql/ws
MIT License
1.74k stars 157 forks source link

Client to server ping message #117

Closed michelepra closed 3 years ago

michelepra commented 3 years ago

Browser is very lazy to detect websocket disconnect, it's responsive only when client try to send some frame to server. To do this with current graphl-ws protocol i need query exposed in * .graphl service that is only need for websocket transport and only for keep alive. This is a small hack use of graphql due to protocol absence of this feature. subscriptions-transport-ws has similar concept but managed from server to client

This issue is for discussion of new message type that client or server can periodically sent (with payload to make more complete) similar to connection_init. The response that client or server send can be customized with payload, similar to connection_ack

andyrichardson commented 3 years ago

@michelepra I'm 100% with you on the need for this πŸš€

~This kind of heartbeat/ping-pong looks to be a recommended approach.~

Edit: Bad link - see application layer example here

michelepra commented 3 years ago

This kind of heartbeat/ping-pong looks to be a recommended approach.

This kind is only available on server side and is already mentioned in #10 What I have proposed is not related to websocket (that ping/pong is) but in application layer transport protocol (that graphl-ws is)

andyrichardson commented 3 years ago

My bad - yeah we are talking about the same thing, I linked to the wrong resource πŸ‘

Application layer ping/pong added to the graphql-ws protocol so we can determine whether messages are being sent/received (without consideration for TCP timeouts)

pranaypratyush commented 3 years ago

This thing is SUPER important for me as well.

enisdenjo commented 3 years ago

For keeping the client alive, you should rely on the Pings and Pongs: The Heartbeat of WebSockets. Pong frames are automatically sent in response to ping messages as required by the spec. This feature is a mandatory built-in, all browsers that don't support Ping and Pong are simply not RFC6455 compliant.

Also, it is a part of this lib already: https://github.com/enisdenjo/graphql-ws/blob/94ea423be34572dc363d1c6abab4404e2e83be55/src/use/ws.ts#L64-L87

Please beware, these ping-pong frames are built-in! Most browsers probably wont show them in the network developer tools. If you do indeed want to see them: use Firefox (it does expose these WS Protocol frames).

Application layer ping/pong added to the graphql-ws protocol so we can determine whether messages are being sent/received (without consideration for TCP timeouts)

@andyrichardson why would you need a message to determine this? If a WS connection is open, you're guaranteed to have the messages flow through. Also, you'd have to send that "are-your-there?" message in the first place - so you're checking if messages are sent/received by sending a message.

Browser is very lazy to detect websocket disconnect, it's responsive only when client try to send some frame to server.

@michelepra if WS connections close prematurely at any point in time, graphql-ws will reconnect silently and automatically. On the other hand, the servers pings should keep the browser around.

Nevertheless, these types of checks go well beyond the Protocol (and thus graphql-ws because it is a minimal reference implementation).

P.S. We had a concrete message before, but it was cut off in favour of the specified ping-pongs. Check the PR out: https://github.com/enisdenjo/graphql-ws/pull/11.

andyrichardson commented 3 years ago

@enisdenjo thanks for the response πŸ™

If a WS connection is open, you're guaranteed to have the messages flow through

I can't speak for everyone else but this is what I'm struggling to guarantee.

If you spin up this sandbox and disconnect from the network (wifi off, unplug ethernet, etc) you'll see that onClose doesn't get called... at least not for 10 minutes πŸ˜…

The lack of pong from the client will allow the server to know that the client is unresponsive, but the same can't be said the other way round. The client will consider the socket to be alive until the TCP timeout is reached (even though the connection may have been closed minutes earlier by the server).

My guess here is that we need some kind of mechanism for the client to know that the server is responsive (which doesn't look to be an option on the ws layer - at least not in JS land)

Edit: There's a good explanation here about the need for application layer pinging for client side aliveness checks

michelepra commented 3 years ago

P.S. We had a concrete message before, but it was cut off in favour of the specified ping-pongs. Check the PR out: #11.

Server side ping/pong can be managed and is that is implemented in #11

Client side ws !== js websocket and here we cannot send ping frame. Browser cannot detectet network failure until we try to put some frame on opened connection. Using concrete message is the only thing that solve network failure detection in client side. This can be reached with specific graphql query, that is out of graphl-ws protocol and is a bit evil use of graphql, or can be managed by protocol using specific message type like for example

interface ConnectionPingMessage {
  type: 'connection_ping';
  payload?: Record<string, unknown>;
}
interface ConnectionPongMessage {
  type: 'connection_pong';
  payload?: Record<string, unknown>;
}
enisdenjo commented 3 years ago

Edit: There's a good explanation here about the need for application layer pinging for client side aliveness checks

@andyrichardson the comment you referenced simply describes the lack of a JS API for ping-pongs. This is intentional because the browser should be handling the connections, not the applications.

The main reason for delayed connection checks is resource conservation (idling browsers have their JS paused). Adding to your example, if you tried issuing a GQL operation (send a WS message) the browser will re-evaluate the connection and notice that the socket is no more. It will then report the closure and graphql-ws will handle the rest (reconnect silently) - the user will notice nothing.

Even if we add the "ping-pong" concrete message, it will suffer from the exact same problem. The browser will go in idling state (which you cant and shouldnt control), you'll try pinging a server, the browser will notice the connection is gone and graphql-ws will kick in again in the exact same way.

Given the current set of laid out arguments, a ping-pong message is something I am not considering adding. Sorry to be blunt guys, but - you're overcomplicating it!

If you would like to discuss this in further detail, or you would like to raise more arguments: please consider leaving a comment over at the PR for standardising this GraphQL over WebSocket Protocol and have the rest of the brilliant developers engage.

michelepra commented 3 years ago

Even if we add the "ping-pong" concrete message, it will suffer from the exact same problem. The browser will go in idling state (which you cant and shouldnt control), you'll try pinging a server, the browser will notice the connection is gone and graphql-ws will kick in again in the exact same way.

This is the purpose of this issue. If the browser will not notice the connection is gone we never kick in againg until we explicitly call some operation.

User experience in this scenario is very poor because we cannot tell the connection is gone and data that we expect to receive from subscription is never receiving. Ther's more use case when subscription is call once and then put aside until it receives data.

Is more elegant and minor data consumption that this is done by protocol instead of user specific operation.

PS. I see in PR one screenshot of spotify websocket like an example, and notice that ping/pong is also exposed in application layer. WTF why they do this!?!? If connection lost they can tell as soon as possibile to user that connection is broken. This shows how this necessity is more common than one would like to believe

enisdenjo commented 3 years ago

@michelepra, @andyrichardson, I think I am getting the hang of it now. Let me try summarising:

The client might take too long to notice that the connection is gone (for whatever reason), this is bad UX because the user will have to wait for the "browsers time" to reconnect (which might take 10mins+ as @andyrichardson points out).

In this case, a client ping-pong does indeed make sense. Sorry for taking me too long to realise this, glad you guys stuck around! πŸ˜…

I want to do a bit more research before I approach this enhancement. But, I will most certainly consider this!

enisdenjo commented 3 years ago

@michelepra, I see you'd like to have a payload accompany the ping/pong messages. Could you please elaborate a bit more on how and why would this payload be helpful?

michelepra commented 3 years ago

The client might take too long to notice that the connection is gone (for whatever reason), this is bad UX because the user will have to wait for the "browsers time" to reconnect (which might take 10mins+ as @andyrichardson points out).

Even worse, user cannot know that must wait reconnecting, because in it's point of view all is ok until websocket is closed, and this is done only when try to send some frame. This is the reason of having ping/pong in graphql-ws protocol

@michelepra, I see you'd like to have a payload accompany the ping/pong messages. Could you please elaborate a bit more on how and why would this payload be helpful?

I think in more than one use case of payload.... the simplest is for logging, the hardest is for event driven ping/pong. Payload must be optional because it's purpose is in application layer logic, not in protocol

andyrichardson commented 3 years ago

Sorry to be blunt guys, but - you're overcomplicating it!

While savage, I totally agree

The client might take too long to notice that the connection is gone (for whatever reason), this is bad UX because the user will have to wait for the "browsers time" to reconnect (which might take 10mins+ as @andyrichardson points out).

Yup this is exactly the problem I'm stuck with.

My current workaround is to use the online state and assume that if online is true and the socket is open, we will be receiving messages in real time.

If we go offline or the socket is closed, then we communicate to the user that they won't be receiving updates.

Does this cover all use cases? That's what I'm not so confident about πŸ€”

enisdenjo commented 3 years ago

Yeah, I always favour good UX and DX; but, I had a hard time understanding the needs for this issue.

Nevertheless, it is pretty clear now and after I did my research I'll come back with a PR (or more questions).

enisdenjo commented 3 years ago

Doing further research I feel like I will indeed stay behind my initial decision - NOT supporting client->server pings. The reason is quite simple, please bear with me.

Network loss does not necessarily mean that the socket needs to be closed, as mentioned in a chromium issue report comment, simply because:

... temporal disconnected sockets might recover and continue to send and receive subsequent packets without any packet loss. Timeout depends on OS.

For example, disable WLAN and re-enable it with the same IP address before it become timeout. Your dammed sockets will be back to send and receive. At that time, both of client and server could not detect closed connection. They can continue communication without any additional opening handshake or session recovery.

In essence, the connection might recover on its own and continue as nothing had happened. Since the lack of (or delayed) connection close events, you guys are mentioning, can occur really only on network disconnects - adding a client->server ping for that case exclusively is not something I want to cover because of the aforementioned arguments.

Furthermore, doing a ws.send while the network is down WILL NOT report the error right away. It will still wait for the OS to settle and decide that the underlaying TCP connection is really down before reporting it through the browser. So, our solution (pinging server from the client) to catch network disruptions as soon as possible WONT work either, you're still stuck with having the OS decide that the socket is closed...

What will happen actually is that the send packets will get queued, and once the network is up again, they'll be sent through. However, if the network never recovers, the OS/browser will report a close event in a timely fashon.

A demonstration of a temporal network down without the socket actually getting closed is below. Please note how doing a ws.send while offline will NOT report a close event and the queued messages will go through once the network recovers.

temporal-network-down

michelepra commented 3 years ago

Network loss does not necessarily mean that the socket needs to be closed, as mentioned in a chromium issue report comment

This is not the purpose of this issue. Closing socket must not be fired by graphql-ws client out of application logic due to missing pong response, but by browser/OS that detect tcp connection is down. The behaviour of force close websocket and run reconnecting must be in application logic if for application this is requested and managed.

... temporal disconnected sockets might recover and continue to send and receive subsequent packets without any packet loss. Timeout depends on OS.

Exactly. Reported issue of chrome (that is very old) tell behave differently depending on the OS and that's why the state has been forced WontFix because is not a bug of websocket spec implementation. For example on Windows it throws close event after 60-70 seconds (now with chrome 88 on windows 10 is more short), on linux (??), chromeOS never fired... etc etc

In essence, the connection might recover on its own and continue as nothing had happened. Since the lack of (or delayed) connection close events, you guys are mentioning, can occur really only on network disconnects - adding a client->server ping for that case exclusively is not something I want to cover because of the aforementioned arguments.

This is not quoted here but ping/pong in application layer logic is not only exclusively for detect network disconnects but also for keep connection active as toyoshim say in comment in chrome issue.

If you want to keep the connection in active, you can implement application-level ping-pong or heartbeats checker with your favorite intervals.

For example a few years ago i developed web application where communication with server is based on websocket. Client do some requests that on server was managed similar as graphql subscription. Server was write in java and based on jetty websocket that implement and manage ping/pong websocket spec. Whithout interval messages in application layer like ping from client the socket on server was closed (and session associated closed) and client must continuosly reopen new websocket and rerun previous requests.

So, our solution (pinging server from the client) to catch network disruptions as soon as possible WONT work either, you're still stuck with having the OS decide that the socket is closed...

If application based on graphql-ws wait only close event to catch network disruptions it's partly true due to different OS behaviour. With this in mind client application can settle down some logic based on ping/pong (that now is not possible due to the lack of this in the graphql-ws protocol, this only can be simulated with graphql query made specifically for this purpose)

What will happen actually is that the send packets will get queued, and once the network is up again, they'll be sent through. However, if the network never recovers, the OS/browser will report a close event in a timely fashon.

As quoted above this is based on OS where browser run

andyrichardson commented 3 years ago

Thanks for looking into this @enisdenjo

So, our solution (pinging server from the client) to catch network disruptions as soon as possible WONT work either, you're still stuck with having the OS decide that the socket is closed...

I think sending ping/pong messages would still be a valid approach.

The example flow I would have expected would be:

Alive

Dead

michelepra commented 3 years ago
Show `example` > **Alive** > > * Client sends ping message (we don't care whether the server gets it or not) > * setTimeout is introduced which closes the socket after X seconds > * pong message is received and timeout is cleared > * repeat > > **Dead** > > * Client sends ping message (we don't care whether the server gets it or not) > * setTimeout is introduced which closes the socket after X seconds > * pong message is not received within X seconds, close socket and state is now updated

@andyrichardson the example can be ok but be careful of chained timers

enisdenjo commented 3 years ago

Guys, you're missing one key point, the connection might recover. Terminating it early will stop all subscriptions and discard all messages that the server might have sent (or the client) in the meantime because it does not know that the client went away yet. By allowing the OS/browser to control the connection, you will receive all queued up messages if the connection recovers; and if not, you'll get a close event and the client will reconnect silently.

Another crucial thing you're overlooking is that you want to close the connection early while the user is offline. No retries or ping-pongs will put the user back online... You might even experience downsides, like for example closing the connection unnecessarily (OS/browser would've recovered), having the client retries get exceeded and then fail for good.

Why do you want to retry while the user is offline? What do you gain by reporting an error early (and even unnecessarly)?

On the other hand, if the OS/browser manages to recover the connection when the user comes back online - messages from both sides will be flushed; and if it does not manage to recover, it will report a close event AFTER the user came back online - allowing the client to retry exactly when it should.

This is not quoted here but ping/pong in application layer logic is not only exclusively for detect network disconnects but also for keep connection active as toyoshim say in comment in chrome issue.

@michelepra the connection will be kept alive because of the server pings as long as the client is online.

michelepra commented 3 years ago

Guys, you're missing one key point, the connection might recover. Terminating it early will stop all subscriptions and discard all messages that the server might have sent (or the client) in the meantime because it does not know that the client went away yet. By allowing the OS/browser to control the connection, you will receive all queued up messages if the connection recovers; and if not, you'll get a close event and the client will reconnect silently.

Server side is not browser, here connection lost can be detected with websocket ping/pong frame, when connection is closed all subscriptions complete and new messages are discarded Client side will receive all queued up messages (if server side not detect connection lost) when connection recovers, but we will never had close event when necessary or detecting server is down until we send ping message to server

Another crucial thing you're overlooking is that you want to close the connection early while the user is offline. No retries or ping-pongs will put the user back online... You might even experience downsides, like for example closing the connection unnecessarily (OS/browser would've recovered)

Close the connection is only one of possibilities. This must be in client side application logic, not in graphql-ws. For example you consider unnecessarily to closing the connection because browser would've automatically recovered. This point of view can be true for one application but not for other, use case can be very different. Therefore it would be better not to consider the examples when defining a transport protocol such as graphql-ws. What people do with client side ping/pong must be out of protocol definition

having the client retries get exceeded and then fail for good.

This means severe network problems (like network devices broken or ISP down) and this must be managed in application, How to do this if we never had close event or we never receiving pong message? Easy, now we cannot because we haven't ping/pong client message

Why do you want to retry while the user is offline? What do you gain by reporting an error early (and even unnecessarly)?

If client application is real time data based, we must consider to notify user as soon as possible that is not online.

On the other hand, if the OS/browser manages to recover the connection when the user comes back online - messages from both sides will be flushed; and if it does not manage to recover, it will report a close event AFTER the user came back online - allowing the client to retry exactly when it should.

And if client never, or very long time, came back online? Now we must considering we cannot show user that is not online due to the lack of client side ping/pong

@michelepra the connection will be kept alive because of the server pings as long as the client is online.

I reported this as example, some year ago (and chrome issue is 10 year) this happend on server side where ping/pong of websocket spec was correctly implemented. I will not exclude this was bug of jetty websocket servlet, but this is not the subject of this issue

asmeikal commented 3 years ago

I'm leaving my two cents here based on my experience: the need for early detection of connection loss, and thus for some kind of ping from the client, is useful when a connection loss triggers a fallback. In the project I'm working on subscriptions update continuously the resources that the user is working on, and if we detect a socket disconnection we start polling these resources, as other services may still be working. Detecting earlier that the connection is closed even when it could be re-established (without message loss) is something that can improve the user experience: instead of seeing the client not updating, the fallback can kick in and update the resources.

But I believe that this should not be a protocol concern, and it can be implemented relatively easily with queries at regular intervals using the same client as the subscriptions, or a subscription that sends messages regularly from the server paired with a timeout on the client.

michelepra commented 3 years ago

But I believe that this should not be a protocol concern, and it can be implemented relatively easily with queries at regular intervals using the same client as the subscriptions, or a subscription that sends messages regularly from the server paired with a timeout on the client.

@asmeikal Is exactly that i said above with "this only can be simulated with graphql query made specifically for this purpose" (and that's what I did because I have no alternative, and the reason why I opened this issue)

I do not agree with "this should not be a protocol concern" because expose and use query (that is message protocol) only to monitoring websocket is an abuse due to the lack on graphql-ws (that is transport protocol) of concrete message type to gain this result.

enisdenjo commented 3 years ago

Client side will receive all queued up messages (if server side not detect connection lost) when connection recovers, but we will never had close event when necessary or detecting server is down until we send ping message to server

@michelepra Actually, until the pong message receive times out. Server downs will be detected immediately, only network disruptions of the client will stall. And for a good reason, they might recover and the messages flush.

Close the connection is only one of possibilities. This must be in client side application logic, not in graphql-ws.

@michelepra You are right, this is application, subjective, logic - something graphql-ws shouldnt be responsible for.

For example you consider unnecessarily to closing the connection because browser would've automatically recovered. This point of view can be true for one application but not for other, use case can be very different. Therefore it would be better not to consider the examples when defining a transport protocol such as graphql-ws. What people do with client side ping/pong must be out of protocol definition

@michelepra Can you give me one other example? Just for the example sake.

This means severe network problems (like network devices broken or ISP down) and this must be managed in application, How to do this if we never had close event or we never receiving pong message? Easy, now we cannot because we haven't ping/pong client message

@michelepra Simple, outside of graphql-ws. Just listen for the online and offline events. You absolutely should not rely on a transport implementation (or protocol) for detecting online/offline statuses.

If client application is real time data based, we must consider to notify user as soon as possible that is not online.

@michelepra Again, outside of graphql-ws. There are other primitives that detect online status, like online and offline events for browsers, and other utilities for NodeJS.

And if client never, or very long time, came back online? Now we must considering we cannot show user that is not online due to the lack of client side ping/pong

@michelepra still a use case for online and offline events.

I am not in favour of adding an additional protocol requirement just to detect the online status... If you want absolutely want to kick the client off as soon as the browser goes offline, you can simply do:

import { createClient } from 'graphql-ws';

let activeSocket;
const client = createClient({
  url: 'wss://i.close/on/offline/event',
  on: {
    connected: (socket) => {
      activeSocket = socket;
    },
    closed: () => {
      activeSocket = undefined;
    },
  },
});

window.addEventListener('offline', () => {
  activeSocket?.close(1001, 'Going Away');
});

This πŸ‘† would be more reliable than any ping-pong message a protocol (or an implementation) could offer. You can even go one step further and setTimeout on the activeSocket.close to apply a delay. Note that closing the connection will also toggle the clients silent retries, so you're not loosing any subscriptions or state.

However, I would absolutely not recommend this, you should leave the heavy lifting to the OS/browser.

But I believe that this should not be a protocol concern, and it can be implemented relatively easily with queries at regular intervals using the same client as the subscriptions, or a subscription that sends messages regularly from the server paired with a timeout on the client.

@asmeikal You dont even need queries, the example above achieves exactly what this issue is describing.

enisdenjo commented 3 years ago

Quick heads up, I am marking this issue as wontfix since I feel that the presented arguments are not sufficient for pushing this change. I also changed the title just so that it is a bit more clear about what this issue is describing.

However, I will still keep this issue open. It can be that I am missing some details or am overseeing crucial benefits. I might be stubborn, but am not blind. Given enough real-world examples and well-established arguments, I will most certainly change my mind in favour of the requested changes. Sorry to everyone if I am a bit annoying, but I just want to make sure we are paving the pathway in the right direction!

To all future readers of this issue, I am welcoming you all to drop your two cents!

enisdenjo commented 3 years ago

πŸ‘‹ to everyone! Hope you find the following information helpful.

I've added a "client to server ping" recipe explaining how to implement a client to server ping/pong with timeout detection, custom request/response data support and latency measurement.

The server implementation leverages a custom pinger schema which is used when the client issues a ping request. Since the pinger schema substitutes your schema to fulfil the request, you're free to design the ping/pong data requirements on the fly without ever polluting your official GraphQL schema.

On the other side, the client implements a ping sent following the specified interval with a pong wait timeout that closes the socket as soon as it gets exceeded, inducing silent reconnection. Together with this, the pinged and ponged times are recorded for calculating the latency.

Note that the recipe is a showcase implementation and you should use it as a boilerplate for satisfying your exact ping/pong needs.

Even though you can check the recipe out, I'll share the implementation here for convenience.

// πŸ›Έ server

import {
  GraphQLSchema,
  GraphQLObjectType,
  GraphQLNonNull,
  GraphQLString,
} from 'graphql';
import ws from 'ws'; // yarn add ws
import { useServer } from 'graphql-ws/lib/use/ws';
import { schema } from './my-graphql-schema';

// a custom graphql schema that holds just the ping query.
// used exclusively when the client sends a ping to the server.
// if you want to send/receive more details, simply adjust the pinger schema.
const pinger = new GraphQLSchema({
  query: new GraphQLObjectType({
    name: 'Query',
    fields: {
      ping: {
        type: new GraphQLNonNull(GraphQLString),
        resolve: () => 'pong',
      },
    },
  }),
});

const wsServer = new WebSocket.Server({
  port: 443,
  path: '/graphql',
});

useServer(
  {
    schema: (_ctx, msg) => {
      if (msg.payload.query === '{ ping }') return pinger;
      return schema;
    },
  },
  wsServer,
);
// πŸ“Ί client

import { createClient } from 'graphql-ws';

let connection: WebSocket | undefined;
const client = createClient({
  url: 'wss://client.can/send-pings/too',
  on: {
    connected: (socket) => (connection = socket as WebSocket),
    closed: () => (connection = undefined),
  },
});

async function ping() {
  // record the ping sent at moment for calculating latency
  const pinged = Date.now();

  // if the client went offline or the server is unresponsive
  // close the active WebSocket connection as soon as the pong
  // wait timeout expires and have the client silently reconnect.
  // there is no need to dispose of the subscription since it
  // will eventually settle because either:
  // - the client reconnected and a new pong is received
  // - the retry attempts were exceeded and the close is reported
  // because if this, the latency accounts for retry waits too.
  // if you do not want this, simply dispose of the ping subscription
  // as soon as the pong timeout is exceeded
  const pongTimeout = setTimeout(
    () => connection?.close(4408, 'Pong Timeout'),
    2000, // expect a pong within 2 seconds of the ping
  );

  // wait for the pong. the promise is guaranteed to settle
  await new Promise<void>((resolve, reject) => {
    client.subscribe<{ data: { ping: string } }>(
      { query: '{ ping }' },
      {
        next: () => {
          /* not interested in the pong */
        },
        error: reject,
        complete: resolve,
      },
    );
    // whatever happens to the promise, clear the pong timeout
  }).finally(() => clearTimeout(pongTimeout));

  // record when pong has been received
  const ponged = Date.now();

  // how long it took for the pong to arrive after sending the ping
  return ponged - pinged;
}

// keep pinging until a fatal problem occurs
(async () => {
  for (;;) {
    const latency = await ping();

    // or send to your favourite logger - the user
    console.info('GraphQL WebSocket connection latency', latency);

    // ping every 3 seconds
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
})();
idevelop commented 3 years ago

Hey folks! In our app we currently use Apollo with subscriptions-transport-ws (for all queries, not just subscriptions), and I've been noticing that the app fails to reliably reconnect when the computer resumes from a long sleep.

I'm still early in my investigation so I haven't completely narrowed down the issue. At first I suspected the exponential backoff was causing it to take too long to retry, so I capped the max timeout to 2 seconds, but it doesn't seem to have helped. It actually seems like the websocket still thinks it's open (I can see the client sending operations in the WS tab, but no response is received), so I think my problem looks a lot like what you're solving for here (temporary network interruption).

@enisdenjo this thread has been super useful! I'm wondering if this specific flow (closing the laptop and reopening it after a long while) is something you considered when evaluating what to do here, and if you have any advice on how to resolve it.

In my opinion it's strongly desirable for the client to identify ASAP that the websocket is useless, and reconnect it, and IMO this behaviour should be part of the library rather than an app-level concern.

I understand that leaving the connection alone means there's a chance it resumes without reconnecting, and you might send/receive flushed messages you otherwise would have lost, but I don't think that's a good enough reason to risk a (potentially long) "stale" period, effectively lying to the app about the connection state.

Subscriptions by definition only receive events happening during the subscription (and implicitly the connection). If an interruption in the network/connection happens, you might lose some events, that's life. It's up to the app to decide if that interruption needs to be handled somehow. But to do that, it needs to know about it ASAP. If it doesn't, and it thinks it's still connected and simply nothing is happening, that's super hard to build a good user experience around. My point is: the app should already be solving for potentially missing events, and the library shouldn't try to make that more reliable at the expense of a potentially long stale period.

sneko commented 3 years ago

Hi all :)

I did not know about browsers having to implement as built-in a response pong when receiving the ping. Thought it would be only the library as some do. Glad to hear it!

But... in the meantime @enisdenjo I'm wondering: if WS transport librairies like yours considers delegating that to the browser, what will happen if I use your library inside a Node.js program? Do you know if the Node.js core will take care of managing as "built-in" the automatic pong reply?

Thank you,

enisdenjo commented 3 years ago

@enisdenjo this thread has been super useful! I'm wondering if this specific flow (closing the laptop and reopening it after a long while) is something you considered when evaluating what to do here, and if you have any advice on how to resolve it.

@idevelop any form of connection loss is considered. If the user closes the laptop and the laptop does not immediately disconnect, it can recover if opened in time. As mentioned in https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-784352250, all pending messages will be flushed when the underlying TCP connection is recovered.

In my opinion it's strongly desirable for the client to identify ASAP that the websocket is useless, and reconnect it, and IMO this behaviour should be part of the library rather than an app-level concern.

@idevelop the only reliable way of immediately detecting a connection loss, as mentioned in https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-785261086, is through online and offline events. You can find an example also in https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-785261086 that immediately gets rid of the socket when user goes offline and starts the silent retries (even though I wouldnt recommend this because you should let the browser decide when the socket is really gone).

It's up to the app to decide if that interruption needs to be handled somehow. But to do that, it needs to know about it ASAP. If it doesn't, and it thinks it's still connected and simply nothing is happening, that's super hard to build a good user experience around.

@idevelop as mentioned, you want to know this ASAP, so ping/pongs wont help since they are messages too and they also fall into the category of being flushed through a recovered connection... The only way to detect connection loss immediately, really, is to rely on online and offline events. There is no isomorphic way of solving this inside the library, and I also dont think the library should even check the internet connection as it falls out of its scope - graphql-ws is about reliable transport.

if WS transport librairies like yours considers delegating that to the browser, what will happen if I use your library inside a Node.js program? Do you know if the Node.js core will take care of managing as "built-in" the automatic pong reply?

@sneko ping/pongs are a part of the RFC standard, a client MUST respond with a ping to a pong ASAP to be spec compliant. To specifically answer your question, browsers do so and the ws library client does so automatically too.

backbone87 commented 3 years ago

I think being able to ensure that the connection satisfies an "SLA" (interval, timeout, latency, "stability" etc) where the client can detect violations is a useful feature for many applications especially SPAs.

While the WebSocket RFC allows client-initiated pings, the HTML WebSocket API does not require exposure of any way to declare such SLAs and leaves client-initiated pings entirely to the discretion of the user-agent. https://html.spec.whatwg.org/multipage/web-sockets.html#ping-and-pong-frames

The WebSocket protocol defines Ping and Pong frames that can be used for keep-alive, heart-beats, network status probing, latency instrumentation, and so forth. These are not currently exposed in the API.

User agents may send ping and unsolicited pong frames as desired, for example in an attempt to maintain local network NAT mappings, to detect failed connections, or to display latency metrics to the user. User agents must not use pings or unsolicited pongs to aid the server; it is assumed that servers will solicit pongs whenever appropriate for the server's needs.

(edit: I also don't fully understand what "User agents must not use pings or unsolicited pongs to aid the server" should mean? "Aid" in the sense that the client should not "help" the server to detect connection liveliness by sending pings? But this does not exclude client-initiated pings in order to detect connection liveliness because the client needs this information?)

While we may -- and IMHO should -- get exposure within the HTML WebSocket API, it may take years until it's widely available. This leaves only the 2 places where this may be done: graphql-transport-ws or at application level. While GraphQL itself is transport agnostic, graphql-transport-ws explicitly deals with transport concerns. So this would be a good place to define transport-related "SLAs". It is also error-prone and requires quite a bit of boilerplate to implement this on the application level, especially since it can be avoided to ping the server on a currently "chatty" connection. Of course, in an ideal world, it would just delegate this to the HTML WebSocket API.

enisdenjo commented 3 years ago

Hey @backbone87, thanks for joining the discussion. Some of your concerns have already been discussed in this thread.

In order to reliably and immediately detect connection loss (socket liveliness), you cannot rely on application level ping/pongs since they are messages too and they will also be affected by OS/browser timeouts and flushed through a recovered connection... The only way to detect connection loss immediately, really, is to rely on online and offline events. If your intention is timing out a pong after a client initiated ping, this begs the question of how much different it actually is from the online and offline events. Also, is the increase of complexity in the lib worthwhile?

I still havent seen a real-life scenario where client initiated pings are really necessary for connection checks. Browsers are pretty reliable in invoking the close callbacks. Maybe you have an idea or a suggestion; or even better, a repro where a client ping can detect a connection loss more reliably than browser APIs? (Please dont include ​devices manually disconnecting from the internet as an example)

About the connection recovery and message flushing I mentioned, please refer to https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-784352250.

Furthermore, latency and ​"stability" are indeed good reasons for a client initiated ping/pong; my argument is that if you want it only because of the 2 aforementioned reasons - it is seems better fitted in user-land because you can design your own latency measurer through GraphQL and custom client usage. (Quick start here: https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-805256188)

User agents must not use pings or unsolicited pongs to aid the server; it is assumed that servers will solicit pongs whenever appropriate for the server's needs.

My interpretation is that doing server-side probing must be initiated by the server, not the client. Meaning, if the server is measuring latency, for example, it MUST be the one that issues the ping and calculates the pong wait time; the client should not dictate these measurements by issuing a ping and having the server do calculations on-demand.

Connection liveliness is something that both sides should handle. Server itself for the server-side and the mediator (ex. browser) for the client side. I am arguing that you should let the browser decide when the socket is unrecoverably gone and use browser APIs (like online/offline events) to reliably and quickly react on network failures as a side-effect. (Closing a WebSocket as soon as the browser goes offline is demonstrated here: https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-785261086)

backbone87 commented 3 years ago

Thank you for the thorough response, but it would not have been necessary since I read the thread and I did not want to steal your time. I just wanted to present the argument that connection liveliness & stability are concerns of the transport protocol and should be handled within, at best in a configurable manner ("SLAs") and that the graphql-transport-ws protocol already has a lot of information present (RTTs of queries, date of most recent server message), which it can reuse to provide a more "clean" approach compared to the application-level ping/pong.

If your intention is timing out a pong after a client initiated ping, this begs the question of how much different it actually is from the online and offline events. Also, is the increase of complexity in the lib worthwhile?

Ideally, Online/Offline events would be a part of this detection, but they only work if the client's internet gateway is gone, not if there are problems somewhere between the internet gateway and the server.

From a userland code perspective, I want to delegate transport-related SLAs to the transport protocols. As a GraphQL client I may not know about WebSockets themselves and definitely do not know about TCP since this is hidden away completely by the UA. What I do know as a GraphQL client is what the GraphQL transport protocols support (HTTP or graphql-transport-ws). They may in turn delegate SLA enforcement to lower-level transports themselves, but since the HTML WebSocket API does not expose client-initiated ping and does not have a concept of "SLA" at all, it most likely can not.

This may not be the strongest of arguments that this should be a "top-priority" issue, but its definitely a transport concern.

enisdenjo commented 3 years ago

You're right, it is indeed a transport concern... One thing I dont like is overcomplicating things, which is why I am so "hard" to sway. There must be a good reason why browsers dont allow sending pings, and I want to know the "why?".

You're also right about the fact that client (browser in this case) initiated pings are not on the HTML standard roadmap and will most probably never be (or a long time will pass until proper adaptation and support).

If we decide to support client initiated pings, there are (at least) two things we have to worry about now, and I welcome you all to put your 2 cents in:

backbone87 commented 3 years ago

I am so "hard" to sway.

I think that is a good thing!

Regarding the design:

Take my design comments with a grain of salt, since I (like many I guess) do not use the GraphQL WS client directly, but instead consume it via other clients like Apollo.

edit:

There must be a good reason why browsers don't allow sending pings, and I want to know the "why?".

This is a really good question. Simpler interface to increase adoption speed? Maybe there are discussion histories available on the HTML spec or a question could be asked to someone involved in the HTML WebSockets spec?

edit2:

edit3: Seems like that the general tenor of browser implementors is to not touch the WebSocket API anymore and instead push WebTransport forward (which is probably 1-2 years away from broad adoption):

enisdenjo commented 3 years ago

Changing the subprotocol would be too much, the protocol is anyway still a PR over at graphql-over-http, which makes it fine to keep the way it is.

I think Ill just make it a breaking change. As you suggested, the client will be the one that issues the pings messages. Since most people use graphql-ws exclusively for the client-side, this will be OK. The migration steps will suggest that you have to upgrade the server to v5 for them to work properly together.

However, I also recon that I will leave it disabled anyway even for the major bump, since these client side pings are a niche case (as this thread suggests, and as all the reasons for not having it in the HTML standard).

About the implementations, thank you for the suggestions. Ill think about it and create something minimal, but very extensible.

enisdenjo commented 3 years ago

I'd just like to add that, from the browsers side, the reason why there is no such API is exactly because the browser is the one that makes sure the connection is healthy. There are numerous threads I read through (probably referenced from this thread somewhere too) where you can see devs arguing this exact same thing.

However, there is a catch, sometimes the WS ping-pongs are not even available from the server side. Like the AWS API Gateway for example (https://github.com/andyrichardson/subscriptionless/issues/3). This makes it enough of an argument to proceed with a subprotocol ping/pong message type.

enisdenjo commented 3 years ago

I think https://github.com/enisdenjo/graphql-ws/pull/201/commits/e43154f8d56d2587c4817e70b1e56b5da45b2c04 is just enough. The only thing that the client configures is the keep-alive timeout (when to issue a ping), all other logic is up to the implementor.

For example, a close-on-timeout + latency metrics implementation would look like this:

import { createClient } from 'graphql-ws';

let activeSocket,
  timedOut,
  pingSentAt = 0,
  latency = 0;
createClient({
  url: 'ws://i.time.out:4000/and-measure/latency',
  keepAlive: 10_000, // ping server every 10 seconds
  on: {
    connected: (socket) => (activeSocket = socket),
    ping: (received) => {
      if (!received /* sent */) {
        pingSentAt = Date.now();
        timedOut = setTimeout(() => {
          if (activeSocket.readyState === WebSocket.OPEN)
            activeSocket.close(4408, 'Request Timeout');
        }, 5_000); // wait 5 seconds for the pong and then close the connection
      }
    },
    pong: (received) => {
      if (received) {
        latency = Date.now() - pingSentAt;
        clearTimeout(timedOut); // pong is received, clear connection close timeout
      }
    },
  },
});

@michelepra, @andyrichardson, @pranaypratyush, @asmeikal, @idevelop, @backbone87 would love to hear your thoughts on this. πŸ˜„

enisdenjo commented 3 years ago

:tada: This issue has been resolved in version 5.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

enisdenjo commented 3 years ago

Summary behind the decision of adding the ping/pongs in the Protocol: https://github.com/graphql/graphql-over-http/pull/140#issuecomment-857452460.

michelepra commented 3 years ago

:love_you_gesture: a small step has been taken, i have some points to clarify:

  1. client ping must be send according to application logic, with your implementation this is out of client control. keepAlive parameter is only one point of this logic
  2. why you don't consider optional payload like ping/pong websocket protocol?
  3. message event must be emitted after check message type is not ping/pong because this type have specific event handler
enisdenjo commented 3 years ago

Hey hey, straight to the point:

  1. The idea behind the keep-alive is sending it on an interval. I felt like having something like client.ping is an overkill (for now at least). Do you have an example usage in mind?
  2. Keeping it simply, I dont think passing a payload through the ping and pong message is really necessary. The main use case is connection stability, latency measurement and, in general, network probing. Still, suggested example usage?
  3. As stated in the documentation: the message event is called for all valid messages received by the client. Mainly useful for debugging and logging received messages.
michelepra commented 3 years ago
  1. The idea behind the keep-alive is sending it on an interval. I felt like having something like client.ping is an overkill (for now at least). Do you have an example usage in mind?

i can fancy some example

  1. Keeping it simply, I dont think passing a payload through the ping and pong message is really necessary. The main use case is connection stability, latency measurement and, in general, network probing. Still, suggested example usage?

Most easy example is for logging purpose, server can logging receiving ping X from client Y Other example can be in client by sending ping with payload containing progression number, If all is ok (client <- network -> server) i expect to receive pong with this number after ping was sent, If network is unstable message is sent as soon network going on (as you correctly report). With payload i can do more sophisticated thing than reporting "receiving N consecutive pong"

  1. As stated in the documentation: the message event is called for all valid messages received by the client. Mainly useful for debugging and logging received messages.

Ok, thanks for explanation

enisdenjo commented 3 years ago

i can fancy some example

  • monitor idle connection, where client ping must be sent after N milliseconds of receiving last message from websocket, because i can correctly assume that if i receive messages the socket is active.
  • Can also be browser limitations on some condition
  • ping keepAlive interval can be variable by some logic (example after do http request to other place)
  • sending ping can be enable/disable by some conditions

Awesome! Makes perfect sense. ~I'll introduce a client.ping soon.~ Please read: https://github.com/enisdenjo/graphql-ws/issues/117#issuecomment-857590817.

Most easy example is for logging purpose, server can logging receiving ping X from client Y Other example can be in client by sending ping with payload containing progression number, If all is ok (client <- network -> server) i expect to receive pong with this number after ping was sent, If network is unstable message is sent as soon network going on (as you correctly report). With payload i can do more sophisticated thing than reporting "receiving N consecutive pong"

Agreed. A payload does indeed make sense.

enisdenjo commented 3 years ago

@michelepra I realised that a manual (and super extensible) client.ping is already possible by using the socket directly (same for client.ping). I'd like to avoid baking it in if not absolutely necessary.

The benefit of this approach is that you can customise your pinger to fit your exact needs (async ping/pong, throw error on closed connection, integrated responses, logging, and much more...).

This is how a super simple manual pinger client would look like:

import {
  createClient,
  Client,
  ClientOptions,
  stringifyMessage,
  PingMessage,
  PongMessage,
  MessageType,
} from 'graphql-ws';

interface PingerClient extends Client {
  ping(payload?: PingMessage['payload']): void;
  pong(payload?: PongMessage['payload']): void;
}

function createPingerClient(options: ClientOptions): PingerClient {
  let activeSocket;

  const client = createClient({
    disablePong: true,
    ...options,
    on: {
      connected: (socket) => {
        options.on?.connected?.(socket);
        activeSocket = socket;
      },
    },
  });

  return {
    ...client,
    ping: (payload) => {
      if (activeSocket.readyState === WebSocket.OPEN)
        activeSocket.send(
          stringifyMessage({
            type: MessageType.Ping,
            payload,
          }),
        );
    },
    pong: (payload) => {
      if (activeSocket.readyState === WebSocket.OPEN)
        activeSocket.send(
          stringifyMessage({
            type: MessageType.Pong,
            payload,
          }),
        );
    },
  };
}

However, what the client needs is a disablePong option so that you can respond manually when necessary and optionally attach a payload.

michelepra commented 3 years ago

looks nice, i think this can cover almost all use cases

enisdenjo commented 3 years ago

:tada: This issue has been resolved in version 5.1.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

enisdenjo commented 3 years ago

Both optional ping/pong payloads and the disablePong client option, together with a recipe of a custom pinger, is available starting from v5.1.0.

michelepra commented 3 years ago

@enisdenjo server don't send payload, see fix

enisdenjo commented 3 years ago

Hmm, I didnt realise you wanted the server to pong with the same payload as the ping. My original thinking was that you would implement the server response.

But, I think that the default behaviour for having the payload do a round trip makes sense. Let me see what I can do.

enisdenjo commented 3 years ago

:tada: This issue has been resolved in version 5.1.1 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

enisdenjo commented 3 years ago

v5.1.1 returns ping's payload through the response pong.