confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

Allow push queries to be streamed for more than ten minutes at a time #6970

Closed vcrfxia closed 3 years ago

vcrfxia commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, a single push query may be streamed for a maximum of ten minutes at a time, after which the server closes the connection. ksqlDB should provide a way for users to stream push queries for longer if desired, while still not allowing disconnected clients to hold server resources for unreasonable amounts of time.

Describe the solution you'd like

It may make sense to allow users to specify that certain queries should have longer timeouts.

Details here need to be thought out more. There are two places today where the 10-minute timeout is enforced on the server side.

Additional context

Example community request for this: https://groups.google.com/g/ksql-users/c/iB6mo1tHgn0/m/dEzpEDHwAAAJ

purplefox commented 3 years ago

A common approach with clients to prevent idle connections from being timed out is to send a "ping" request (which does nothing) every few seconds.
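As a sketch of what such a client-side heartbeat could look like (the `sendPing` body is a placeholder; a real client would issue whatever lightweight request the API offers, e.g. a server-info call, and discard the response):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical client-side heartbeat: fire a cheap no-op request on a fixed
// schedule so the server's idle-connection timer never expires.
class PingKeepAlive {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger pingsSent = new AtomicInteger();

    // Placeholder for a real lightweight request; here we only count calls.
    void sendPing() {
        pingsSent.incrementAndGet();
    }

    void start(long intervalMillis) {
        scheduler.scheduleAtFixedRate(
                this::sendPing, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The scheduler thread keeps firing even while the application thread is blocked reading query results, which is exactly the situation where the connection would otherwise look idle.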

vcrfxia commented 3 years ago

A workaround that can be used in the meantime (courtesy of @AlexCasdu): https://groups.google.com/g/ksql-users/c/iB6mo1tHgn0/m/ARfID-RoAgAJ

agavra commented 3 years ago

Moving to 0.19, we'll need to tackle this as part of #7203

vvcephei commented 3 years ago

As far as I can tell from the code, the WRITE_TIMEOUT_MS is probably not involved here. That property looks like it's only used to put an upper bound on how long to wait for the client to read our buffered responses. As long as the reader is consuming responses, we shouldn't hit this timeout.

Based on the reports and the proposed workaround, the problem just seems to be the idle timeout.

We can certainly add a heartbeat mechanism to the Java client, but that doesn't solve the general problem for users of the server APIs. Each API caller will have to set up their own heartbeat mechanism.

Alternatively, we could just increase the server's idle connection timeout, possibly to a large number. However, this defeats the server's ability to close connections that aren't being used. Maybe this is ok; if clients actually hang up, the socket itself will close, closing the connection, so this is only for clients that are still holding an open socket but not using it.

If we think that the number of client connections to the ksqlDB server is limited, then perhaps closing idle connections doesn't actually provide much value. From my armchair, it seems like the number of ksqlDB client connections should be limited. This isn't like a public web server; only internal hosts should be connecting to a database.

Another approach is to let the server send "pings" back over the response channel when it's working on a long-running response. For example, when handling a request, the server could set up a timer to send a null byte back over the response channel once a minute or so. That way, truly idle connections (no requests) will still get closed, while active connections that have slow responses or low data volume will stay open. The downside of this approach is that we have to pick a "ping" datum that doesn't break the response itself. E.g., is it actually ok to insert random null bytes into a JSON response stream? If not, then what data, if any, is ok to insert?
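One way to sidestep the "which ping datum is safe" question is newline-delimited JSON framing, where a bare newline between records is ignorable by the consumer. A minimal sketch of that idea (timestamps are passed in explicitly to keep it deterministic; the class and method names are mine, not ksqlDB's):

```java
// Hypothetical server-side pinger for a newline-delimited JSON response
// stream: if nothing has been written for pingIntervalMillis, emit a bare
// '\n', which a JSON-lines consumer can safely skip.
class ResponsePinger {
    private final long pingIntervalMillis;
    private long lastWriteMillis;

    ResponsePinger(long pingIntervalMillis, long nowMillis) {
        this.pingIntervalMillis = pingIntervalMillis;
        this.lastWriteMillis = nowMillis;
    }

    // Called whenever a real record is written to the response.
    void recordWrite(long nowMillis) {
        lastWriteMillis = nowMillis;
    }

    // Called periodically by a timer; returns the bytes to write (possibly none).
    byte[] maybePing(long nowMillis) {
        if (nowMillis - lastWriteMillis >= pingIntervalMillis) {
            lastWriteMillis = nowMillis;
            return new byte[] {'\n'};
        }
        return new byte[0];
    }
}
```

Note the ping clock resets on every real write, so a query that is already streaming data generates no extra bytes at all.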

Given the relative ROI of the approaches and how well understood the downsides are, I'm leaning toward a larger idle timeout: for example, 24 hours, possibly with a server config property so people can set it to anything they like. Any objections?
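If the config route is taken, the shape would presumably be a single entry in `ksql-server.properties`; the property name and value below are illustrative only, not confirmed against any release:

```properties
# Illustrative only: close connections with no traffic after 24 hours
# instead of the current 10 minutes. Check the actual property name in
# the release that ships the change.
ksql.idle.connection.timeout.seconds=86400
```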

agavra commented 3 years ago

If we think that the number of client connections to the ksqlDB server is limited, then perhaps closing idle connections doesn't actually provide much value.

This assumption is probably valid today. What would be a number of clients large enough to actually cause an issue here? I'm wondering whether, if we start supporting IoT-style use cases with push queries, thousands of clients will be an issue (cc @AlanConfluent).

Otherwise, I'm fine with the suggestion; increasing the idle timeout seems like a high-ROI change to me.

vvcephei commented 3 years ago

Thanks, @agavra.

I should clarify my statement: I don't think it's a good idea to open up any database for connections from the internet, regardless of the use case. For example, an IoT setup that distributes database credentials to all the devices and then lets them directly connect to the database seems dubiously wise to me.

I'd suggest such a system set up a web server with the appropriate technology to accept connections from the IoT fleet, and then itself form a connection to the database for persistence/queries. That's also not a trivial job, but there are managed offerings specifically for that use case (such as AWS API Gateway).

I think it could be a cool offering if we wanted to take KSQL in a Firebase-like direction, but there's a huge pile of work we'd have to tackle to get there, so I think that for now we can just lean on the assumption that ksqlDB sits in an application architecture in a similar place as a traditional database.

agavra commented 3 years ago

yeah that makes sense, don't know what I was thinking. +1 to the suggestion

vvcephei commented 3 years ago

Follow-up: while testing my proposed fix, I noticed that we actually do write a newline character about once a second when you're using the HTTP/1.1 endpoint. The purpose is actually to verify that the connection is still open (see io.confluent.ksql.rest.server.resources.streaming.QueryStreamWriter#write(java.io.OutputStream)), but a side effect is that we will also keep the connection from becoming idle.

So, it seems like this issue only affects the HTTP2 API. I spent some time digging around in the code, but the async code in there makes it hard to find a good place to add a similar "ping" operation. We should also consider that we need not only to keep alive the client/server connection, but also the server/server connections within the cluster. (Although, it's likely that the nodes in the cluster will wind up chatting with each other frequently enough that we won't time out in practice.)

The most general place to insert these pings would be at the actual data source. For transient queries, this is the forEach operation where we copy each in-flight record into the blocking queue, which ultimately gets flushed into the async pipeline. This is in io.confluent.ksql.query.QueryExecutor#buildTransientQueryQueue. Instead of using forEach, we could use a ValueTransformer with a registered wall-clock punctuation (say, once a minute) to send back a "ping".
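A stdlib stand-in for that idea can illustrate the mechanics: a scheduled task plays the role of the Kafka Streams wall-clock punctuator, and a sentinel string stands in for a real ping row. All names here are hypothetical, not ksqlDB's:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the transient-query queue with a wall-clock "ping" injector:
// real rows and periodic sentinel rows share one queue, so the downstream
// async pipeline always sees traffic even when the query produces no data.
class TransientQueryQueue {
    static final String PING = "__ping__";  // hypothetical sentinel row

    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    // Stand-in for registering a wall-clock punctuation.
    void start(long pingIntervalMillis) {
        timer.scheduleAtFixedRate(() -> queue.offer(PING),
                pingIntervalMillis, pingIntervalMillis, TimeUnit.MILLISECONDS);
    }

    // Stand-in for the forEach path that enqueues in-flight records.
    void accept(String row) {
        queue.offer(row);
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

The consumer side would then filter out the sentinel (or translate it to a safe framing byte) before serializing rows to the client.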

Here is a poc of this approach: https://github.com/confluentinc/ksql/pull/7555

vvcephei commented 3 years ago

Ok, here's the config PR: https://github.com/confluentinc/ksql/pull/7556

@agavra (and maybe @AlanConfluent ), now that there is a POC of the "ping" approach (#7555), what's your sense of whether we really want the config, or whether we should double down on the ping approach?

vvcephei commented 3 years ago

Fixed via https://github.com/confluentinc/ksql/pull/7556. This should be released in 0.19.

We'll look into pursuing #7555 on its own merits later, but just increasing the idle timeout should help for now.

purplefox commented 3 years ago

Maybe this is ok; if clients actually hang up, the socket itself will close, closing the connection, so this is only for clients that are still holding an open socket but not using it.

This is not really true. TCP is a reliable protocol and will not close a connection just because the two ends stop being able to communicate; TCP assumes this might be a transient network issue it can recover from. If the client machine suddenly disappears (the OS hard-crashes, the machine loses power, or someone pulls the network cable), the server has no way of distinguishing that from a transient network failure, and the connection will not be closed until it eventually times out.

This is why pinging, or a short idle timeout, is important to prevent lots of dead connections hanging around. (Failure to detect dead connections can also be exploited by bad clients to mount denial-of-service attacks.) Most HTTP servers, for example, have very short idle timeouts (seconds, or maybe a small number of minutes). That's not appropriate for long-lived connections such as push queries, so pinging seems the only sensible option imho.
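The OS does offer TCP keepalive as a partial mitigation, but with default settings it is far too slow to substitute for application-level pings. A minimal sketch of enabling it in Java:

```java
import java.net.Socket;
import java.net.SocketException;

// OS-level TCP keepalive (SO_KEEPALIVE) will eventually detect a vanished
// peer, but the default probe settings are slow (commonly two hours before
// the first probe on Linux), so it does not replace application-level pings
// or a deliberate idle timeout for timely resource cleanup.
class KeepAliveDemo {
    static Socket withKeepAlive(Socket socket) throws SocketException {
        socket.setKeepAlive(true);  // ask the kernel to probe an idle peer
        return socket;
    }
}
```

Tuning the probe intervals is possible but is an OS-level setting on most platforms, another reason application-level pings are the more portable answer.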