confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
128 stars 1.04k forks

Consider removing the concept of terminating a query #3967

Open big-andy-coates opened 4 years ago

big-andy-coates commented 4 years ago

With reference to the PR that introduced TERMINATE ALL syntax.

We might look to remove the concept of terminating queries altogether. We offer no way of restarting a terminated query, so why offer a way to terminate?

If we choose to introduce a way to restart a query, then terminating actually means something. Restarting would be useful for re-kicking a failed query. However, there are probably better ways of handling failed queries. After all, a traditional db does not expose the state of the processing used to build a materialized view.

Removing terminate would address issues such as:

agavra commented 4 years ago

I think terminates are useful in the cases of long-running INSERT INTO. You don't necessarily want to drop the source, but you want the INSERT INTO to stop populating into it.

big-andy-coates commented 4 years ago

Thinking on this more, I don't think we need to explicitly terminate a query. Our CREATE TABLE AS SELECT style statements are akin to materialized views (MVs) in the RDBMS world. An RDBMS exposes no concept of a persistent query to the user. Instead, when you drop an MV, any background 'process' that is updating it is automatically stopped.

The only thorn in our side is the INSERT INTO query which, as @almog points out, may still benefit from TERMINATE. However, there is an alternative... we remove INSERT INTO!

INSERT INTO is the black sheep of the family. It writes the output of a persistent query to an existing sink that's been created some other way. It was added to allow multiple queries to be started that all write to the same sink topic. However, I think this would be better represented using a SQL UNION, or more correctly a UNION ALL, e.g.

CREATE STREAM OUTPUT (...) WITH (...);
INSERT INTO OUTPUT SELECT * FROM SOURCE1 ...;
INSERT INTO OUTPUT SELECT * FROM SOURCE2 ...;

Becomes:

CREATE STREAM OUTPUT AS
   SELECT * FROM SOURCE1 ...
   UNION ALL
   SELECT * FROM SOURCE2 ...
   ;

Which once again brings us to a one-to-one relationship between persistent query and MV. So now when the user drops OUTPUT we can stop the persistent query, and we no longer need TERMINATE.
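To make the one-to-one idea concrete, here is a minimal toy model in Python. It is only a sketch: the `Catalog` and `PersistentQuery` classes, the query id format, and the method names are hypothetical and are not ksqlDB APIs. It assumes each derived source is populated by exactly one query, so dropping the source is enough to stop that query.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PersistentQuery:
    """Hypothetical stand-in for a background query that populates a source."""
    query_id: str
    sql: str
    running: bool = True

    def stop(self) -> None:
        self.running = False


@dataclass
class Catalog:
    """Toy catalog: each derived source name maps to the single query that feeds it."""
    queries: Dict[str, PersistentQuery] = field(default_factory=dict)

    def create_as_select(self, name: str, sql: str) -> PersistentQuery:
        if name in self.queries:
            raise ValueError(f"source {name} already exists")
        # The id format is invented for this sketch.
        query = PersistentQuery(query_id=f"query-for-{name}", sql=sql)
        self.queries[name] = query
        return query

    def drop(self, name: str) -> PersistentQuery:
        # Dropping the source stops its query; no separate TERMINATE step exists.
        query = self.queries.pop(name)
        query.stop()
        return query


catalog = Catalog()
query = catalog.create_as_select(
    "OUTPUT", "SELECT * FROM SOURCE1 UNION ALL SELECT * FROM SOURCE2"
)
dropped = catalog.drop("OUTPUT")
```

With INSERT INTO, the mapping would have to be one source to many queries, which is exactly what forces a separate way of addressing and stopping individual queries.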

PeterLindner commented 4 years ago

An added benefit of UNION ALL would be deterministic output ordering with flow control.

In contrast, the record ordering with INSERT INTO depends on the starting point of the queries and the speed of the consumers.
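The ordering difference can be illustrated with a small Python sketch. It assumes, hypothetically, that a UNION ALL query would read all inputs itself and always emit the pending record with the smallest timestamp; the thread does not say how it would actually be implemented. The records and function names below are made up for illustration.

```python
import heapq
from itertools import chain

# Hypothetical records: (event timestamp, payload). Each source is ordered by timestamp.
source1 = [(1, "a1"), (4, "a2"), (6, "a3")]
source2 = [(2, "b1"), (3, "b2"), (5, "b3")]


def merge_with_flow_control(*sources):
    """One query reads every input and always emits the earliest pending record."""
    return list(heapq.merge(*sources, key=lambda record: record[0]))


def independent_writers(*sources_in_arrival_order):
    """Separate queries writing to one sink: output order follows arrival order.

    Modelled as the extreme case where one writer finishes before the next starts.
    """
    return list(chain(*sources_in_arrival_order))


merged = merge_with_flow_control(source1, source2)
merged_swapped = merge_with_flow_control(source2, source1)
racy_a = independent_writers(source1, source2)
racy_b = independent_writers(source2, source1)
```

Here `merged` and `merged_swapped` are identical regardless of which source is listed first, while `racy_a` and `racy_b` differ depending on which writer happened to run first.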

derekjn commented 4 years ago

+1 from me on removing TERMINATE.

@agavra raises a good point about TERMINATE being useful for long-running INSERT statements, although I'm not sure that this is something that belongs in the syntax. This is more of an operational function, and I think those are best served by function calls on catalog data. For example, here's how you can kill a query in Postgres:

-- pg_stat_activity tracks queries that are currently running.
-- Without a WHERE clause, this signals every backend, so filter on pid in practice.
SELECT pg_terminate_backend(pid) FROM pg_stat_activity;

rodesai commented 4 years ago

Another possible use case for terminate would be to leave a table around for pull queries (I know we can't query tables directly yet, but surely we plan to at some point), but terminate the queries that populate it (e.g. it's like a snapshot table).

vcrfxia commented 4 years ago

Do other streaming systems not support some form of INSERT INTO? I'm surprised since without it the graph of relationships between sources (streams/tables) must always be acyclic and I imagine there are use cases where having some sort of cyclic control flow makes sense. (Will have to think harder to come up with a concrete example, will report back if/when I do.)

agavra commented 4 years ago

> Another possible use case for terminate would be to leave a table around for pull queries (I know we can't query tables directly yet, but surely we plan to at some point), but terminate the queries that populate it (e.g. it's like a snapshot table).

I was thinking about this and I think we can handle this with a query upgrade. We simply upgrade the source to have no query associated with it (but keep the DDL).

> I imagine there are use cases where having some sort of cyclic control flow makes sense

That's pretty trippy @vcrfxia - let me know if you come up with anything!