Open big-andy-coates opened 4 years ago
I think terminates are useful in the cases of long-running INSERT INTO. You don't necessarily want to drop the source, but you want the INSERT INTO to stop populating into it.
Thinking on this more, I don't think we need to explicitly terminate a query. Our CREATE TABLE AS SELECT style statements are akin to materialized views (MVs) in the RDBMS world. In an RDBMS there is no concept of a persistent query exposed to the user. Instead, when you drop an MV, any background 'process' that is updating it is automatically stopped.
The only thorn in our side is the INSERT INTO query which, as @almog points out, may still benefit from TERMINATE. However, there is an alternative... we remove INSERT INTO!
INSERT INTO is the black sheep of the family. It outputs a persistent query to an existing sink that's been created some other way. It was added to allow multiple queries to be started that all write to the same sink topic. However, I think this would better be represented using a SQL UNION, or more correctly a UNION ALL, e.g.
CREATE STREAM OUTPUT (...) WITH (...);
INSERT INTO OUTPUT SELECT * FROM SOURCE1 ...;
INSERT INTO OUTPUT SELECT * FROM SOURCE2 ...;
Becomes:
CREATE STREAM OUTPUT AS
SELECT * FROM SOURCE1 ...
UNION ALL
SELECT * FROM SOURCE2 ...;
Which once again brings us to a 1-to-1 relationship between persistent query and MV. So now when the user drops OUTPUT we can stop the persistent query, and we no longer need TERMINATE.
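To make the 1-to-1 idea concrete, here is a minimal Python sketch of a catalog in which each source owns exactly one persistent query, so dropping the source is the only way to stop it. The `Catalog` and `PersistentQuery` classes and the `CSAS_` id prefix are hypothetical and for illustration only; this is not how ksqlDB's metastore is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PersistentQuery:
    """Hypothetical stand-in for the background query that populates a source."""
    query_id: str
    running: bool = True

    def stop(self) -> None:
        self.running = False


@dataclass
class Catalog:
    """Hypothetical catalog: one source name maps to exactly one query."""
    queries: Dict[str, PersistentQuery] = field(default_factory=dict)

    def create_as_select(self, name: str) -> PersistentQuery:
        # A second query for the same source is rejected, which is what
        # keeps the source-to-query relationship 1-to-1.
        if name in self.queries:
            raise ValueError(f"{name} already exists")
        query = PersistentQuery(query_id=f"CSAS_{name}")
        self.queries[name] = query
        return query

    def drop(self, name: str) -> PersistentQuery:
        # Dropping the source stops its query; no separate TERMINATE is needed.
        query = self.queries.pop(name)
        query.stop()
        return query
```

With INSERT INTO in the picture the mapping would be one-to-many, and `drop` would have to decide what to do with every query writing into the source.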
An added benefit of UNION ALL would be deterministic output ordering with flow control. In contrast the record ordering with INSERT INTO depends on the starting point of the queries and the speed of the consumers.
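A small Python sketch of the ordering difference, under a loud assumption: it takes "deterministic output ordering" to mean merging the inputs by record timestamp, which this thread does not spell out. The record shape, function names, and the `schedule` parameter are all hypothetical.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple

Record = Tuple[int, str]  # (timestamp, value); hypothetical record shape


def union_all_by_timestamp(*sources: Iterable[Record]) -> List[Record]:
    """Merge time-ordered sources into one time-ordered output.

    Assumption: a single UNION ALL query can pick the next record by
    timestamp because it reads all inputs itself.
    """
    return list(heapq.merge(*sources, key=lambda record: record[0]))


def insert_into_interleave(sources: Sequence[Iterable[Record]],
                           schedule: Sequence[int]) -> List[Record]:
    """Model independent INSERT INTO queries writing to one sink.

    `schedule` lists which query gets to write next, standing in for
    start time and consumer speed; the output order simply follows it.
    """
    iterators = [iter(source) for source in sources]
    return [next(iterators[index]) for index in schedule]


source1 = [(1, "a1"), (4, "a2"), (6, "a3")]
source2 = [(2, "b1"), (3, "b2"), (5, "b3")]

merged = union_all_by_timestamp(source1, source2)
run_a = insert_into_interleave([source1, source2], [0, 0, 0, 1, 1, 1])
run_b = insert_into_interleave([source1, source2], [1, 0, 1, 0, 1, 0])
```

Here `merged` is the same on every run, while `run_a` and `run_b` contain the same records in different orders because only the schedule changed.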
+1 from me on removing TERMINATE.
@agavra raises a good point about TERMINATE being useful for long-running INSERT statements, although I'm not sure that this is something that belongs in the syntax. This is more of an operational function, which I think is best served by function calls on catalog data. For example, here's how you can kill a query in Postgres:
-- pg_stat_activity tracks queries that are currently running
SELECT pg_terminate_backend(pid) FROM pg_stat_activity;
Another possible use case for terminate would be to leave a table around for pull queries (I know we can't query tables directly yet, but surely we plan to at some point), but terminate the queries that populate it (e.g. it's like a snapshot table).
Do other streaming systems not support some form of INSERT INTO? I'm surprised since without it the graph of relationships between sources (streams/tables) must always be acyclic and I imagine there are use cases where having some sort of cyclic control flow makes sense. (Will have to think harder to come up with a concrete example, will report back if/when I do.)
Another possible use case for terminate would be to leave a table around for pull queries (I know we can't query tables directly yet, but surely we plan to at some point), but terminate the queries that populate it (e.g. it's like a snapshot table).
I was thinking about this and I think we can handle this with a query upgrade. We simply upgrade the source to have no query associated with it (but keep the DDL).
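A rough Python sketch of what such an upgrade could look like, assuming a source is a DDL definition plus an optional populating query. The `Source` and `Query` types and the `upgrade_without_query` helper are hypothetical, not an existing ksqlDB API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Hypothetical persistent query that populates a source."""
    sql: str
    running: bool = True


@dataclass
class Source:
    """Hypothetical source: its DDL plus, optionally, a populating query."""
    name: str
    ddl: str
    query: Optional[Query] = None


def upgrade_without_query(source: Source) -> Source:
    """Return the upgraded source: same name and DDL, no query attached.

    The old query is stopped as part of the upgrade, so the table stays
    around as a snapshot that nothing populates any more.
    """
    if source.query is not None:
        source.query.running = False
    return Source(name=source.name, ddl=source.ddl, query=None)


table = Source(
    name="SNAPSHOT",
    ddl="CREATE TABLE SNAPSHOT (...) WITH (...);",
    query=Query(sql="SELECT ... FROM SOURCE1 ...;"),
)
snapshot = upgrade_without_query(table)
```

After the call, `snapshot` keeps the original DDL but has no query, which is the snapshot-table case described above without needing a TERMINATE statement.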
I imagine there are use cases where having some sort of cyclic control flow makes sense
That's pretty trippy @vcrfxia - let me know if you come up with anything!
With reference to the PR that introduced TERMINATE ALL syntax: we might look to remove the concept of terminating queries altogether. We offer no way of restarting a terminated query, so why offer a way to terminate?
If we choose to introduce a way to restart a query, then terminating actually means something. Restarting would be useful for re-kicking a failed query. However, there are probably better ways of handling failed queries. After all, a traditional db does not expose the state of the processing used to build a materialized view.
Removing terminate would address issues such as: