Closed frankmcsherry closed 1 year ago
Unfortunately, it doesn't fix the operator leak from running the TPC-H loadgen plus a materialized view on query 14. This is the situation after drop materialized view q14; and waiting 30s:
and this is mz_dataflow_operator_dataflows:
materialize=> set database to tpch;
SET
materialize=> drop materialized view q14;
DROP MATERIALIZED VIEW
materialize=> select * from mz_internal.mz_dataflow_operator_dataflows;
 id  |                   name                   | worker_id | dataflow_id |   dataflow_name
-----+------------------------------------------+-----------+-------------+-------------------
 333 | Map                                      |         0 |         188 | Dataflow: 2.6.q14
 329 | FlatMap                                  |         0 |         188 | Dataflow: 2.6.q14
 331 | Exchange                                 |         0 |         188 | Dataflow: 2.6.q14
 326 | InspectBatch                             |         0 |         188 | Dataflow: 2.6.q14
 338 | InspectBatch                             |         0 |         188 | Dataflow: 2.6.q14
 345 | Dataflow: 2.6.q14                        |         0 |         188 | Dataflow: 2.6.q14
 188 | Dataflow: 2.6.q14                        |         0 |         188 | Dataflow: 2.6.q14
 328 | persist_sink u11 write_batches           |         0 |         188 | Dataflow: 2.6.q14
 340 | persist_sink u11 append_batches          |         0 |         188 | Dataflow: 2.6.q14
 323 | persist_sink u11 mint_batch_descriptions |         0 |         188 | Dataflow: 2.6.q14
(10 rows)
materialize=> show materialized views;
name | cluster
------+---------
(0 rows)
materialize=>
Ah, yes, this wasn't meant to fix that for certain. It does fix a leak for e.g. simple.rs, and I'm happy to crack open the same example in MZ (the one based on generate_series(0, large)).
Great, that appears to have the desired outcome, as we think that the remaining operators are the ones that haven't shut down, for whatever reason.
Operator shutdown was previously pretty loose, and was only tested in response to operator activation. However, the conditions for shutdown can change without prompting an activation, e.g. if a frontier becomes empty or a final capability is dropped. This meant that operators that should have been shut down would instead linger until the dataflow itself was shut down.
This PR adds the shutdown test at the point where progress information is pushed to operators, in order to better clean up operators mid-dataflow.
NB: Failing to shut down an operator should not have resulted in non-termination, unless operators were relying on dropping their state to signal something of consequence outward. All progress information would still be correct, and all downstream operators would receive correct frontiers.
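To make the difference concrete, here is a toy sketch of the two policies. It is not timely's actual code: the `Operator` struct, its fields, and the `check_on_progress` flag are all hypothetical names invented for illustration, and the shutdown condition (no open input frontiers and no held capabilities) is a simplification of the conditions described above.

```rust
/// Toy model of one operator's progress state.
/// Hypothetical type for illustration; these are not timely's real structures.
struct Operator {
    /// Number of inputs whose frontier is not yet empty.
    open_inputs: usize,
    /// Number of capabilities the operator still holds.
    held_capabilities: usize,
    /// Whether the operator's state is still allocated.
    alive: bool,
}

impl Operator {
    fn new(open_inputs: usize, held_capabilities: usize) -> Self {
        Operator { open_inputs, held_capabilities, alive: true }
    }

    /// Simplified shutdown condition: nothing more can arrive,
    /// and nothing more can be produced.
    fn can_shut_down(&self) -> bool {
        self.open_inputs == 0 && self.held_capabilities == 0
    }

    /// Drops the operator's state if the shutdown condition holds.
    fn shut_down_if_done(&mut self) {
        if self.alive && self.can_shut_down() {
            self.alive = false;
        }
    }

    /// Scheduling the operator; both policies test for shutdown here.
    fn activate(&mut self) {
        // (operator logic would run here)
        self.shut_down_if_done();
    }

    /// Applies a progress update. With `check_on_progress` set (assumed to
    /// model the behavior this PR describes), the shutdown test also runs
    /// here, so no activation is needed to reclaim the operator.
    fn push_progress(&mut self, closed_inputs: usize, dropped_caps: usize, check_on_progress: bool) {
        self.open_inputs = self.open_inputs.saturating_sub(closed_inputs);
        self.held_capabilities = self.held_capabilities.saturating_sub(dropped_caps);
        if check_on_progress {
            self.shut_down_if_done();
        }
    }

    fn is_alive(&self) -> bool {
        self.alive
    }
}
```

In this model, an operator created with `Operator::new(1, 1)` that receives `push_progress(1, 1, false)` stays alive until something calls `activate()`, which may never happen; with `push_progress(1, 1, true)` it is reclaimed immediately. Either way the progress counts themselves end up identical, which matches the note that downstream frontiers were always correct.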