TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License

Activate operators that may want to shut down #488

Closed frankmcsherry closed 1 year ago

frankmcsherry commented 1 year ago

Operator shutdown was previously pretty loose, and happened only in response to operator activation. However, the conditions for shutdown can change without prompting an activation, for example if a frontier becomes empty or a final capability is dropped. This meant that operators that should have been shut down would instead linger until the dataflow itself was shut down.

This PR adds that shutdown test at the point where progress information is pushed to operators, and activates any operator that may now want to shut down, in order to better clean up operators mid-dataflow.

NB: Failing to shut down an operator should not have resulted in non-termination, unless operators were relying on dropping their state to signal something of consequence outward. All progress information would still be correct, and all downstream operators would receive correct frontiers.
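
As a rough illustration of the check described above, here is a minimal, self-contained sketch. It does not use timely's real types or scheduler: `Operator`, `Progress`, `push_progress`, and `run_activations` are hypothetical names invented for this example, and the shutdown condition (no held capabilities and all input frontiers empty) is an assumption based on the description in this PR, not a copy of the actual implementation.

```rust
/// Toy model of an operator as seen by a scheduler (hypothetical, not timely's types).
#[derive(Debug)]
struct Operator {
    name: &'static str,
    /// One frontier per input; an empty frontier means no more data can arrive.
    input_frontiers: Vec<Vec<u64>>,
    /// Number of capabilities the operator still holds.
    held_capabilities: usize,
    /// Whether the operator's state has been dropped.
    shut_down: bool,
}

impl Operator {
    /// Assumed shutdown condition: nothing more can arrive, nothing more can be sent.
    fn may_shut_down(&self) -> bool {
        self.held_capabilities == 0 && self.input_frontiers.iter().all(|f| f.is_empty())
    }
}

/// A progress change delivered to one operator.
enum Progress {
    /// Replace the frontier of the given input.
    Frontier { input: usize, frontier: Vec<u64> },
    /// The operator dropped one capability.
    CapabilityDropped,
}

/// Apply a progress update and, if the operator may now want to shut down,
/// queue it for activation even though nothing else asked for one.
fn push_progress(ops: &mut [Operator], index: usize, update: Progress, activations: &mut Vec<usize>) {
    let op = &mut ops[index];
    match update {
        Progress::Frontier { input, frontier } => op.input_frontiers[input] = frontier,
        Progress::CapabilityDropped => {
            op.held_capabilities = op.held_capabilities.saturating_sub(1)
        }
    }
    if !op.shut_down && op.may_shut_down() && !activations.contains(&index) {
        activations.push(index);
    }
}

/// Run queued activations; an activated operator that may shut down drops its state.
fn run_activations(ops: &mut [Operator], activations: &mut Vec<usize>) {
    for index in activations.drain(..) {
        if ops[index].may_shut_down() {
            ops[index].shut_down = true;
        }
    }
}

fn main() {
    let mut ops = vec![
        Operator { name: "Map", input_frontiers: vec![vec![0]], held_capabilities: 0, shut_down: false },
        Operator { name: "Source", input_frontiers: vec![], held_capabilities: 1, shut_down: false },
    ];
    let mut activations = Vec::new();

    // Neither update would previously have triggered an activation on its own.
    push_progress(&mut ops, 1, Progress::CapabilityDropped, &mut activations);
    push_progress(&mut ops, 0, Progress::Frontier { input: 0, frontier: vec![] }, &mut activations);

    run_activations(&mut ops, &mut activations);
    for op in &ops {
        println!("{}: shut_down = {}", op.name, op.shut_down);
    }
}
```

In this toy model, removing the activation step from `push_progress` leaves both operators alive until something else happens to activate them, which mirrors the lingering behavior described above.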

lluki commented 1 year ago

Unfortunately, it doesn't fix the operator leak seen when running TPC-H loadgen plus a materialized view of query 14. This is the situation after running drop materialized view q14; and waiting 30s:

[image: pr488]

and this is mz_dataflow_operator_dataflows:

materialize=> set database to tpch;
SET
materialize=> drop materialized view q14;
DROP MATERIALIZED VIEW
materialize=> select * from mz_internal.mz_dataflow_operator_dataflows;
 id  |                   name                   | worker_id | dataflow_id |   dataflow_name   
-----+------------------------------------------+-----------+-------------+-------------------
 333 | Map                                      | 0         | 188         | Dataflow: 2.6.q14
 329 | FlatMap                                  | 0         | 188         | Dataflow: 2.6.q14
 331 | Exchange                                 | 0         | 188         | Dataflow: 2.6.q14
 326 | InspectBatch                             | 0         | 188         | Dataflow: 2.6.q14
 338 | InspectBatch                             | 0         | 188         | Dataflow: 2.6.q14
 345 | Dataflow: 2.6.q14                        | 0         | 188         | Dataflow: 2.6.q14
 188 | Dataflow: 2.6.q14                        | 0         | 188         | Dataflow: 2.6.q14
 328 | persist_sink u11 write_batches           | 0         | 188         | Dataflow: 2.6.q14
 340 | persist_sink u11 append_batches          | 0         | 188         | Dataflow: 2.6.q14
 323 | persist_sink u11 mint_batch_descriptions | 0         | 188         | Dataflow: 2.6.q14
(10 rows)

materialize=> show materialized views;
 name | cluster 
------+---------
(0 rows)

materialize=>

frankmcsherry commented 1 year ago

Ah, yes, this wasn't certain to fix that. It does fix a leak for e.g. simple.rs, and I'm happy to crack open the same example in MZ (the one based on generate_series(0, large)).

lluki commented 1 year ago

List of operators during various stages of the TPC-H run: ops.txt

frankmcsherry commented 1 year ago

Great, that appears to have the desired outcome, as we think that the remaining operators are the ones that haven't shut down, for whatever reason.