TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License

Log frontiers in progress tracking #539

Open teskje opened 9 months ago

teskje commented 9 months ago

This PR adds logging of frontiers to the reachability tracker. This is in support of hydration progress tracking in Materialize, which needs the per-operator frontiers to determine which operators have finished processing their snapshot inputs.
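To illustrate the kind of check this enables (a hedged sketch, not Materialize's actual code; the helper name and the strict comparison against a single snapshot timestamp are assumptions): an operator would count as hydrated once its reported frontier has moved past the snapshot time.

use timely::progress::Antichain;

// Hypothetical helper: an operator has finished processing its snapshot
// input once every element of its frontier is beyond the snapshot time.
// (An empty frontier means the operator is done entirely.)
fn has_hydrated(operator_frontier: &Antichain<u64>, snapshot_time: u64) -> bool {
    operator_frontier.elements().iter().all(|t| *t > snapshot_time)
}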

Look at the commit messages for more details. The actual logging change is in the second commit; commits 1 and 3 are supporting changes that try to limit the performance impact of the additional logging.

The performance impact of this change should be negligible (or even negative!) when reachability logging is disabled and small when reachability logging is enabled. I'm including microbenchmark results in a comment below.

The reason that commit 3 makes the code more performant is, I think, that it makes it possible to allocate the vec that's passed to log_frontier_updates with the correct size upfront, rather than having to geometrically grow two vecs of unknown final size. We can probably get a similar effect without changing how TrackerEvent looks, by counting the number of target and source changes while pushing to pushed_changes. I like the TrackerEvent change because it removes some duplication, but it's not a required one.
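For illustration, here is a minimal sketch of that counting alternative; the field and method names (pushed_changes, log_frontier_updates) follow the discussion above, but the types are placeholders rather than timely's real internals.

// Hypothetical sketch: count target updates as they are pushed, so the
// logging path can size its buffers exactly.
enum Update {
    Target(usize, u64, i64), // placeholder payload: (port, time, diff)
    Source(usize, u64, i64),
}

#[derive(Default)]
struct Tracker {
    pushed_changes: Vec<Update>,
    target_count: usize, // number of Target entries in pushed_changes
}

impl Tracker {
    fn push_change(&mut self, update: Update) {
        // Counting while pushing is cheap and lets log_frontier_updates
        // allocate exact-size vecs instead of growing them geometrically.
        if matches!(update, Update::Target(..)) {
            self.target_count += 1;
        }
        self.pushed_changes.push(update);
    }

    fn log_frontier_updates(&self) {
        let mut targets = Vec::with_capacity(self.target_count);
        let mut sources = Vec::with_capacity(self.pushed_changes.len() - self.target_count);
        for update in &self.pushed_changes {
            match update {
                Update::Target(..) => targets.push(update),
                Update::Source(..) => sources.push(update),
            }
        }
        // ... hand `targets` and `sources` to the logger ...
        let _ = (targets, sources);
    }
}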

teskje commented 9 months ago

Benchmark Results

To test the performance impact of this change, I ran event_driven.rs with 1000 dataflows of depth 1000 each for 100k rounds, and measured the time to completion.

The test was performed in two configurations: with reachability logging enabled and with reachability logging disabled.
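For context, reachability logging is enabled by registering a logger for the reachability tracker's events. The sketch below shows how I believe the "logging enabled" configuration is set up; the logger name ("timely/reachability") and the registration API are from memory and may differ between timely versions.

use timely::progress::reachability::logging::TrackerEvent;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Registering a logger under "timely/reachability" is what switches
        // reachability logging on for this worker (name and API assumed).
        worker.log_register().insert::<TrackerEvent, _>(
            "timely/reachability",
            |_time, data| {
                for (time, _worker, event) in data.drain(..) {
                    println!("{:?}\t{:?}", time, event);
                }
            },
        );

        // ... build and run the benchmark dataflows as event_driven.rs does ...
    })
    .unwrap();
}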

Below are the results of 10 runs each, for both of the configurations against the master branch and each of the three commits in this PR:

[Screenshot: benchmark timings, 2023-11-29 11:47:56]

[Screenshot: benchmark timings, 2023-11-29 11:48:25]

From the results we can see that in this benchmark the additional logging has no negative effect on runtime if logging is disabled, and has a small negative effect when logging is enabled. The latter is expected.

The former is perhaps expected as well, though there is some weirdness here: the results show that on master, disabling logging actually slows Timely down! I'm not sure why that is. The result occurs consistently when the benchmark is repeated at different times, so it doesn't seem to be caused by noise in the system. Maybe some edge case related to CPU caching?

teskje commented 9 months ago

I wired operator frontier logging up in Mz to verify that this is useful for tracking hydration progress, and it seems to be!

With a simple cross join dataflow:

CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 10000);

CREATE VIEW v AS SELECT t1.a + t2.a FROM t t1, t t2 WHERE t1.a < 1000000 AND t2.a > 0;
CREATE DEFAULT INDEX ON v;

We get this frontier output:

materialize=> SELECT id, min(time) as time
FROM mz_internal.mz_dataflow_operator_frontiers
JOIN mz_internal.mz_dataflow_addresses USING (address)
JOIN mz_internal.mz_dataflow_operators USING (id)
WHERE
  time IS NOT NULL AND
  list_length(address) = 3
GROUP BY id
ORDER BY time DESC, id ASC;
  id  |     time
------+---------------
 8587 | 1701260559660
 8592 | 1701260558660
 8594 | 1701260558660
 -- ...
 8636 | 1701260558660
 8658 | 1701260383161
 8659 | 1701260383161
 8661 | 1701260383161
 -- ...
 8674 | 1701260383161
(31 rows)

... which implicates operator 8658 as the one that's stuck. And sure enough, looking at the dataflow graph, this operator is the cross join.