Closed gatoWololo closed 2 years ago
That sounds like a great plan! If it is helpful, the timely logging infrastructure produces streams of scheduling events, message communication and receipt events, stuff like that. It can be helpful to determine what is on the critical path. For inspiration, maybe check out
https://github.com/MaterializeInc/materialize/tree/master/src/dataflow/src/logging
The `crdt` example is definitely a bit mysterious: if you slowly play out the updates, rather than in one batch, it takes quite a bit longer. My guess is that there is a nice big global aggregation that ends up doing and re-doing a bunch of work.
Thanks for the tips!
I'll look into it.
The reason for this seems to be that the program as written exhibits two iterative scopes that need to perform many thousands of iterations. The ancestor collapsing takes over 7000 iterations, and the "blank star" collapsing takes over 13000 iterations. The control flow aspects of these iterations take some time (tens of microseconds, it seems) and they are inherently sequential rather than parallel.
The problem can be fixed by using a different algorithm for these stages. If you collapse these paths using an iterated contraction algorithm, as in the `crdt_improvements` branch, they take tens of iterations and the running times look like (on my desktop):
threads | time |
---|---|
1 | 965.782789ms |
2 | 539.08139ms |
4 | 302.270496ms |
8 | 221.848203ms |
So, certainly some better scaling here, and just generally better performance as well (the overhead of the loops is large, even ignoring the lack of scaling).
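To illustrate why the iteration count drops so sharply, here is a minimal plain-Rust sketch. It is not the code from the `crdt_improvements` branch and does not use differential dataflow at all. It assumes the paths form a simple parent-pointer chain, and it uses pointer doubling as one common form of contraction. The function names `collapse_one_step` and `collapse_doubling` are made up for this example.

```rust
/// Follow one original parent link per round until every node points at its root.
/// Returns the final pointers and the number of rounds that changed something.
fn collapse_one_step(parent: &[usize]) -> (Vec<usize>, usize) {
    let mut current = parent.to_vec();
    let mut rounds = 0;
    loop {
        // Each node advances by exactly one hop along the original links.
        let next: Vec<usize> = current.iter().map(|&p| parent[p]).collect();
        if next == current {
            break;
        }
        current = next;
        rounds += 1;
    }
    (current, rounds)
}

/// Pointer doubling: each round, every node jumps to its pointer's pointer,
/// so the distance covered doubles per round.
fn collapse_doubling(parent: &[usize]) -> (Vec<usize>, usize) {
    let mut current = parent.to_vec();
    let mut rounds = 0;
    loop {
        // Each node adopts the target of its current target.
        let next: Vec<usize> = current.iter().map(|&p| current[p]).collect();
        if next == current {
            break;
        }
        current = next;
        rounds += 1;
    }
    (current, rounds)
}

fn main() {
    // Hypothetical input: a chain where node i points at node i - 1
    // and node 0 is the root (it points at itself).
    let parent: Vec<usize> = (0..1024usize).map(|i| i.saturating_sub(1)).collect();

    let (slow_roots, slow_rounds) = collapse_one_step(&parent);
    let (fast_roots, fast_rounds) = collapse_doubling(&parent);

    // Both strategies must agree on the result; only the round count differs.
    assert_eq!(slow_roots, fast_roots);
    println!("one step per round: {} rounds", slow_rounds);
    println!("pointer doubling:   {} rounds", fast_rounds);
}
```

On this 1024-node chain the one-step version needs 1022 rounds while the doubling version needs 10. Each round is a sequential dependency, so the per-iteration overhead is paid far fewer times with the second approach.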
I'm going to close this out, as I believe the mystery has been resolved!
Following up on https://github.com/TimelyDataflow/differential-dataflow/issues/273 and giving a more concrete example. CRDT seems to exhibit particularly poor scaling. Adding additional workers results in worse run times:
Looking at some perf flame graphs of one worker versus eight workers:
[Flame graph: One Worker]
[Flame graph: Eight Workers]
For eight workers, it seems a lot of time is spent on `step_or_park` but not actually stepping. Instrumenting the `advance` function with an atomic counter:

I'll try to figure out why `advance` is being called so much with eight workers.
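For anyone wanting to repeat this kind of measurement, the sketch below shows the general technique of counting calls with an atomic counter shared across threads. It uses only the standard library. The `advance` function here is a hypothetical stand-in, not the real function from timely, and `run_workers` is a made-up helper that spawns plain threads rather than timely workers.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Global call counter; an atomic lets every thread bump it without a lock.
static ADVANCE_CALLS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the function being instrumented.
fn advance(x: u64) -> u64 {
    // Relaxed ordering is enough because we only need the final total.
    ADVANCE_CALLS.fetch_add(1, Ordering::Relaxed);
    x + 1
}

// Made-up helper: spawn `workers` threads, each calling `advance`
// `calls_per_worker` times, and return the total number of calls observed.
fn run_workers(workers: usize, calls_per_worker: usize) -> usize {
    ADVANCE_CALLS.store(0, Ordering::Relaxed);
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            thread::spawn(move || {
                let mut x = 0u64;
                for _ in 0..calls_per_worker {
                    x = advance(x);
                }
                x
            })
        })
        .collect();
    // Joining every thread guarantees all increments are visible below.
    for handle in handles {
        handle.join().unwrap();
    }
    ADVANCE_CALLS.load(Ordering::Relaxed)
}

fn main() {
    for &workers in &[1usize, 8] {
        let calls = run_workers(workers, 1000);
        println!("{} worker(s): advance called {} times", workers, calls);
    }
}
```

Comparing the totals for one worker and eight workers on the same input is a cheap way to check whether extra workers are doing extra calls rather than sharing a fixed amount of work.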