frankmcsherry / dynamic-datalog

Engines, queries, and data for dynamic Datalog computation

CRDT Slower as workers are added. #5

Closed gatoWololo closed 2 years ago

gatoWololo commented 4 years ago

Following up on https://github.com/TimelyDataflow/differential-dataflow/issues/273 with a more concrete example: the CRDT problem seems to exhibit particularly poor scaling, and adding workers results in worse run times:

fish> for i in (seq 8)
           echo time ./target/release/crdt ../problems/crdt/input/ -w$i
           time ./target/release/crdt ../problems/crdt/input/ -w$i
      end
time ./target/release/crdt ../problems/crdt/input/ -w1
result: ((), 0, 104851)
2.655789383s    complete

________________________________________________________
Executed in    2.66 secs   fish           external 
   usr time    2.60 secs  246.00 micros    2.60 secs 
   sys time    0.06 secs  100.00 micros    0.06 secs 

time ./target/release/crdt ../problems/crdt/input/ -w2
result: ((), 0, 104851)
2.949718895s    complete
2.949774511s    complete

________________________________________________________
Executed in    2.96 secs   fish           external 
   usr time    5.67 secs  237.00 micros    5.67 secs 
   sys time    0.16 secs   96.00 micros    0.16 secs 

time ./target/release/crdt ../problems/crdt/input/ -w3
result: ((), 0, 104851)
3.068952142s    complete
3.069093637s    complete
3.069253361s    complete

________________________________________________________
Executed in    3.08 secs   fish           external 
   usr time    8.74 secs  223.00 micros    8.73 secs 
   sys time    0.28 secs   91.00 micros    0.28 secs 

time ./target/release/crdt ../problems/crdt/input/ -w4
result: ((), 0, 104851)
3.104697076s    complete
3.104720353s    complete
3.104779249s    complete
3.104817709s    complete

________________________________________________________
Executed in    3.11 secs   fish           external 
   usr time   11.73 secs  217.00 micros   11.73 secs 
   sys time    0.45 secs   87.00 micros    0.45 secs 

time ./target/release/crdt ../problems/crdt/input/ -w5
result: ((), 0, 104851)
3.337570417s    complete
3.337584961s    complete
3.337652864s    complete
3.337643363s    complete
3.337674563s    complete

________________________________________________________
Executed in    3.35 secs   fish           external 
   usr time   15.77 secs  214.00 micros   15.77 secs 
   sys time    0.52 secs   87.00 micros    0.52 secs 

time ./target/release/crdt ../problems/crdt/input/ -w6
result: ((), 0, 104851)
3.468785548s    complete
3.468819104s    complete
3.468858908s    complete
3.468944907s    complete
3.469012034s    complete
3.469172189s    complete

________________________________________________________
Executed in    3.48 secs   fish           external 
   usr time   19.68 secs  217.00 micros   19.68 secs 
   sys time    0.62 secs   88.00 micros    0.62 secs 

time ./target/release/crdt ../problems/crdt/input/ -w7
result: ((), 0, 104851)
3.685714758s    complete
3.685744579s    complete
3.685856582s    complete
3.685880433s    complete
3.685882302s    complete
3.685883169s    complete
3.686006777s    complete

________________________________________________________
Executed in    3.70 secs   fish           external 
   usr time   24.45 secs  222.00 micros   24.45 secs 
   sys time    0.70 secs   89.00 micros    0.70 secs 

time ./target/release/crdt ../problems/crdt/input/ -w8
result: ((), 0, 104851)
3.846975408s    complete
3.846992356s    complete
3.847061956s    complete
3.847267346s    complete
3.847285809s    complete
3.847296428s    complete
3.847290407s    complete
3.847402402s    complete

________________________________________________________
Executed in    3.86 secs   fish           external 
   usr time   28.90 secs  216.00 micros   28.90 secs 
   sys time    1.04 secs   88.00 micros    1.04 secs 

Looking at some perf flame graphs of one worker versus eight workers:

[flame graph: one worker]

[flame graph: eight workers]

For eight workers, it seems a lot of time is spent in step_or_park without actually stepping. Instrumenting the advance function with an atomic counter:

  > time cargo run --release --bin crdt -- ../problems/crdt/input/ -w1
  3.023245458s    complete
  Calls to advance: 65270
  > time cargo run --release --bin crdt -- ../problems/crdt/input/ -w8
  225.56567529s   complete
  Calls to advance: 4043775664

I'll try to figure out why advance is being called so much with eight workers.
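
For reference, here is a minimal sketch of the kind of counter used for this measurement. It assumes a plain static AtomicUsize bumped at the top of the instrumented function; record_advance_call and report are hypothetical names used only for illustration, while the actual experiment incremented a counter inside timely's advance directly.

    // Minimal sketch: count calls to a hot function with a static atomic.
    // `record_advance_call` and `report` are illustrative names, not timely APIs;
    // the real change patched timely's `advance` to bump the counter.
    use std::sync::atomic::{AtomicUsize, Ordering};

    static ADVANCE_CALLS: AtomicUsize = AtomicUsize::new(0);

    /// Called at the top of the instrumented function.
    fn record_advance_call() {
        ADVANCE_CALLS.fetch_add(1, Ordering::Relaxed);
    }

    /// Printed once, after the computation completes.
    fn report() {
        println!("Calls to advance: {}", ADVANCE_CALLS.load(Ordering::Relaxed));
    }

    fn main() {
        // Stand-in for the real workload: exercise the instrumented hook.
        for _ in 0..65_270 {
            record_advance_call();
        }
        report();
    }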

frankmcsherry commented 4 years ago

That sounds like a great plan! If it is helpful, the timely logging infrastructure produces streams of scheduling events, message communication and receipt events, and the like; that can be helpful for determining what is on the critical path. For inspiration, maybe check out

https://github.com/MaterializeInc/materialize/tree/master/src/dataflow/src/logging
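
For concreteness, a rough sketch of tapping that event stream follows. It assumes the log_register()/insert hooks and the timely::logging::TimelyEvent type roughly as they appear in timely's own logging examples; exact signatures vary across versions, so treat this as an outline rather than the project's code.

    // Sketch only: assumes timely's logging hooks (`worker.log_register()`,
    // `timely::logging::TimelyEvent`) as in timely's logging examples; exact
    // types and signatures may differ between releases.
    use timely::logging::TimelyEvent;

    fn main() {
        timely::execute_from_args(std::env::args(), |worker| {
            // Register a handler for the built-in "timely" event stream.
            worker.log_register().insert::<TimelyEvent, _>("timely", |_time, data| {
                for (ts, worker_index, event) in data.iter() {
                    // Operator scheduling events are the ones that expose the
                    // (sequential) critical path through the dataflow.
                    if let TimelyEvent::Schedule(ev) = event {
                        println!("{:?}\tworker {}\toperator {}\t{:?}",
                                 ts, worker_index, ev.id, ev.start_stop);
                    }
                }
            });

            // ... build and run the crdt dataflow as usual ...
        }).unwrap();
    }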

The crdt example is definitely a bit mysterious: if you slowly play out the updates, rather than in one batch, it takes quite a bit longer. My guess is that there is a nice big global aggregation that ends up doing and re-doing a bunch of work.

gatoWololo commented 4 years ago

Thanks for the tips!

I'll look into it.

frankmcsherry commented 3 years ago

The reason for this seems to be that the program as written exhibits two iterative scopes that need to perform many thousands of iterations. The ancestor collapsing takes over 7000 iterations, and the "blank star" collapsing takes over 13000 iterations. The control flow aspects of these iterations take some time (tens of microseconds, it seems) and they are inherently sequential rather than parallel.

The problem can be fixed by using a different algorithm for these stages. If you collapse these paths using an iterated contraction algorithm, as in the crdt_improvements branch, they take tens of iterations instead, and the running times look like this (on my desktop):

threads    time
1          965.782789ms
2          539.08139ms
4          302.270496ms
8          221.848203ms

So, certainly some better scaling here, and just generally better performance as well (the overhead of the loops is large, even ignoring the lack of scaling).
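
To see why contraction changes the iteration count so dramatically, here is a standalone, single-threaded illustration (plain Rust, not the code in the crdt_improvements branch): pointer doubling over a parent array collapses a chain of length n to its root in roughly log2(n) rounds, whereas following one parent pointer per round would take on the order of n rounds.

    // Standalone illustration only: pointer doubling. Each round replaces
    // parent[x] with parent[parent[x]], so path lengths halve and a
    // 13,000-node chain collapses in ~14 rounds instead of ~13,000.
    fn main() {
        let n = 13_000;
        // parent[i] = i - 1 forms one long chain rooted at node 0.
        let mut parent: Vec<usize> = (0..n).map(|i| i.saturating_sub(1)).collect();
        let mut rounds = 0;
        loop {
            let next: Vec<usize> = parent.iter().map(|&p| parent[p]).collect();
            if next == parent {
                break;
            }
            parent = next;
            rounds += 1;
        }
        println!("collapsed a chain of {} nodes in {} rounds", n, rounds);
    }

The same doubling idea, expressed inside the iterative scopes, is what brings them down from thousands of iterations to tens.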

frankmcsherry commented 2 years ago

I'm going to close this out, as I believe the mystery has been resolved!