Fix command graph generation bugs around reductions

celerity / celerity-runtime

High-level C++ for Accelerator Clusters

https://celerity.github.io

MIT License

Fix command graph generation bugs around reductions #223

Closed fknorr closed 10 months ago

fknorr commented 10 months ago

Implementing IDAG reductions uncovered two bugs around reductions in distributed command graph generation:

1. For an all-to-all reduction, we emit push commands for the partial results from our node, followed by a reduction command. The reduction command logically overwrites the buffer contents, so it must anti-depend on these pushes, but that dependency was missing. This bug does not appear to break reductions in the current runtime, most likely because the final reduction result is only committed to device memory once it is read by the next consumer task.
2. We elide reduction commands if there is only a single producer chunk. If the result was subsequently read by multiple nodes, we generated push commands on the producer, but failed to generate the corresponding await-pushes on the consumer nodes.
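
To make the two expected behaviors concrete, here is a minimal sketch in a toy model. It is not the runtime's actual command graph code: `command`, `dependency`, `generate_all_to_all_reduction`, `generate_elided_reduction_transfers` and `count_commands` are hypothetical names invented for illustration. For brevity, the toy puts the commands of all nodes into a single vector (whereas in distributed generation each node builds only its own commands) and omits the receiving side of the all-to-all case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: none of these types are Celerity's real data structures.
enum class command_kind { push, await_push, reduction };
enum class dependency_kind { true_dep, anti_dep };

struct dependency {
    std::size_t on;       // index of the command that must complete first
    dependency_kind kind; // why the ordering is required
};

struct command {
    command_kind kind;
    int node; // node that executes the command
    int peer; // push: target node; await_push: source node; otherwise -1
    std::vector<dependency> deps;
};

using command_graph = std::vector<command>;

// Case 1: all-to-all reduction as seen from node `local`.
// The partial result is pushed to every peer. The reduction command then
// overwrites the buffer, so it must anti-depend on each of those pushes.
command_graph generate_all_to_all_reduction(int local, int num_nodes) {
    assert(num_nodes > 0 && local >= 0 && local < num_nodes);
    command_graph cdag;
    std::vector<std::size_t> pushes;
    for(int peer = 0; peer < num_nodes; ++peer) {
        if(peer == local) continue;
        pushes.push_back(cdag.size());
        cdag.push_back({command_kind::push, local, peer, {}});
    }
    command reduce{command_kind::reduction, local, -1, {}};
    for(const std::size_t p : pushes) {
        reduce.deps.push_back({p, dependency_kind::anti_dep});
    }
    cdag.push_back(reduce);
    return cdag;
}

// Case 2: a single producer chunk, so no reduction command is generated.
// Every remote consumer still needs a push on the producer AND a matching
// await-push on the consumer.
command_graph generate_elided_reduction_transfers(int producer, const std::vector<int>& consumers) {
    command_graph cdag;
    for(const int consumer : consumers) {
        if(consumer == producer) continue; // local reads need no transfer
        cdag.push_back({command_kind::push, producer, consumer, {}});
        cdag.push_back({command_kind::await_push, consumer, producer, {}});
    }
    return cdag;
}

// Counts the commands of a given kind that execute on a given node.
std::size_t count_commands(const command_graph& cdag, command_kind kind, int node) {
    return static_cast<std::size_t>(std::count_if(cdag.begin(), cdag.end(),
        [=](const command& c) { return c.kind == kind && c.node == node; }));
}
```

In this toy model, the first case is checked by asserting that every dependency of the reduction command is an `anti_dep` on a push, and the second by asserting that `count_commands(..., command_kind::await_push, consumer)` is 1 for each remote consumer.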

I've added unit tests for both cases.

github-actions[bot] commented 10 months ago

Check-perf-impact results: (b003273516680ef3e6ca0110b3678f5e)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] commented 10 months ago

Check-perf-impact results: (dee217934841bf19e612d83adf4e7dfb)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: 4 individual benchmarks affected
:rocket: Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / jacobi topology

Relative execution time per category: (mean of relative medians)

command-graph : 1.00x
graph-nodes : 0.99x
grid : 1.02x
scheduler : 1.02x
system : 1.15x :warning:
task-graph : 1.01x

github-actions[bot] commented 10 months ago

Check-perf-impact results: (d21ecac39af892ab1c227e6d0ae10ebf)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, benchmark independent task pattern with N tasks - 100 / task generation
:rocket: Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 50 / iterations

Relative execution time per category: (mean of relative medians)

command-graph : 1.00x
graph-nodes : 0.96x
grid : 1.01x
scheduler : 1.03x
system : 1.06x
task-graph : 0.99x

fknorr commented 10 months ago

I re-ran the benchmarks because there seemed to be significant jitter in the system benchmarks, but it appears that "benchmark independent task pattern with 100 tasks" is indeed slowing down, even though the change should not affect code without reductions.

fknorr commented 10 months ago

@PeterTh discovered that results of our multi-threaded benchmarks, especially system benchmarks, are not as stable and reliable as we thought, and our benchmarking setup needs some work.

Barring some extremely obscure cause such as instruction cache effects, OS scheduling or similar, I'm going to trust the command-graph benchmarks, which measure this change in isolation and do not show a change in performance.