celerity / celerity-runtime

High-level C++ for Accelerator Clusters
https://celerity.github.io
MIT License

Fix command graph generation bugs around reductions #223

Closed by fknorr 10 months ago

fknorr commented 10 months ago

Implementing IDAG reductions uncovered two bugs around reductions in distributed command graph generation:

I've added unit tests for both cases.

github-actions[bot] commented 10 months ago

Check-perf-impact results: (b003273516680ef3e6ca0110b3678f5e)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] commented 10 months ago

Check-perf-impact results: (dee217934841bf19e612d83adf4e7dfb)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: 4 individual benchmarks affected
:rocket: Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / jacobi topology

Relative execution time per category: (mean of relative medians)
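The bot's verdicts compare each benchmark against a baseline run. Below is a minimal sketch of how such a check could work. It assumes that each benchmark's result is the ratio of its median execution time after the change to its median before the change, and that a category's figure is the plain mean of those ratios. The struct, the function names and the data layout are hypothetical and are not taken from Celerity's actual check-perf-impact tooling. Only the 1.25x and 0.80x thresholds come from the bot output above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one benchmark's timings on the base commit and on the PR commit.
struct benchmark_samples {
	std::string category;
	std::string name;
	std::vector<double> baseline;  // execution times before the change
	std::vector<double> candidate; // execution times after the change
};

// Median of a non-empty sample set; taken by value so it can be sorted in place.
double median(std::vector<double> v) {
	assert(!v.empty());
	std::sort(v.begin(), v.end());
	const std::size_t n = v.size();
	return n % 2 == 1 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Ratio of medians: a value above 1 means the change made the benchmark slower.
double relative_median(const benchmark_samples& b) { return median(b.candidate) / median(b.baseline); }

enum class verdict { slowdown, speedup, unchanged };

// Thresholds from the bot output: > 1.25x counts as a slowdown, < 0.80x as a speedup.
verdict classify(const double rel) {
	if(rel > 1.25) return verdict::slowdown;
	if(rel < 0.80) return verdict::speedup;
	return verdict::unchanged;
}

// "Mean of relative medians": average the per-benchmark ratios within each category.
std::map<std::string, double> mean_relative_median_per_category(const std::vector<benchmark_samples>& benchmarks) {
	std::map<std::string, std::pair<double, std::size_t>> acc; // category -> (sum of ratios, count)
	for(const auto& b : benchmarks) {
		auto& [sum, count] = acc[b.category];
		sum += relative_median(b);
		++count;
	}
	std::map<std::string, double> means;
	for(const auto& [category, sum_count] : acc) {
		means[category] = sum_count.first / static_cast<double>(sum_count.second);
	}
	return means;
}
```

Comparing medians rather than means keeps a single outlier sample from flipping a verdict. However, it does not help when a whole run is shifted by system-level jitter, which is the problem raised later in this thread.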

github-actions[bot] commented 10 months ago

Check-perf-impact results: (d21ecac39af892ab1c227e6d0ae10ebf)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, benchmark independent task pattern with N tasks - 100 / task generation
:rocket: Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 50 / iterations

Relative execution time per category: (mean of relative medians)

fknorr commented 10 months ago

I re-ran the benchmarks because there seemed to be significant jitter in the system benchmarks. However, "benchmark independent task pattern with 100 tasks" does indeed appear to be slower, even though this change should not affect code without reductions.

fknorr commented 10 months ago

@PeterTh discovered that the results of our multi-threaded benchmarks, especially the system benchmarks, are less stable and reliable than we thought, and that our benchmarking setup needs some work.

Barring some extremely obscure cause such as instruction-cache effects or OS scheduling, I'm going to trust the command-graph benchmarks. They measure this change in isolation and show no change in performance.