Detect uninitialized reads and overlapping writes

fknorr commented 10 months ago

During IDAG development, asserting that instructions do not read from uninitialized memory turned out to be an invaluable tool for finding bugs in dependency tracking and data migration. This PR is a back-port of that detection logic, extended to also find overlapping writes between chunks on CDAG generation.

By default, task_manager and distributed_graph_generator throw when detecting an uninitialized read or overlapping writes in order to catch internal errors and report tests with bogus access patterns. In the runtime, results from these checks are reported through warning (uninitialized read) and error (overlapping writes) logs instead - both for backwards compatibillity, and because uninitialized reads are expected when submitting kernels with read_write access in a loop (see also celerity/meta#74).

All tests with incorrect access patterns were fixed, and the reporting feature is tested both with the TDAG / CDAG test context infrastructure as well as the actual runtime.

github-actions[bot] commented 10 months ago

no optimizations

Check-perf-impact results: (1d1f40270fc25a329bb16053c8fa15f2)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, generating large command graphs for N nodes - 16 / wave_sim topology, benchmark independent task pattern with N tasks - 100 / task generation
:rocket: Significant speedup (<0.80x) in some microbenchmark results: normalizing a fully mergeable tiling of boxes - 1 / large, embedded in 3d

Relative execution time per category: (mean of relative medians)

command-graph : 1.17x :warning:
graph-nodes : 0.99x
grid : 1.00x
scheduler : 1.12x :warning:
system : 1.10x
task-graph : 1.01x

github-actions[bot] commented 10 months ago

distributed overlapping-write detection

Check-perf-impact results: (7e3c8bfa480dfad2a4f08c4fabe32cc2)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: normalizing randomized box sets - 3d / small - native, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, benchmark independent task pattern with N tasks - 100 / task generation
:rocket: Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 50 / iterations

Relative execution time per category: (mean of relative medians)

command-graph : 1.09x
graph-nodes : 0.99x
grid : 1.02x
scheduler : 1.09x
system : 1.08x
task-graph : 1.02x

github-actions[bot] commented 10 months ago

distributed overlapping-write detection + bounding-box uninitialized-read detection

Check-perf-impact results: (385bf6fbfe375253106eaa9f9b53a88e)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: benchmark independent task pattern with N tasks - 100 / task generation
:rocket: Significant speedup (<0.80x) in some microbenchmark results: benchmark stencil pattern with N time steps - 1000 / iterations

Relative execution time per category: (mean of relative medians)

command-graph : 1.10x
graph-nodes : 0.98x
grid : 1.00x
scheduler : 1.06x
system : 1.05x
task-graph : 0.98x

fknorr commented 10 months ago

I managed to optimize this a bit further, but CDAG generation still takes a 10% performance hit.

github-actions[bot] commented 9 months ago

disabling diagnostics by default in release builds

Check-perf-impact results: (9c7d95e3b2850796c2d810616c0f1aff)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
:rocket: Significant speedup (<0.80x) in some microbenchmark results: normalizing a fully mergeable tiling of boxes - 1 / large, embedded in 3d

Relative execution time per category: (mean of relative medians)

command-graph : 1.02x
graph-nodes : 1.01x
grid : 1.00x
scheduler : 1.02x
system : 1.00x
task-graph : 1.00x

github-actions[bot] commented 9 months ago

do not maintain initialized-region when diagnostics are disabled

Check-perf-impact results: (a1f01e5fe178a68d42a7063603612487)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
:rocket: Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / chain topology

Relative execution time per category: (mean of relative medians)

command-graph : 1.01x
graph-nodes : 0.99x
grid : 1.01x
scheduler : 1.02x
system : 0.99x
task-graph : 1.00x

fknorr commented 9 months ago

I have turned the task-manager / graph-generator policy settings into constructor arguments; that way I can skip updating the initialized-region of a buffer when diagnostics are disabled (previously this region had to be maintained in anticipation of a change in polices, which of course never happened). I've also added a bit more documentation.

github-actions[bot] commented 9 months ago

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] commented 9 months ago

Check-perf-impact results: (54ff34045c40bf12cbc2ca95f475792b)

:warning: Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology
:rocket: Significant speedup (<0.80x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / chain topology, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology

Relative execution time per category: (mean of relative medians)

command-graph : 1.00x
graph-nodes : 0.98x
grid : 1.01x
scheduler : 1.01x
system : 0.98x
task-graph : 1.01x

fknorr commented 9 months ago

I have used this opportunity to refactor how task debug names are generated from class KernelName types and printed in debug outputs. Unnamed kernels now do not show "unnamed_kernel" as their name anymore and the task type is visualized as well.

There is now a new step_builder::name() function to conveniently set task debug names from a distributed_graph_generator_test_context.

celerity / celerity-runtime