celerity / celerity-runtime

High-level C++ for Accelerator Clusters
https://celerity.github.io
MIT License

Add celerity blockchain for task divergence checking #217

Open GagaLP opened 11 months ago

GagaLP commented 11 months ago

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering the hashes of all tasks from task_recording and comparing them across nodes. When a divergence is detected, an error listing the diverged tasks together with their full task records is printed, for example:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch
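
To illustrate the comparison step, here is a minimal sketch and not the code from this pull request. It assumes each node's task hashes have already been gathered into one vector per node (how the runtime exchanges them is not shown here); `divergence` and `find_divergence` are hypothetical names used only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical result type: the first diverging task index and, for every
// distinct hash seen at that index, the nodes that reported it.
struct divergence {
    std::size_t index;
    std::map<std::uint64_t, std::vector<std::size_t>> nodes_by_hash;
};

// hashes_per_node[n][i] is the hash that node n computed for its i-th task.
// Only the prefix that every node has already reported is compared, so a
// node that is merely behind is not flagged as diverged.
std::optional<divergence> find_divergence(const std::vector<std::vector<std::uint64_t>>& hashes_per_node) {
    assert(!hashes_per_node.empty());

    std::size_t common = hashes_per_node.front().size();
    for(const auto& hashes : hashes_per_node) {
        common = std::min(common, hashes.size());
    }

    for(std::size_t i = 0; i < common; ++i) {
        // Group node ids by the hash they reported for task index i.
        std::map<std::uint64_t, std::vector<std::size_t>> groups;
        for(std::size_t n = 0; n < hashes_per_node.size(); ++n) {
            groups[hashes_per_node[n][i]].push_back(n);
        }
        // More than one distinct hash means the nodes disagree on this task.
        if(groups.size() > 1) { return divergence{i, std::move(groups)}; }
    }
    return std::nullopt;
}
```

Fed the two hashes from the log above at index 1 (with an identical hash at index 0), this sketch would report index 1 with one group per hash, which is the information the error message lists.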

It also includes rudimentary deadlock detection: if some nodes are stuck, a warning is printed after a given amount of time (e.g. 10 seconds):

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
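
The timeout check behind such a warning could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: it supposes the checker records, per node, the time at which that node last moved to a new task. `clock_type` and `find_stalled_nodes` are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using clock_type = std::chrono::steady_clock;

// Returns the ids of all nodes whose last recorded progress is at least
// `timeout` older than `now`. last_progress[n] is assumed to hold the time
// at which node n last moved on to a new task.
std::vector<std::size_t> find_stalled_nodes(const std::vector<clock_type::time_point>& last_progress,
                                            clock_type::time_point now, std::chrono::seconds timeout) {
    std::vector<std::size_t> stalled;
    for(std::size_t n = 0; n < last_progress.size(); ++n) {
        if(now - last_progress[n] >= timeout) { stalled.push_back(n); }
    }
    return stalled;
}
```

A caller would run this check periodically and emit the warning whenever the returned list is non-empty. A monotonic clock such as `std::chrono::steady_clock` is used so that wall-clock adjustments cannot trigger false warnings.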

All of this is automatically turned on by running the program with task recording enabled.

github-actions[bot] commented 11 months ago

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] commented 9 months ago

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] commented 9 months ago

Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

psalz commented 8 months ago

Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points: