[IDAG] Fix re-ordering around eager assignment in out_of_order_engine

fknorr commented 1 month ago

Fixes a (very rarely triggered) assertion seen in this CI run of the IDAG branch.

The bug condition manifests as follows:

A host-task or device kernel instruction k1 is submitted.
k1 is assigned. lane.last_incomplete_submission is set to k1.
A second instruction k2 is submitted on the same device as k1 and is marked as conditional_eager_assignable because its only predecessor is k1.
A third instruction k3 is submitted, also with the single predecessor k1 but with a higher piority than k2. It is marked as conditional_eager_assignable in the same fashion.
k3 is eagerly assigned. lane.last_incomplete_submission is set to k3.
The executor queries the async_events for k1 and k3 in exactly this order, and both kernels complete in-between those two event queries, so out_of_order_engine is only notified about k3's completion. This out-of-order completion is explicitly permitted by the engine. lane.last_incomplete_submission is set to nullptr.
Eager assignment is attempted on k2, which expects lane.last_incomplete_submission to be k1 (success) or k3 (abort), but finds nullptr. pop_assignable treats this condition as success, but assign_one asserts against it.

Technically, k2 is transitively unconditional assignable because we know via nullptr that there is no ongoing work on our lane. However upgrading this situation to unconditional_assignable_state while still waiting for the call to complete(k1) would open a whole new can of worms for this incredibly niche situation, so we abort conditional assignment in this case.

github-actions[bot] commented 1 month ago

Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 9992829395

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.002%) to 93.075%

Totals
Change from base Build 9945915519:	-0.002%
Covered Lines:	7231
Relevant Lines:	7507

celerity / celerity-runtime