TaskDriver - Fix edge case behaviours

mbaker3 commented 1 year ago

After the optimization pass in #264 a number of edge case issues were discovered and fixed.

Synchronous Writing

Acquiring a synchronous writer would not invalidate the data stream in a way that consumers could detect. Rather than relying on the shared read JobHandle to detect data invalidation we now increment a data version uint that changes on each write. Also, the original IsDataInvalidated check was backward but that's fixed now and I buried that little tidbit mid paragraph to save @jkeon some embarrassment.

Cancel Unwinding

Data streams configured with CancelBehaviour.Unwind would not have their unwinding job triggered. This was because the regular active data stream was being checked for invalidation (and length). Changing the data invalidation check to interrogate the pendingCancelActiveDataStream and track its data version fixed the issue. As a result of the bug caused by confusing naming all instances of pendingCancel... and pendingCancelActive... have been changed to activeCancel... to avoid any confusion with the "pending" term used to represent the thread count wide data structures used by the non blocking writers.

`CancelProgressCheck` Scheduling Without Work

CancelProgressFlowNode was tracking the last processed version of the data for both itself and its parent. Since all children are processed before the parent the parent version would always get invalidated for the child's next frame.

Child invalidates parent and self check cancel progress
Parent invalidates self during check cancel progress --- next frame ---
Child thinks parent's data has changed since it has the data version from before the parent checked it's cancel progress.

To solve, each node now tracks its own last data version. The child will ask the parent if the parent's data is invalidated since it only matters if the data has changed since the last check cancel progress pass.

`CancelProgressCheck` Scheduling Multiple Times For Each `TaskDriverSystem` Type

For the purposes of cancellation, each TaskDriver instance has its corresponding TaskDriverSystem as a child. However there is only one TaskDriverSystem instance per TaskDriver type per World. This means that each TaskDriverSystem ends up having multiple CancelProgressFlowNode instances pointing to it and, as a result, separate data invalidation tracking. This means that the system streams were:

Having their CancelProgressChecks run multiple times per frame (once per TD instance). This was a waste of time.
Incrementing the data version multiple times each frame when more than one TD instance existed. This caused all systems to cancel check every frame even when there were no "real" potential writes from the outside.

To solve, each TaskDriver or TaskDriverSystem instance that a CancelProgressFlowNode represents now stores its last processed data version per TD/TDS instance NOT per CancelProgressFlowNode instance. So multiple CancelProgressFlowNodes that point to the same TD/TDS instance will all share a data version and only the first CancelProgressFlowNode processed will perform the CancelProgressCheck

Repeated, Self, Data Invalidation

There were a number of cases where data invalidation was being detect, current data version stored, and then a write handle acquired when processing the data. The final acquisition would cause the data version to increment guaranteeing that the data would be considered "invalid" on the next frame.

"Last Data Version" is now read AFTER all acquire request are made so that the process doesn't detect invalidation caused by its own processing. This occurred in:

AbstractDataSource
CancelProgressFlowNode
CancelScheduleInfo
DataStreamScheduleInfo
UpdateScheduleInfo

Fix Cancellation Soft Lock

Fixed soft lock if a cancel request encounters a stream configured to unwind that doesn't have any data in it.

A job configured on an unwind stream that has no data will not schedule. Since the job doesn't schedule it doesn't get a write handle to the cancel progress lookup and then doesn't increment the data version.

Since the data version doesn't increment the CancelProgressFlowNode doesn't re-process and the cancel tree is stuck in a perpetual state of waiting for the unwind to complete. Sending subsequent cancel requests will bump the processing and allow the cancel tree to resolve but this will happen for each node in the tree where no data is in the stream!

Prevent Stale Data from persisting in data sources

Fix bug where active data would remain in the stream/lookup when the data version no longer changed. This is an issue for any data source that is read when the data version doesn't change (Ex: CancelRequestDataSource). Now all data sources will get one extra consolidation after the last data version change.

Specifically, this fixed an issue where cancellation would never complete for a given key when a stream configured to unwind was present in the cancellation tree, had no elements, and no writers after cancellation.

What issues does this resolve?

None

What PRs does this depend on?

None

Does this introduce a breaking change?

[ ] Yes
[x] No

mbaker3 commented 1 year ago

@jkeon Responded to comments. Open to discuss on a call if you want.

mbaker3 commented 1 year ago

Ready for re-review

Addressed feedback from the review.

Also found a couple more edge cases and fixed them in the commits following the feedback.

9bdf306 - Fix Cancellation Soft Lock

Fixed soft lock if a cancel request encounters a stream configured to unwind that doesn't have any data in it.

A job configured on an unwind stream that has no data will not schedule. Since the job doesn't schedule it doesn't get a write handle to the cancel progress lookup and then doesn't increment the data version.

Since the data version doesn't increment the CancelProgressFlowNode doesn't re-process and the cancel tree is stuck in a perpetual state of waiting for the unwind to complete. Sending subsequent cancel requests will bump the processing and allow the cancel tree to resolve but this will happen for each node in the tree where no data is in the stream!

adf85c0 - Prevent Stale Data from persisting in data sources

Fix bug where active data would remain in the stream/lookup when the data version no longer changed. This is an issue for any data source that is read when the data version doesn't change (Ex: CancelRequestDataSource). Now all data sources will get one extra consolidation after the last data version change.

Specifically, this fixed an issue where cancellation would never complete for a given key when a stream configured to unwind was present in the cancellation tree, had no elements, and no writers after cancellation.

mbaker3 commented 1 year ago

All good, thanks for helping me figure these out. I have a slightly better understanding of how this all works now!

decline-cookies / anvil-unity-dots