Closed mbaker3 closed 1 year ago
@jkeon Responded to comments. Open to discuss on a call if you want.
Ready for re-review
Addressed feedback from the review.
Also found a couple more edge cases and fixed them in the commits following the feedback.
Fixed soft lock if a cancel request encounters a stream configured to unwind that doesn't have any data in it.
A job configured on an unwind stream that has no data will not schedule. Since the job doesn't schedule it doesn't get a write handle to the cancel progress lookup and then doesn't increment the data version.
Since the data version doesn't increment the CancelProgressFlowNode doesn't re-process and the cancel tree is stuck in a perpetual state of waiting for the unwind to complete. Sending subsequent cancel requests will bump the processing and allow the cancel tree to resolve but this will happen for each node in the tree where no data is in the stream!
Fix bug where active data would remain in the stream/lookup when the data version no longer changed. This is an issue for any data source that is read when the data version doesn't change (Ex: CancelRequestDataSource). Now all data sources will get one extra consolidation after the last data version change.
Specifically, this fixed an issue where cancellation would never complete for a given key when a stream configured to unwind was present in the cancellation tree, had no elements, and no writers after cancellation.
All good, thanks for helping me figure these out. I have a slightly better understanding of how this all works now!
After the optimization pass in #264 a number of edge case issues were discovered and fixed.
Synchronous Writing
Acquiring a synchronous writer would not invalidate the data stream in a way that consumers could detect. Rather than relying on the shared read
JobHandle
to detect data invalidation we now increment a data versionuint
that changes on each write. Also, the original IsDataInvalidated check was backward but that's fixed now and I buried that little tidbit mid paragraph to save @jkeon some embarrassment.Cancel Unwinding
Data streams configured with
CancelBehaviour.Unwind
would not have their unwinding job triggered. This was because the regular active data stream was being checked for invalidation (and length). Changing the data invalidation check to interrogate thependingCancelActiveDataStream
and track its data version fixed the issue. As a result of the bug caused by confusing naming all instances ofpendingCancel...
andpendingCancelActive...
have been changed toactiveCancel...
to avoid any confusion with the "pending" term used to represent the thread count wide data structures used by the non blocking writers.CancelProgressCheck
Scheduling Without WorkCancelProgressFlowNode
was tracking the last processed version of the data for both itself and its parent. Since all children are processed before the parent the parent version would always get invalidated for the child's next frame.To solve, each node now tracks its own last data version. The child will ask the parent if the parent's data is invalidated since it only matters if the data has changed since the last check cancel progress pass.
CancelProgressCheck
Scheduling Multiple Times For EachTaskDriverSystem
TypeFor the purposes of cancellation, each
TaskDriver
instance has its correspondingTaskDriverSystem
as a child. However there is only oneTaskDriverSystem
instance perTaskDriver
type per World. This means that eachTaskDriverSystem
ends up having multipleCancelProgressFlowNode
instances pointing to it and, as a result, separate data invalidation tracking. This means that the system streams were:To solve, each
TaskDriver
orTaskDriverSystem
instance that aCancelProgressFlowNode
represents now stores its last processed data version per TD/TDS instance NOT perCancelProgressFlowNode
instance. So multipleCancelProgressFlowNode
s that point to the same TD/TDS instance will all share a data version and only the firstCancelProgressFlowNode
processed will perform the CancelProgressCheckRepeated, Self, Data Invalidation
There were a number of cases where data invalidation was being detect, current data version stored, and then a write handle acquired when processing the data. The final acquisition would cause the data version to increment guaranteeing that the data would be considered "invalid" on the next frame.
"Last Data Version" is now read AFTER all acquire request are made so that the process doesn't detect invalidation caused by its own processing. This occurred in:
AbstractDataSource
CancelProgressFlowNode
CancelScheduleInfo
DataStreamScheduleInfo
UpdateScheduleInfo
Fix Cancellation Soft Lock
Fixed soft lock if a cancel request encounters a stream configured to unwind that doesn't have any data in it.
A job configured on an unwind stream that has no data will not schedule. Since the job doesn't schedule it doesn't get a write handle to the cancel progress lookup and then doesn't increment the data version.
Since the data version doesn't increment the CancelProgressFlowNode doesn't re-process and the cancel tree is stuck in a perpetual state of waiting for the unwind to complete. Sending subsequent cancel requests will bump the processing and allow the cancel tree to resolve but this will happen for each node in the tree where no data is in the stream!
Prevent Stale Data from persisting in data sources
Fix bug where active data would remain in the stream/lookup when the data version no longer changed. This is an issue for any data source that is read when the data version doesn't change (Ex: CancelRequestDataSource). Now all data sources will get one extra consolidation after the last data version change.
Specifically, this fixed an issue where cancellation would never complete for a given key when a stream configured to unwind was present in the cancellation tree, had no elements, and no writers after cancellation.
What issues does this resolve?
What PRs does this depend on?
Does this introduce a breaking change?