Closed ericstj closed 9 months ago
This is happening on both Windows and Ubuntu ARM. Here's a hit on Windows: https://dev.azure.com/dnceng-public/public/_build/results?buildId=505850&view=ms.vss-test-web.build-test-results-tab&runId=11789302&resultId=100910&paneView=debug
It seems to be happening in many different test cases. Here's another unique failure mode: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6917-merge-28e2771c93374e42be/Microsoft.ML.Predictor.Tests/1/console.16ed6aa3.log?helixlogtype=result Same stack but buried in the console spew for the test (so missing from the test results).
My only theory here is that instructions are reordered such that a newer value for _rowCount
is used than the value of _valueBoundaries
. I don't see anything here that protects against multi-threaded access to these fields.
I'm also seeing this failure as far back as we have build history. Here's one from a rolling build on 11/30: https://dev.azure.com/dnceng-public/public/_build/results?buildId=485579&view=ms.vss-test-web.build-test-results-tab&runId=11188822&resultId=101789&paneView=debug
Discussed this with some folks (thanks @lambdageek!) and believe the writes are being reorded as described here: https://kunalspathak.github.io/2020-07-25-ARM64-Memory-Barriers/
You can see that here: https://github.com/dotnet/machinelearning/pull/6922 https://dev.azure.com/dnceng-public/public/_build/results?buildId=506884&view=ms.vss-test-web.build-test-results-tab&runId=11818594&resultId=100711&paneView=attachments
The test fails
(1) Unexpected exception: Assert failed: _rowCount = 32, Utils.Size(_valueBoundaries) = 32
So we have an incremented row count, but an array that hasn't been resized.
I think this is only impacting the assert. The idx
passed in will still be less than the total count, since that's gated by the waiter, which is updated by the fetching thread.
One fix here would be to introduce a write barrier where _rowCount
is updated. That would ensure we'd never be in a state where _rowCount
has a newer value than the arrays it's used to resize.
Alternatively, since _rowCount
is never actually used in this method (except for asserts) I could just change all the asserts to be checking with idx
and idx + 1
. We can make it so _rowCount
is only ever read by the caching thread, and thus we don't need any write barriers - since that will be guaranteed on a single thread.
This failed 3 separate times for me in one PR on Arm. https://github.com/dotnet/machinelearning/blob/5483ba93c591367e9465884ca23feab79f4bf1f0/src/Microsoft.ML.Data/DataView/CacheDataView.cs#L1389
What's odd to me is that the both
_indexBoundaries
and_valueBoundaries
should be consistent as they're resized at the same time, yet we only hit the assert for_valueBoundaries
.Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=505863 Build error leg or test failing: Microsoft.ML.RunTests.TestEntryPoints.TestCrossValidationMacro Pull request: https://github.com/dotnet/machinelearning/pull/6917
Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=505863 Error message validated:
Microsoft.ML.Data.CacheDataView.ColumnCache.ImplVec
1.Fetch(Int32 idx, VBuffer1& value)
Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 12/21/2023 4:49:08 PM UTCReport
Summary