dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Failed assert in ImplVec<T>.Fetch on Arm #6921

Closed ericstj closed 9 months ago

ericstj commented 9 months ago

This failed 3 separate times for me in one PR on Arm. https://github.com/dotnet/machinelearning/blob/5483ba93c591367e9465884ca23feab79f4bf1f0/src/Microsoft.ML.Data/DataView/CacheDataView.cs#L1389

What's odd to me is that the both _indexBoundaries and _valueBoundaries should be consistent as they're resized at the same time, yet we only hit the assert for _valueBoundaries.

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=505863 Build error leg or test failing: Microsoft.ML.RunTests.TestEntryPoints.TestCrossValidationMacro Pull request: https://github.com/dotnet/machinelearning/pull/6917

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "Microsoft.ML.Data.CacheDataView.ColumnCache.ImplVec`1.Fetch(Int32 idx, VBuffer`1& value)",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=505863 Error message validated: Microsoft.ML.Data.CacheDataView.ColumnCache.ImplVec1.Fetch(Int32 idx, VBuffer1& value) Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 12/21/2023 4:49:08 PM UTC

Report

Build Definition Test Pull Request
505850 dotnet/machinelearning Microsoft.ML.Core.Tests.WorkItemExecution dotnet/machinelearning#6703
506779 dotnet/machinelearning Microsoft.ML.Tests.WorkItemExecution dotnet/machinelearning#6794
505863 dotnet/machinelearning Microsoft.ML.Predictor.Tests.WorkItemExecution dotnet/machinelearning#6917

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 3 3
ericstj commented 9 months ago

This is happening on both Windows and Ubuntu ARM. Here's a hit on Windows: https://dev.azure.com/dnceng-public/public/_build/results?buildId=505850&view=ms.vss-test-web.build-test-results-tab&runId=11789302&resultId=100910&paneView=debug

It seems to be happening in many different test cases. Here's another unique failure mode: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6917-merge-28e2771c93374e42be/Microsoft.ML.Predictor.Tests/1/console.16ed6aa3.log?helixlogtype=result Same stack but buried in the console spew for the test (so missing from the test results).

My only theory here is that instructions are reordered such that a newer value for _rowCount is used than the value of _valueBoundaries. I don't see anything here that protects against multi-threaded access to these fields.

ericstj commented 9 months ago

I'm also seeing this failure as far back as we have build history. Here's one from a rolling build on 11/30: https://dev.azure.com/dnceng-public/public/_build/results?buildId=485579&view=ms.vss-test-web.build-test-results-tab&runId=11188822&resultId=101789&paneView=debug

ericstj commented 9 months ago

Discussed this with some folks (thanks @lambdageek!) and believe the writes are being reorded as described here: https://kunalspathak.github.io/2020-07-25-ARM64-Memory-Barriers/

You can see that here: https://github.com/dotnet/machinelearning/pull/6922 https://dev.azure.com/dnceng-public/public/_build/results?buildId=506884&view=ms.vss-test-web.build-test-results-tab&runId=11818594&resultId=100711&paneView=attachments

The test fails

(1) Unexpected exception: Assert failed: _rowCount = 32, Utils.Size(_valueBoundaries) = 32

So we have an incremented row count, but an array that hasn't been resized.

I think this is only impacting the assert. The idx passed in will still be less than the total count, since that's gated by the waiter, which is updated by the fetching thread.

One fix here would be to introduce a write barrier where _rowCount is updated. That would ensure we'd never be in a state where _rowCount has a newer value than the arrays it's used to resize.

Alternatively, since _rowCount is never actually used in this method (except for asserts) I could just change all the asserts to be checking with idx and idx + 1. We can make it so _rowCount is only ever read by the caching thread, and thus we don't need any write barriers - since that will be guaranteed on a single thread.