dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Arm64/Ubuntu] Arm64 Bootstrap CLI is unstable #9776

Closed sdmaclea closed 4 years ago

sdmaclea commented 6 years ago

I built an Arm64 Bootstrap CLI, dotnet-sdk-2.1.300-preview2-008171-linux-arm64, using https://github.com/dotnet/source-build/pull/332

The dotnet command sometimes fails.

For instance:

dotnet new console
dotnet restore
dotnet restore
dotnet restore
dotnet restore
...
...
...
dotnet restore
dotnet restore
dotnet restore
dotnet restore

If I run dotnet restore multiple times it occasionally fails. Out of about 30 runs, I saw:

Similar issues occur with dotnet build.
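
The repeated-run experiment above can be automated so that failures and hangs are counted rather than spotted by eye. The following is only a minimal sketch, not the procedure actually used for this report: the helper name `run_repeatedly`, the 600-second hang threshold, and the ten-line output tail are all assumptions chosen for illustration.

```python
import subprocess
from collections import Counter


def run_repeatedly(cmd, runs=30, timeout_s=600):
    """Run `cmd` `runs` times and tally how each run ended.

    Returns a Counter with the keys "ok", "failed" (non-zero exit code)
    and "hung" (no exit within `timeout_s` seconds; the threshold is an
    assumption, pick one that suits the machine).
    """
    tally = Counter(ok=0, failed=0, hung=0)
    for i in range(1, runs + 1):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # fold stderr into stdout
                universal_newlines=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            tally["hung"] += 1
            print(f"run {i}: hung (killed after {timeout_s}s)")
            continue
        if result.returncode == 0:
            tally["ok"] += 1
        else:
            tally["failed"] += 1
            # Print the tail of the output so the failure can be identified.
            tail = "\n".join(result.stdout.splitlines()[-10:])
            print(f"run {i}: exit code {result.returncode}\n{tail}")
    return tally
```

Assuming `dotnet` is on the PATH and the current directory holds the project created by `dotnet new console`, calling `run_repeatedly(["dotnet", "restore"])` would return the counts for 30 runs; passing `["dotnet", "build"]` would do the same for the build case.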

I am hoping the hangs are related to the tailcall hijacking issue @janvorli fixed recently.

The GC holes look similar to some of the spurious gcStress failures. I'll provide more details when I get back to the office.

@dotnet/arm64-contrib @dotnet/jit-contrib @Maoni0 @swgillespie

sdmaclea commented 6 years ago

This is the most common spurious exception:

Unhandled Exception: System.InvalidCastException: Unable to cast object of type 'System.Object' to type 'System.Collections.Generic.List`1[System.Object]'.
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.FinishSlow(Boolean userDelegateExecute)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

I have also seen this one:

FailFast: Null action in InnerInvoke()

   at System.Diagnostics.Debug.Assert(Boolean condition, String message, String detailMessage)
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

sdmaclea commented 6 years ago

Despite the instability, I have been able to build arm64 natively several times. I do frequently see the most common error above.

sdmaclea commented 6 years ago

As I believe this issue is likely fixed in CoreCLR tip, I have tried again with dotnet-sdk-2.1.300-preview2-008281-linux-arm64, but even the baseline dotnet-sdk-2.1.300-preview2-008281-linux-x64 is failing to work correctly.
See https://github.com/dotnet/source-build/issues/328

sdmaclea commented 6 years ago

More recent builds have failed with issues related to GC holes involving generic lists, possibly similar to dotnet/coreclr#16892

ghost commented 6 years ago

@sdmaclea, can you reproduce it with tip to CoreCLR?

sdmaclea commented 6 years ago

can you reproduce it with tip to CoreCLR?

Based on comments from @janvorli I have not tried the coreclr tip because of suspected incompatibilities.

I have generally been using the latest cli daily build, although I have also tried the daily core-setup build with the most recent sdk from the daily cli.

I have run the coreclr version used in the daily build through the full set of coreclr tests without issue. There are some spurious gcStress issues.

I have not figured out how to build/run the corefx tests on arm64/ubuntu. Maybe that is the next step.

I have automation set up to try the most recent cli daily build and/or the most recent core-setup daily build.

I can kick a run off today, although a fix for the generic lists gc hole went in last night and won't be in the daily cli/core-setup build for a while.

I haven't really figured out how to debug this, as most of my experience debugging coreclr is with small, self-contained unit tests that come with source code.

ghost commented 6 years ago

@sdmaclea, thanks for the details. CoreSetup has received CoreCLR update 2 hrs. ago. CLI repo hasn't received the update. https://github.com/dotnet/cli/pull/8780 hasn't picked up latest CoreSetup update either (yet).

BruceForstall commented 6 years ago

I have not figured out how to build/run the corefx tests on arm64/ubuntu. Maybe that is the next step.

We might need something like what is done for Windows/arm corefx testing in Jenkins: build the corefx tests using cross-compilation on Windows/x86, then copy them over to arm, and run them using the tests\scripts\run-corefx-tests.bat script.

Interesting that when I wrote that script I added the comment, "This script only works for Windows ARM, but perhaps should be extended to work for Windows ARM64 as well." But obviously that also applies to Linux/arm64. And it turns out I'm starting to think about doing corefx testing on Linux/arm32 hardware right now also.

sdmaclea commented 6 years ago

@BruceForstall

I'd love for them to be run in CI.

I also want to run them in QDT's (my) CI.

If I can find the corefx tests already built as Jenkins artifacts in CI then I can run them locally in QDT CI. Also, rumor is they are not as hard to build as the coreclr tests, so maybe they can be built on Linux as needed.

sdmaclea commented 6 years ago

I have reproduced the instability in the corefx tests. These test suites all show an issue similar to the one in the SDK:

System.Threading.Channels.Tests
System.Threading.Tasks.Tests
System.Threading.Tasks.Dataflow.Tests

Closing in favor of dotnet/coreclr#17178