dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Occasionally hitting error MSB6006: "csc.dll" exited with code 139 on linux #104123

Open LoopedBard3 opened 5 days ago

LoopedBard3 commented 5 days ago

Description

In the dotnet-runtime-perf pipeline, we are seeing multiple Linux jobs hit the error dotnet/x64/sdk/9.0.100-preview.7.24323.5/Roslyn/Microsoft.CSharp.Core.targets(85,5): error MSB6006: "csc.dll" exited with code 139. when building our MicroBenchmarks.csproj file for BDN testing. It occurs on 0 to 3 of the 30 Helix work items we send out for each job, with no consistency in which of the 30 work items is affected or which agent machine hits the error. Pretty sure I have a CoreDump from some of these failed runs if that would be useful.
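For context, an exit code above 128 conventionally means the process was terminated by a signal, with the signal number being the code minus 128. So 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch for decoding such codes when triaging logs (the helper name `decode_exit_code` is hypothetical, not part of any pipeline tooling):

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe an exit code using the common 128 + signal-number convention."""
    if code > 128:
        try:
            # Map the signal number back to its symbolic name.
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"exit status {code}"


print(decode_exit_code(139))  # SIGSEGV on Linux
```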

Potentially related to: https://github.com/dotnet/runtime/issues/57558

Reproduction Steps

This needs more testing, but the steps below should reproduce the problem. As mentioned in the description, the error does not occur consistently.

Steps (high level):

  1. Clone dotnet/performance.
  2. From the top level of the performance repo, run python3 ./scripts/benchmarks_ci.py --csproj ./src/benchmarks/micro/MicroBenchmarks.csproj --incremental no --architecture x64 -f net9.0 --dotnet-versions 9.0.100-preview.6.24320.9 --bdn-arguments="--anyCategories Libraries Runtime --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29"
  3. If the BDN tests start running successfully, you did not hit the error.

Steps (inner command; this should match the above, but ping me if a step seems to be missing):

  1. Clone dotnet/performance.
  2. Install a dotnet version equal to or newer than 9.0.100-preview.6.24320.9 with dotnet-install.sh: dotnet-install.sh -InstallDir ./performance/tools/dotnet/x64 -Architecture x64 -Version 9.0.100-preview.6.24320.9
  3. From the top level of the performance repo, run dotnet run --project ./src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0 --no-restore --no-build -- --anyCategories Libraries Runtime "" --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29 --artifacts ./artifacts/BenchmarkDotNet.Artifacts --packages ./artifacts/packages --buildTimeout 1200
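Because the failure is intermittent, it may help to rerun the build in a loop until the crash shows up. The following is a rough sketch under that assumption, not part of the performance repo. The names `is_csc_crash` and `run_until_crash` are hypothetical, and the command to run is whatever build command you are reproducing with.

```python
import re
import subprocess
import sys
from typing import Optional, Sequence

# Pattern for the failure line quoted in this issue.
CSC_CRASH = re.compile(r'error MSB6006: "csc\.dll" exited with code 139')


def is_csc_crash(output: str) -> bool:
    """Return True if the build output contains the csc.dll exit-code-139 error."""
    return CSC_CRASH.search(output) is not None


def run_until_crash(cmd: Sequence[str], max_attempts: int = 20) -> Optional[int]:
    """Run cmd repeatedly; return the 1-based attempt that crashed, or None."""
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
        if is_csc_crash(proc.stdout + proc.stderr):
            return attempt
    return None
```

For example, `run_until_crash(["dotnet", "build", "./src/benchmarks/micro/MicroBenchmarks.csproj", "--configuration", "Release", "--framework", "net9.0"])` would rebuild up to 20 times and report the first attempt that hit the error, if any.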

Expected behavior

Build is successful and continues to run the BenchmarkDotNet tests.

Actual behavior

The build fails:

dotnet build /home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0 --no-restore /p:NuGetPackageRoot=/home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/artifacts/packages /p:RestorePackagesPath=/home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/artifacts/packages /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1
   Reporting -> /home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/artifacts/bin/Reporting/Release/netstandard2.0/Reporting.dll
   BenchmarkDotNet.Extensions -> /home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/artifacts/bin/BenchmarkDotNet.Extensions/Release/netstandard2.0/BenchmarkDotNet.Extensions.dll
/home/helixbot/work/B45E09D9/w/AC2C09AE/e/performance/tools/dotnet/x64/sdk/9.0.100-preview.7.24323.5/Roslyn/Microsoft.CSharp.Core.targets(85,5): error MSB6006: "csc.dll" exited with code 139. 

Full logs from example run with the error available: dotnet-runtime-perf Run 20240620.3. The specific partitions are Partition 2 and Partition 6 from the job 'Performance linux x64 release coreclr JIT micro perfowl NoJS False False False net9.0'.

Regression?

This started occurring between our runs dotnet-runtime-perf Run 20240620.2 and dotnet-runtime-perf Run 20240620.3.

The runtime repo comparison between these two runs is https://github.com/dotnet/runtime/compare/4a7fe654d798a372f5786f026006437444f14f1e...b0c4728305b98c0ae22d90f72c805aecb628ba8c. Our performance repo also took one update, but it seems highly unlikely to be related: https://github.com/dotnet/performance/pull/4279. Version difference information is available in the Configuration section below.

Known Workarounds

None

Configuration

.NET Version information:

Information from the first run with the error, dotnet-runtime-perf Run 20240620.3:

$ dotnet --info
.NET SDK:
 Version:           9.0.100-preview.6.24320.9
 Commit:            7822425c3e
 Workload version:  9.0.100-manifests.cc027b4d
 MSBuild version:   17.11.0-preview-24318-05+4a45d5633

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  22.04
 OS Platform: Linux
 RID:         linux-x64
 Base Path:   <Path>/performance/tools/dotnet/x64/sdk/9.0.100-preview.6.24320.9/

.NET workloads installed:
Configured to use loose manifests when installing new manifests.
There are no installed workloads to display.

Host:
  Version:      9.0.0-preview.6.24319.11
  Architecture: x64
  Commit:       static

.NET SDKs installed:
  9.0.100-preview.6.24320.9 [<Path>/performance/tools/dotnet/x64/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-preview.6.24320.4 [<Path>/performance/tools/dotnet/x64/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-preview.6.24319.11 [<Path>/performance/tools/dotnet/x64/shared/Microsoft.NETCore.App]

Information from the run before the error, dotnet-runtime-perf Run 20240620.2:

$ dotnet --info
.NET SDK:
 Version:           9.0.100-preview.6.24319.5
 Commit:            f3ebfb5ccb
 Workload version:  9.0.100-manifests.bae61ee5
 MSBuild version:   17.11.0-preview-24318-02+0a3683cf7

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  22.04
 OS Platform: Linux
 RID:         linux-x64
 Base Path:   <path>/performance/tools/dotnet/x64/sdk/9.0.100-preview.6.24319.5/

.NET workloads installed:
Configured to use loose manifests when installing new manifests.
There are no installed workloads to display.

Host:
  Version:      9.0.0-preview.6.24307.2
  Architecture: x64
  Commit:       static

.NET SDKs installed:
  9.0.100-preview.6.24319.5 [<path>/performance/tools/dotnet/x64/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 9.0.0-preview.6.24309.2 [<path>/performance/tools/dotnet/x64/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 9.0.0-preview.6.24307.2 [<path>/performance/tools/dotnet/x64/shared/Microsoft.NETCore.App]

This is happening across multiple machine hardware configurations.
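To see at a glance which versions changed between the two `dotnet --info` dumps above, the outputs can be diffed field by field. This is a rough sketch that assumes the layout shown above (unindented section headers ending in a colon, indented `key: value` lines). The function names are hypothetical.

```python
def parse_dotnet_info(text: str) -> dict:
    """Collect indented 'key: value' lines, keyed as 'section / key'."""
    section = ""
    fields = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        indented = raw[0].isspace()
        if not indented and line.endswith(":"):
            # Unindented header such as ".NET SDK:" or "Host:".
            section = line[:-1]
        elif indented and ":" in line:
            key, _, value = line.strip().partition(":")
            if value.strip():
                fields[f"{section} / {key.strip()}"] = value.strip()
    return fields


def diff_dotnet_info(before: str, after: str) -> dict:
    """Return {field: (before_value, after_value)} for fields that differ."""
    a, b = parse_dotnet_info(before), parse_dotnet_info(after)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Applied to the two dumps above, the result would include entries such as the SDK version (9.0.100-preview.6.24319.5 to 9.0.100-preview.6.24320.9) and the host version (9.0.0-preview.6.24307.2 to 9.0.0-preview.6.24319.11).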

Other information

No response

jkotas commented 5 days ago

> Pretty sure I have a CoreDump from some of these failed runs if that would be useful.

Yes, that would be useful. Are you able to extract the stacktrace from the coredumps? It would help with routing of this issue.

(https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-2-analyze-core-dumps-lldb-debugger has the steps.)

jkotas commented 5 days ago

Example of a crash: https://dev.azure.com/dnceng/internal/_build/results?buildId=2478580&view=ms.vss-test-web.build-test-results-tab&runId=53769489&resultId=100053&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Crash during GC at:

0:000> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     libcoreclr!MethodTable::GetFlag [/__w/1/s/src/coreclr/vm/methodtable.h @ 3655] 
01 (Inline Function) --------`--------     libcoreclr!MethodTable::HasComponentSize [/__w/1/s/src/coreclr/vm/../gc/gcinterface.h @ 1699] 
02 (Inline Function) --------`--------     libcoreclr!SVR::my_get_size+0x7 [/crossrootfs/x64/usr/include/stdint.h @ 11552] 
03 (Inline Function) --------`--------     libcoreclr!SVR::gc_heap::add_to_promoted_bytes+0x7 [/crossrootfs/x64/usr/include/stdint.h @ 26299] 
04 00007a14`cb50b840 00007a15`58e5ad7a     libcoreclr!SVR::gc_heap::mark_object_simple1+0xab7 [/crossrootfs/x64/usr/include/stdint.h @ 27115] 
05 00007a14`cb50b8d0 00007a15`58e61cf0     libcoreclr!SVR::gc_heap::mark_object_simple+0x30a [/__w/1/s/src/coreclr/gc/gc.cpp @ 15732480] 
06 (Inline Function) --------`--------     libcoreclr!SVR::gc_heap::mark_through_cards_helper+0xba [/__w/1/s/src/coreclr/gc/gc.cpp @ 41065] 
07 00007a14`cb50b940 00007a15`58e4b6d4     libcoreclr!SVR::gc_heap::mark_through_cards_for_uoh_objects+0xbd0 [/crossrootfs/x64/usr/include/stdint.h @ 46548] 
08 00007a14`cb50ba90 00007a15`58e45949     libcoreclr!SVR::gc_heap::mark_phase+0xe94 [/__w/1/s/src/coreclr/gc/gc.cpp @ 29669] 
09 00007a14`cb50bb70 00007a15`58e2b465     libcoreclr!SVR::gc_heap::gc1+0x2c9 [/__w/1/s/src/coreclr/gc/gc.cpp @ 15732480] 
0a 00007a14`cb50bc40 00007a15`58e27e8d     libcoreclr!SVR::gc_heap::garbage_collect+0xa85 [/__w/1/s/src/coreclr/gc/gc.cpp @ 24361] 
0b 00007a14`cb50bce0 00007a15`58e26906 (T) libcoreclr!SVR::gc_heap::gc_thread_function+0x157d [/__w/1/s/src/coreclr/gc/gc.cpp @ 7175] 
0c 00007a14`cb50bd60 00007a15`58d4583e     libcoreclr!SVR::gc_heap::gc_thread_stub+0x31 [/__w/1/s/src/coreclr/gc/gc.cpp @ 37262] 

The GC heap is corrupted:

0:000> !verifyheap
*** WARNING: Unable to verify timestamp for doublemapper (deleted)
Heap Segment          Object           Failure                          Reason
1    79d443483540     79d4dc35f8f0     InvalidObjectReference           Object 79d4dc35f8f0 has a bad member at offset 8: 79d4e0600a98
3    79d443483de0     79d4df004068     InvalidObjectReference           Object 79d4df004068 has a bad member at offset 10: 79d4e0600a98
3    79d443483de0     79d4df0040b8     InvalidObjectReference           Object 79d4df0040b8 has a bad member at offset 8: 79d4e0600a98
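All three failures listed above point at the same bad address, 79d4e0600a98. With longer !verifyheap output, grouping failures by the referenced address can make such patterns easier to spot. A small sketch, assuming the "Reason" text has the form shown above (the helper name is hypothetical):

```python
import re
from collections import defaultdict

# Matches the "Reason" column text printed by !verifyheap above.
BAD_MEMBER = re.compile(
    r"Object ([0-9a-f]+) has a bad member at offset ([0-9a-f]+): ([0-9a-f]+)"
)


def group_by_target(verifyheap_output: str) -> dict:
    """Map each bad member address to the (object, offset) pairs that reference it."""
    targets = defaultdict(list)
    for obj, offset, target in BAD_MEMBER.findall(verifyheap_output):
        targets[target].append((obj, offset))
    return dict(targets)
```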

jkotas commented 5 days ago

This is likely a duplicate of https://github.com/dotnet/runtime/issues/102919, fixed by https://github.com/dotnet/runtime/pull/103301

jkotas commented 5 days ago

@LoopedBard3 Could you please let us know whether you still see it crashing after picking up a build that includes https://github.com/dotnet/runtime/pull/103301?

LoopedBard3 commented 4 days ago

Yup, will watch whether the update fixes the issue 👍.

LoopedBard3 commented 1 day ago

Looking at one of the recent failing runs, #103301 does not seem to have fixed the issue. The SDK version used in this recent build that still hit the failure had commit https://github.com/dotnet/sdk/commit/e18cfb7a09d74952d5e9c2448d31dee313e059bb and had a Microsoft.NETCore.App.Ref commit of https://github.com/dotnet/runtime/commit/a900bbf6fcf33fa2e799ed599ab86e00d6124c05 (from Version.Details.xml#L19-L20). If there is a different version/link I should be looking at to make sure we have the update, let me know.

jkotas commented 1 day ago

> Looking at one of the recent failing runs, https://github.com/dotnet/runtime/pull/103301

Would it be possible to set the DOTNET_GCDynamicAdaptationMode=0 environment variable in your build and see whether it still reproduces the crashes? It would be a very useful data point for us.
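One way to try this from a Python driver script is to pass a modified environment to the child build process. This is only a sketch. The helper names are hypothetical, and how the variable is actually threaded through the Helix work items depends on the pipeline setup.

```python
import os
import subprocess
import sys


def env_with_gc_adaptation_off(base=None) -> dict:
    """Copy the environment and set DOTNET_GCDynamicAdaptationMode=0."""
    env = dict(os.environ if base is None else base)
    env["DOTNET_GCDynamicAdaptationMode"] = "0"
    return env


def run_with_gc_adaptation_off(cmd):
    """Run cmd with the variable set only for that child process."""
    return subprocess.run(
        list(cmd), env=env_with_gc_adaptation_off(), capture_output=True, text=True
    )
```

Setting the variable per child process keeps the rest of the job unchanged, so a run with and without it can be compared directly.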

dotnet-policy-service[bot] commented 1 day ago

This issue has been marked needs-author-action and may be missing some important information.

JulieLeeMSFT commented 12 hours ago

This error is hit in various superpmi collect test legs for Linux. https://dev.azure.com/dnceng/internal/_build/results?buildId=2475019&view=logs&j=51e06289-9d30-5d49-3504-00701cf41df4&t=f1526e20-d436-504d-a3b0-15f037d2d591