Open LoopedBard3 opened 5 days ago
Pretty sure I have a CoreDump from some of these failed runs if that would be useful.
Yes, that would be useful. Are you able to extract the stack trace from the core dumps? It would help with routing this issue.
(https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-2-analyze-core-dumps-lldb-debugger has the steps.)
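The linked lab boils down to opening the dump in lldb and dumping the native and managed stacks via SOS. A minimal sketch, assuming `lldb` is installed and SOS has been set up once with `dotnet-sos install` (the core file name here is hypothetical; substitute the dump from the failed workitem):

```shell
# Hypothetical path to the core dump collected from the failed run.
CORE=./core.1234

if command -v lldb >/dev/null 2>&1; then
  # Open the dump against the dotnet host, print the native backtrace ("bt")
  # and the managed stack ("clrstack", provided by SOS), then exit.
  lldb --core "$CORE" "$(command -v dotnet)" \
       --one-line "bt" \
       --one-line "clrstack" \
       --one-line "quit"
else
  echo "lldb not found; install it first (e.g. apt-get install lldb)"
fi
```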
Crash during GC at:
```
0:000> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     libcoreclr!MethodTable::GetFlag [/__w/1/s/src/coreclr/vm/methodtable.h @ 3655]
01 (Inline Function) --------`--------     libcoreclr!MethodTable::HasComponentSize [/__w/1/s/src/coreclr/vm/../gc/gcinterface.h @ 1699]
02 (Inline Function) --------`--------     libcoreclr!SVR::my_get_size+0x7 [/crossrootfs/x64/usr/include/stdint.h @ 11552]
03 (Inline Function) --------`--------     libcoreclr!SVR::gc_heap::add_to_promoted_bytes+0x7 [/crossrootfs/x64/usr/include/stdint.h @ 26299]
04 00007a14`cb50b840 00007a15`58e5ad7a     libcoreclr!SVR::gc_heap::mark_object_simple1+0xab7 [/crossrootfs/x64/usr/include/stdint.h @ 27115]
05 00007a14`cb50b8d0 00007a15`58e61cf0     libcoreclr!SVR::gc_heap::mark_object_simple+0x30a [/__w/1/s/src/coreclr/gc/gc.cpp @ 15732480]
06 (Inline Function) --------`--------     libcoreclr!SVR::gc_heap::mark_through_cards_helper+0xba [/__w/1/s/src/coreclr/gc/gc.cpp @ 41065]
07 00007a14`cb50b940 00007a15`58e4b6d4     libcoreclr!SVR::gc_heap::mark_through_cards_for_uoh_objects+0xbd0 [/crossrootfs/x64/usr/include/stdint.h @ 46548]
08 00007a14`cb50ba90 00007a15`58e45949     libcoreclr!SVR::gc_heap::mark_phase+0xe94 [/__w/1/s/src/coreclr/gc/gc.cpp @ 29669]
09 00007a14`cb50bb70 00007a15`58e2b465     libcoreclr!SVR::gc_heap::gc1+0x2c9 [/__w/1/s/src/coreclr/gc/gc.cpp @ 15732480]
0a 00007a14`cb50bc40 00007a15`58e27e8d     libcoreclr!SVR::gc_heap::garbage_collect+0xa85 [/__w/1/s/src/coreclr/gc/gc.cpp @ 24361]
0b 00007a14`cb50bce0 00007a15`58e26906 (T) libcoreclr!SVR::gc_heap::gc_thread_function+0x157d [/__w/1/s/src/coreclr/gc/gc.cpp @ 7175]
0c 00007a14`cb50bd60 00007a15`58d4583e     libcoreclr!SVR::gc_heap::gc_thread_stub+0x31 [/__w/1/s/src/coreclr/gc/gc.cpp @ 37262]
```
The GC heap is corrupted:
```
0:000> !verifyheap
*** WARNING: Unable to verify timestamp for doublemapper (deleted)
Heap Segment Object       Failure                Reason
1    79d443483540 79d4dc35f8f0 InvalidObjectReference Object 79d4dc35f8f0 has a bad member at offset 8: 79d4e0600a98
3    79d443483de0 79d4df004068 InvalidObjectReference Object 79d4df004068 has a bad member at offset 10: 79d4e0600a98
3    79d443483de0 79d4df0040b8 InvalidObjectReference Object 79d4df0040b8 has a bad member at offset 8: 79d4e0600a98
```
This is likely a duplicate of https://github.com/dotnet/runtime/issues/102919, fixed by https://github.com/dotnet/runtime/pull/103301
@LoopedBard3 Could you please let us know whether you still see it crashing after picking up a build that includes https://github.com/dotnet/runtime/pull/103301?
Yup, will watch for if the update fixes the issue 👍.
Looking at one of the recent failing runs, #103301 does not seem to have fixed the issue. The SDK version used in this recent build that still hit the failure was at commit https://github.com/dotnet/sdk/commit/e18cfb7a09d74952d5e9c2448d31dee313e059bb, with a Microsoft.NETCore.App.Ref commit of https://github.com/dotnet/runtime/commit/a900bbf6fcf33fa2e799ed599ab86e00d6124c05 (from Version.Details.xml#L19-L20). If there is a different version/link I should be looking at to make sure we have the update, let me know.
Would it be possible to set the DOTNET_GCDynamicAdaptationMode=0 environment variable in your build and see whether the crashes still reproduce? It would be a very useful data point for us.
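For reference, a sketch of wiring that in before the repro command (DOTNET_GCDynamicAdaptationMode=0 disables the dynamic heap-count adaptation (DATAS) GC mode; the benchmark command shown in the comment is the one from the issue description):

```shell
# Disable dynamic GC heap-count adaptation (DATAS) for this process tree.
export DOTNET_GCDynamicAdaptationMode=0
echo "DOTNET_GCDynamicAdaptationMode=$DOTNET_GCDynamicAdaptationMode"

# Then run the usual repro, e.g.:
# python3 ./scripts/benchmarks_ci.py --csproj ./src/benchmarks/micro/MicroBenchmarks.csproj ...
```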
This issue has been marked needs-author-action and may be missing some important information.
This error is hit in various superpmi collect test legs for Linux. https://dev.azure.com/dnceng/internal/_build/results?buildId=2475019&view=logs&j=51e06289-9d30-5d49-3504-00701cf41df4&t=f1526e20-d436-504d-a3b0-15f037d2d591
Description
In the dotnet-runtime-perf pipeline, we are seeing multiple Linux jobs hitting the error
dotnet/x64/sdk/9.0.100-preview.7.24323.5/Roslyn/Microsoft.CSharp.Core.targets(85,5): error MSB6006: "csc.dll" exited with code 139.
when building our MicroBenchmarks.csproj file for BDN testing. This occurs on 0 to 3 of the 30 Helix workitems we send out for each job, with no consistency in which of the 30 workitems is affected or which agent machine hits the error. Pretty sure I have a CoreDump from some of these failed runs if that would be useful. Potentially related to: https://github.com/dotnet/runtime/issues/57558
Reproduction Steps
Need to test more, but this should work for reproing; as mentioned in the description, hitting the error is not consistent.
Steps (high level):
```
python3 ./scripts/benchmarks_ci.py --csproj ./src/benchmarks/micro/MicroBenchmarks.csproj --incremental no --architecture x64 -f net9.0 --dotnet-versions 9.0.100-preview.6.24320.9 --bdn-arguments="--anyCategories Libraries Runtime --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29"
```
Steps (inner commands; these should match, but ping me if a step seems to be missing):
```
dotnet-install.sh -InstallDir ./performance/tools/dotnet/x64 -Architecture x64 -Version 9.0.100-preview.6.24320.9
dotnet run --project ./src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net9.0 --no-restore --no-build -- --anyCategories Libraries Runtime "" --logBuildOutput --generateBinLog --partition-count 30 --partition-index 29 --artifacts ./artifacts/BenchmarkDotNet.Artifacts --packages ./artifacts/packages --buildTimeout 1200
```
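Since exit code 139 is 128 + 11, i.e. the compiler host process died with SIGSEGV, one way to chase an inconsistent repro is to loop the failing command until it segfaults. A hypothetical helper (the `run_until_segv` name and the attempt cap are made up for illustration; the stand-in `sh -c 'exit 139'` just fakes the crash exit code):

```shell
# Re-run a command until it exits with 139 (128 + SIGSEGV), up to 100 attempts.
run_until_segv() {
  attempt=0
  while [ "$attempt" -lt 100 ]; do
    attempt=$((attempt + 1))
    "$@"
    if [ $? -eq 139 ]; then
      echo "segfault on attempt $attempt"
      return 0
    fi
  done
  echo "no segfault in $attempt attempts"
  return 1
}

# Example with a stand-in command that fakes the crash exit code:
run_until_segv sh -c 'exit 139'
# prints "segfault on attempt 1"
```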
Expected behavior
Build is successful and continues to run the BenchmarkDotNet tests.
Actual behavior
The build fails
Full logs from an example run with the error are available: dotnet-runtime-perf Run 20240620.3. The specific partitions are Partition 2 and Partition 6 from the job 'Performance linux x64 release coreclr JIT micro perfowl NoJS False False False net9.0'.
Regression?
This started occurring between our runs dotnet-runtime-perf Run 20240620.2 and dotnet-runtime-perf Run 20240620.3.
The runtime repo comparison between these two jobs is https://github.com/dotnet/runtime/compare/4a7fe654d798a372f5786f026006437444f14f1e...b0c4728305b98c0ae22d90f72c805aecb628ba8c. Our performance repo also took one update, but it seems highly unlikely to be related: https://github.com/dotnet/performance/pull/4279. Version difference information is available in the information section below.
Known Workarounds
None
Configuration
.NET Version information:
Information from the first run with the error, dotnet-runtime-perf Run 20240620.3:
Information from the run before the error, dotnet-runtime-perf Run 20240620.2:
This is happening across multiple different machine hardware configurations.
Other information
No response