Closed am11 closed 2 years ago
Tagging subscribers to this area: @agocke, @vitek-karas, @vsadov See info in area-owners.md if you want to be subscribed.
| Author: | am11 |
|---|---|
| Assignees: | - |
| Labels: | `arch-arm64`, `os-mac-os-x`, `area-Single-File` |
| Milestone: | - |
It broke in the `main` branch on Nov 3, 2021.
@jkoritzinsky, I have bisected the commits and found that the first commit (since the .NET 6 release) at which the single-file app fails on osx-arm64 is 24e7a4a1a101d91b6666dc6f44137574246fdd9c (it was working until the previous commit, c87e932d2b38b2929a8b1deb798682a3b122aa85). With a debug build, it fails an assertion:
Assert failure(PID 70129 [0x000111f1], Thread: 5669656 [0x568318]): Consistency check failed: System.Environment::GetProcessorCount is not registered using DllImportentry macro in qcallentrypoints.cpp
FAILED: pvTarget != nullptr
File: /Users/am11/projects/runtime-pr/src/coreclr/vm/dllimport.cpp Line: 5449
Image: /Users/am11/projects/testapp1/bin/Debug/net7.0/osx-arm64/publish/testapp1
zsh: abort bin/Debug/net7.0/osx-arm64/publish/testapp1
I have debugged a bit and noticed that after this line (which does not fail): https://github.com/dotnet/runtime/blob/24e7a4a1a101d91b6666dc6f44137574246fdd9c/src/coreclr/vm/dllimport.cpp#L2750
`p *ppEntryPointName` in lldb prints `GetProcessorCount` instead of `Environment_GetProcessorCount`. Any thoughts (or theories) what might be the cause of invalid mapping? 🤔
I have run another git-bisect session, this time marking the `ProcessorCount` error with `git bisect good` (basically ignoring it). Here is a more precise summary:
1. From the release/6.0 branch-off commit until `24e7a4a1a101d91b6666dc6f44137574246fdd9c~1` (the parent of https://github.com/dotnet/runtime/commit/24e7a4a1a101d91b6666dc6f44137574246fdd9c), everything was fine. That commit started to fail the QCall consistency check.
Assert failure(PID 41507 [0x0000a223], Thread: 6600605 [0x64b79d]): Consistency check failed: System.Environment::GetProcessorCount is not registered using DllImportentry macro in qcallentrypoints.cpp
FAILED: pvTarget != nullptr
File: /Users/am11/projects/runtime-pr/src/coreclr/vm/dllimport.cpp Line: 5436
Image: /Users/am11/projects/testapp1/bin/Debug/net7.0/osx-arm64/publish/testapp1
2. From 24e7a4a1a101d91b6666dc6f44137574246fdd9c until `bcd35278ca879554ed98e522c007dc0025a19303~1`, the same consistency check was failing. With the latter commit, a different assertion started to fail earlier in the execution. This is still the case at the tip of the `main` branch.
Assert failure(PID 26205 [0x0000665d], Thread: 6557904 [0x6410d0]): Compiler optimization assumption invalid: EE expects method to exist: System.String:Ctor Sig pointer: 0000000105317690
FAILED: pMD != 0
File: /Users/am11/projects/runtime-pr/src/coreclr/vm/binder.cpp Line: 125
Image: /Users/am11/projects/testapp1/bin/Debug/net7.0/osx-arm64/publish/testapp1
If they are not related in terms of root-cause, then fixing 2 first will bring it back to state of 1.
@jkotas, (I can create a separate issue for 2 if needed) it looks like the issue is with the meta signature of `METHOD__STRING__CTORF_CHARARRAY`, which has its first byte set to `0`, while the one computed by `MethodDesc::GetSigFromMetadata` has the value `32` (which is probably incorrect?). Consequently, this comparison is failing:
https://github.com/dotnet/runtime/blob/9b3b937eb364dda4f91b6b5288c83f4e4f45e7e3/src/coreclr/vm/siginfo.cpp#L4281
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 10.1
frame #0: 0x00000001002a8a10 testapp1`MetaSig::CompareMethodSigs(pSignature1="", cSig1=5, pModule1=0x00000001764c0000, pSubst1=0x0000000000000000, pSignature2=" \U00000001\U0000000e\U0000001d\U00000003\a \U00000003\U00000001\U0000001d\U00000003\b\b2\U00000001", cSig2=5, pModule2=0x00000001764c0000, pSubst2=0x0000000000000000, skipReturnTypeSig=NO, pVisited=0x0000000000000000) at siginfo.cpp:4281:17
4278 (cSig1 == cSig2) &&
4279 (pSubst1 == NULL) &&
4280 (pSubst2 == NULL) &&
-> 4281 (memcmp(pSig1, pSig2, cSig1) == 0))
4282 {
4283 return TRUE;
4284 }
Target 0: (testapp1) stopped.
(lldb) p (int)memcmp(pSig1, pSig2, cSig1)
(int) $300 = -32
(lldb) p cSig1
(DWORD) $301 = 5
(lldb) memory read -s1 -fu -c5 pSig1 --force
0x100e0429e: 0
0x100e0429f: 1
0x100e042a0: 14
0x100e042a1: 29
0x100e042a2: 3
(lldb) memory read -s1 -fu -c5 pSig2 --force
0x108684d18: 32
0x108684d19: 1
0x108684d1a: 14
0x108684d1b: 29
0x108684d1c: 3
If I jump the PC to line 4283 and continue, the same 32 vs. 0 issue shows up for other string methods. For the non-string methods (like `METHOD__CASTHELPERS__ISINSTANCEOFANY`, `METHOD__CASTHELPERS__UNBOX`, etc.), the comparison succeeds because both `pSig1` and `pSig2` have 0 in the first byte.
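To make the two byte dumps easier to read, here is a minimal Python sketch that decodes them. It assumes the ECMA-335 method-signature layout (calling-convention byte, parameter count, return type, parameter types), in which `0x20` (decimal 32) is the HASTHIS flag, and it handles only the element types that occur in this dump. The helper names `describe_sig` and `memcmp_like` are made up for illustration and are not runtime APIs.

```python
# Calling-convention flag and element types per ECMA-335 Partition II.
# Only the values that occur in the lldb dump are listed here.
HASTHIS = 0x20
ELEMENT_TYPES = {0x01: "void", 0x03: "char", 0x0E: "string", 0x1D: "szarray"}

def describe_sig(sig):
    """Decode a simple (non-generic) method signature made of 1-byte items."""
    callconv, param_count = sig[0], sig[1]
    names = []
    i = 2
    while i < len(sig):
        if ELEMENT_TYPES[sig[i]] == "szarray":
            # An szarray marker is followed by its element type.
            names.append(ELEMENT_TYPES[sig[i + 1]] + "[]")
            i += 2
        else:
            names.append(ELEMENT_TYPES[sig[i]])
            i += 1
    kind = "instance" if callconv & HASTHIS else "static"
    ret, params = names[0], names[1:]
    assert len(params) == param_count
    return f"{kind} {ret} ({', '.join(params)})"

def memcmp_like(a, b):
    """Difference of the first mismatching bytes (same sign as C memcmp)."""
    for x, y in zip(a, b):
        if x != y:
            return x - y
    return 0

p_sig1 = bytes([0, 1, 14, 29, 3])   # first byte 0 (pSig1 in the dump)
p_sig2 = bytes([32, 1, 14, 29, 3])  # first byte 32 (pSig2 in the dump)

print(describe_sig(p_sig1))
print(describe_sig(p_sig2))
print(memcmp_like(p_sig1, p_sig2))
```

Under these assumptions the two signatures differ only in the HASTHIS bit, that is, a static versus an instance method with the same `string (char[])` shape. The sketch does not say which of the two is the correct one.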
Neither of the two failure modes makes sense. I think that the problem is likely bad C++ codegen or something low-level like that.
p *ppEntryPointName in lldb prints GetProcessorCount instead of Environment_GetProcessorCount. Any thoughts (or theories) what might be the cause of invalid mapping? 🤔
Maybe mismatching bits - like a new `singlefilehost` and old `System.Private.CoreLib.dll`.
It would be hard to mismatch them though, since we build them together.
Yeah, I agree. This looks like mismatched bits.
@VSadov will it be fixed in the next preview?
When I try the scenario with the latest daily build, it looks like the bits are matching but R2R is broken.
EXC_BAD_ACCESS (code=1, address=0x580000ead28000d1)
- With `-p:PublishTrimmed=true`, which results in an IL-only app, it runs and prints "Hello World".
- With `-p:PublishReadyToRun=true`, the app fails again with BAD_ACCESS.
- With `export COMPlus_ZapDisable=1`, the app works.

It looks like R2R is broken in singlefile on OSX. It is also likely that we are not running host tests on osx-arm64
BTW, when targeting `osx-x64`, the app runs on the same machine (M1).
I will continue investigating.
the build that I picked up is:
strings ./testapp1 | grep @Commit
@(#)Version 7.0.22.22403 @Commit: 47d9c43ab1f10a98a348a28b3fd7ed9c4d35328b
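For anyone who wants to script the same check, the following Python sketch pulls the version and commit out of raw bytes. It assumes the marker always has the `@(#)Version <version> @Commit: <40-hex-sha>` shape shown above; the function name `find_runtime_commit` is hypothetical.

```python
import re

# Pattern for the "@(#)Version <version> @Commit: <sha>" marker shown above.
MARKER = re.compile(rb"@\(#\)Version (\S+) @Commit: ([0-9a-f]{40})")

def find_runtime_commit(data):
    """Return (version, commit) found in raw binary contents, or None."""
    match = MARKER.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii"), match.group(2).decode("ascii")

# Sample bytes standing in for the contents of the published executable.
sample = (
    b"\x00\x01@(#)Version 7.0.22.22403 "
    b"@Commit: 47d9c43ab1f10a98a348a28b3fd7ed9c4d35328b\x00"
)
print(find_runtime_commit(sample))
```

To run it against a real binary, read the file in binary mode and pass the bytes, for example `find_runtime_commit(open("./testapp1", "rb").read())`.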
It is also likely that we are not running host tests on osx-arm64
Single file tests were added to outerloop test pipeline in https://github.com/dotnet/runtime/commit/7677f7dc71fafad1f35639803b86d05b0bd7df72, and removed in https://github.com/dotnet/runtime/commit/f29ba20bec327dc18013abd0a867ab3a95448a73#diff-e2e027b9777fc35f4a8243db97ce50f7dac99b3cee9465c5325d283c34d2d872L655 for cost saving.
I think those are good tests to validate with frequent runtime changes, and we should bring them back with the addition of osx-arm64. AFAIK, there is nothing else in any pipeline testing single-file host (in runtime, sdk or installer repos). Issues are reported usually after the GA release.
I "think" we have an E2E test in the SDK repo (didn't check to be sure) - unfortunately I know that SDK or installer repo doesn't run tests on osx-arm64 either.
It looks like we sometimes see PE sections overlapping in memory. This is either a loader bug or a crossgen bug, most likely crossgen. Either way, we should be able to lay out a PE that we ourselves produce.
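As an illustration of what "overlapping in memory" means here, this is a small Python sketch that flags sections whose mapped address ranges intersect. The section names, addresses, and sizes are hypothetical and are not taken from a real image, and `find_overlaps` is an illustrative helper, not part of the loader or crossgen.

```python
def find_overlaps(sections):
    """Return pairs of section names whose [start, start + size) ranges intersect."""
    ordered = sorted(sections, key=lambda s: s[1])
    overlaps = []
    for i, (name_a, start_a, size_a) in enumerate(ordered):
        for name_b, start_b, _ in ordered[i + 1:]:
            if start_b >= start_a + size_a:
                break  # sorted by start, so later sections cannot overlap either
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical (name, virtual address, virtual size) entries.
layout = [
    (".text", 0x1000, 0x3000),
    (".data", 0x3800, 0x1000),  # starts before .text ends
    (".reloc", 0x5000, 0x200),
]
print(find_overlaps(layout))
```

A correctly laid out image would yield an empty list from a check like this.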
Same error with dotnet 6.0 on M1
thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x580000ead2800051)
frame #0: 0x00000001000b9c58 test`DictionaryLayout::FindToken(MethodTable*, LoaderAllocator*, int, SigBuilder*, unsigned char*, DictionaryEntrySignatureSource, CORINFO_RUNTIME_LOOKUP*, unsigned short*) + 140
test`DictionaryLayout::FindToken:
-> 0x1000b9c58 <+140>: ldr x9, [x9, #0x8]
0x1000b9c5c <+144>: cbz x9, 0x1000b9c70 ; <+164>
0x1000b9c60 <+148>: ldr x12, [x9]
0x1000b9c64 <+152>: ldrh w9, [x12]
Target 0: (test) stopped.
Fix:
export COMPlus_ZapDisable=1
Pretty sure it was working fine with .NET 6 in March, without disabling zap. It is perhaps a recent regression? I haven't tested with latest patch version.
Here are the outputs:
→ dotnet --version
6.0.300
→ uname -a
Darwin MBProMax.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64
→ cat Program.cs
// See https://aka.ms/new-console-template for more information
var log = (object msg) => Console.WriteLine((new DateTimeOffset(DateTime.UtcNow).ToUnixTimeSeconds()).ToString() + ": " + msg);
log("Hello, World!");
→ dotnet publish --use-current-runtime -p:PublishSingleFile=true --self-contained -c Release
Microsoft (R) Build Engine version 17.2.0+41abc5629 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.
Determining projects to restore...
Restored /private/tmp/test/test.csproj (in 79 ms).
test -> /private/tmp/test/bin/Release/net6.0/osx-arm64/test.dll
Optimizing assemblies for size, which may change the behavior of the app. Be sure to test after publishing. See: https://aka.ms/dotnet-illink
test -> /private/tmp/test/bin/Release/net6.0/osx-arm64/publish/
→ /private/tmp/test/bin/Release/net6.0/osx-arm64/publish/test
zsh: segmentation fault /private/tmp/test/bin/Release/net6.0/osx-arm64/publish/test
@am11 can we re-open this for v6?
There is a separate issue for 6.0 - https://github.com/dotnet/runtime/issues/69923
Description
A single-file app published with the latest build of .NET 7 crashes on execution.
Reproduction Steps
Expected behavior
Displays `Hello, World!`
Actual behavior
Regression?
Yes, it works with .NET 6.
Known Workarounds
Publish as self-contained, without `-p:PublishSingleFile=true`.
Configuration
Daily build
Other information
I tried debugging it with native symbols (of release singlefilehost), the clrstack looks like this: