dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.95k stars 4.65k forks source link

[NativeAot] win-x86: Funclets getting double called #99687

Closed filipnavara closed 5 months ago

filipnavara commented 6 months ago

Several of the runtime tests fail with a similar symptom:

in try
  in try
    in try
    in finally
  in filter
in filter
    in finally
  in filter
in filter
caught an exception!

FAILED!

[EXPECTED OUTPUT]
in try
  in try
    in try
    in finally
  in filter
in filter
caught an exception!

[DIFF RESULT]
< caught an exception!
---
>     in finally

Logs of all failed tests that hit this symptom: testlogs.zip

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

jkotas commented 6 months ago

FWIW, I expect that there is going to be a long tail of issues related to Win x86 exception handling because of https://github.com/dotnet/runtime/pull/99191#issuecomment-1977373355 . I think we may need to get both CoreCLR and NAOT onto same plan to make Win x86 support in NAOT productized (or rethink the test strategy for Win x86 NAOT).

filipnavara commented 6 months ago

FWIW, I expect that there is going to be a long tail of issues related to Win x86 exception handling because of #99191 (comment) .

Very likely. I was hoping that there was some exposure of the code during linux-x86 CoreCLR bring up but apparently there are still gaps (possibly specific to the NativeAOT integration of the code).

I think we may need to get both CoreCLR and NAOT onto same plan to make Win x86 support in NAOT productized (or rethink the test strategy for Win x86 NAOT).

I'm focusing on getting the runtime tests passing in this configuration which is attacking the "test strategy" angle. I am open to exploring the CoreCLR+funclets option too. I just expect to run into the same issues regardless of whether we hit them first in CoreCLR or NativeAOT, so it makes sense to investigate them.

jkotas commented 6 months ago

exposure of the code during linux-x86 CoreCLR bring up

FWIW, linux-x86 on CoreCLR has not been brought up to our regular quality bar, and it is completely broken at the moment - https://github.com/dotnet/runtime/pull/96150#discussion_r1441676813 .

I am open to exploring the CoreCLR+funclets option too. I just expect to run into the same issues regardless of whether we hit them first in CoreCLR or NativeAOT, so it makes sense to investigate them.

Once you deal with the easy to repro functional issues, there are going to be stress related issues after that (GC reporting, etc.). We do not have the infrastructure like GC stress to find them on native AOT proactively. We depend on regular CoreCLR for that.

filipnavara commented 6 months ago

I suspect the cause will be switched argument order on the stack:

We do not have the infrastructure like GC stress to find them on native AOT proactively.

(I'd definitely prefer not to revive the GC stress code in NativeAOT.)

filipnavara commented 6 months ago

So far I tracked it down to the "tryEnd"/"tryLength" offset of EH_CLAUSE_FAULT being incorrect in the decoded EH info. This causes it to incorrectly match by code offset during the second pass processing.

filipnavara commented 6 months ago

win-x86: jitdump.x86.txt win-x64: jitdump.x64.txt

X86:

3 EH table entries, 0 duplicate clauses, 3 total EH entries reported to VM
setEHcount(cEH=3)
EH#0: try [0018..002A) handled by [002E..0053) (finally)
EH#1: try [000E..002A) handled by [0060..0071) filter at [0053..0060)
EH#2: try [0004..002B) handled by [007E..008F) filter at [0071..007E)

X64:

3 EH table entries, 0 duplicate clauses, 0 cloned finallys, 3 total EH entries reported to VM
setEHcount(cEH=3)
EH#0: try [0025..0032) handled by [0043..0075) (finally)
EH#1: try [0018..003B) handled by [0090..00B0) filter at [0075..0090)
EH#2: try [000B..003C) handled by [00CB..00EB) filter at [00B0..00CB)

The end offset of try in EH#0 on x86 is wrong.

filipnavara commented 6 months ago

I suspect this may be some artefact of FEATURE_EH_CALLFINALLY_THUNKS = 0.

filipnavara commented 6 months ago

Confirmed my suspicion. This fixes it:

--- a/src/coreclr/jit/targetx86.h
+++ b/src/coreclr/jit/targetx86.h
@@ -54,8 +54,13 @@
   #define FEATURE_EH               1       // To aid platform bring-up, eliminate exceptional EH clauses (catch, filter,
                                            // filter-handler, fault) and directly execute 'finally' clauses.

+#ifdef FEATURE_EH_FUNCLETS
+  #define FEATURE_EH_CALLFINALLY_THUNKS 1  // Generate call-to-finally code in "thunks" in the enclosing EH region,
+                                           // protected by "cloned finally" clauses.
+#else
   #define FEATURE_EH_CALLFINALLY_THUNKS 0  // Generate call-to-finally code in "thunks" in the enclosing EH region,
                                            // protected by "cloned finally" clauses.
+#endif
   #define ETW_EBP_FRAMED           1       // if 1 we cannot use EBP as a scratch register and must create EBP based
                                            // frames for most methods
   #define CSE_CONSTS               1       // Enable if we want to CSE constants