dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.18k stars 4.72k forks source link

Test failure BasicEventSourceTests.TestsWrite.Test_Write_T_ETW #91769

Closed v-wenyuxu closed 11 months ago

v-wenyuxu commented 1 year ago

Failed in: runtime-coreclr libraries-jitstress 20230907.1

Failed tests:

net9.0-windows-Release-arm64-CoreCLR_checked-jitminopts-Windows.11.Arm64.Open
    - BasicEventSourceTests.TestsWrite.Test_Write_T_ETW

Error message:

System.Runtime.InteropServices.COMException : The instance name passed was not recognized as valid by a WMI data provider. (0x80071069)

Stack trace:

   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHR(Int32 errorCode) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/Marshal.cs:line 849
   at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.Flush()
   at BasicEventSourceTests.EtwListener.Dispose() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/Harness/EtwListener.cs:line 97
   at BasicEventSourceTests.TestsWrite.Test_Write_T_ETW() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsWrite.Etw.cs:line 27
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
ghost commented 1 year ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
**Failed in:** [runtime-coreclr libraries-jitstress 20230907.1](https://dev.azure.com/dnceng-public/public/_build/results?buildId=398021&view=ms.vss-test-web.build-test-results-tab&runId=8590802&resultId=137551&paneView=debug) **Failed tests:** ``` net9.0-windows-Release-arm64-CoreCLR_checked-jitminopts-Windows.11.Arm64.Open - BasicEventSourceTests.TestsWrite.Test_Write_T_ETW ``` **Error message:** ``` System.Runtime.InteropServices.COMException : The instance name passed was not recognized as valid by a WMI data provider. (0x80071069) ``` **Stack trace:** ``` at System.Runtime.InteropServices.Marshal.ThrowExceptionForHR(Int32 errorCode) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/Marshal.cs:line 849 at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.Flush() at BasicEventSourceTests.EtwListener.Dispose() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/Harness/EtwListener.cs:line 97 at BasicEventSourceTests.TestsWrite.Test_Write_T_ETW() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsWrite.Etw.cs:line 27 at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor) at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36 at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57 ```
Author: v-wenyuxu
Assignees: -
Labels: `arch-arm64`, `os-windows`, `JitStress`, `area-Diagnostics-coreclr`, `untriaged`, `needs-area-label`
Milestone: -
ghost commented 1 year ago

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti See info in area-owners.md if you want to be subscribed.

Issue Details
**Failed in:** [runtime-coreclr libraries-jitstress 20230907.1](https://dev.azure.com/dnceng-public/public/_build/results?buildId=398021&view=ms.vss-test-web.build-test-results-tab&runId=8590802&resultId=137551&paneView=debug) **Failed tests:** ``` net9.0-windows-Release-arm64-CoreCLR_checked-jitminopts-Windows.11.Arm64.Open - BasicEventSourceTests.TestsWrite.Test_Write_T_ETW ``` **Error message:** ``` System.Runtime.InteropServices.COMException : The instance name passed was not recognized as valid by a WMI data provider. (0x80071069) ``` **Stack trace:** ``` at System.Runtime.InteropServices.Marshal.ThrowExceptionForHR(Int32 errorCode) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/Marshal.cs:line 849 at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.Flush() at BasicEventSourceTests.EtwListener.Dispose() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/Harness/EtwListener.cs:line 97 at BasicEventSourceTests.TestsWrite.Test_Write_T_ETW() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsWrite.Etw.cs:line 27 at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor) at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36 at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57 ```
Author: v-wenyuxu
Assignees: -
Labels: `arch-arm64`, `area-System.Diagnostics.Tracing`, `os-windows`, `JitStress`, `area-Diagnostics-coreclr`, `needs-area-label`
Milestone: 9.0.0
v-wenyuxu commented 1 year ago

Failed again in: runtime-coreclr libraries-jitstress 20230912.1

Failed tests:

net9.0-windows-Release-arm64-CoreCLR_checked-zapdisable-Windows.11.Arm64.Open
    - BasicEventSourceTests.TestsWrite.Test_Write_T_ETW

Error message:

System.Runtime.InteropServices.COMException : The instance name passed was not recognized as valid by a WMI data provider. (0x80071069)

Stack trace:

   at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.Flush()
   at BasicEventSourceTests.EtwListener.Dispose() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/Harness/EtwListener.cs:line 97
   at BasicEventSourceTests.TestsWrite.Test_Write_T_ETW() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsWrite.Etw.cs:line 27
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
danmoseley commented 1 year ago

I see two problems

  1. the tests in this assembly are expected to run entirely serially (for example, many do new EtwListener() which means they all use the same file:
        public EtwListener(string dataFileName = "EventSourceTestData.etl", string sessionName = "EventSourceTestSession")

    for this reason they try hard to run serially, disabling parallelism 3 ways -- https://github.com/dotnet/runtime/blob/e555a0e16635c6ff2d88366f19d0781ac1fef0ee/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/XUnitAssemblyAttributes.cs#L7

this is apparently not working however

  Discovering: System.Diagnostics.Tracing.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Diagnostics.Tracing.Tests (found 45 of 49 test cases)
  Starting:    System.Diagnostics.Tracing.Tests (parallel test collections = on, max threads = 4) <<<<<<<<<<<<<

and that may be causing this

      System.Runtime.InteropServices.COMException : The instance name passed was not recognized as valid by a WMI data provider. (0x80071069)
      Stack Trace:
           at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.Flush()
        /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/Harness/EtwListener.cs(97,0): at BasicEventSourceTests.EtwListener.Dispose()
        /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsWrite.Etw.cs(27,0): at BasicEventSourceTests.TestsWrite.Test_Write_T_ETW()
           at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
        /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs(57,0): at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

Question is -- why is parallelism happening

  1. several tests create TraceEventSessions, and those are not safe for parallel construction, because they have some static members, including https://github.com/microsoft/perfview/blob/main/src/TraceEvent/TraceEventSession.cs#L1724 which is a SortedDictionary, and there is no locking to prevent it being populated by one thread while it's being read by another https://github.com/microsoft/perfview/blob/main/src/TraceEvent/TraceEventSession.cs#L1634 SortedDictionary is not safe for concurrent read and write.

hence this output

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Generic.SortedSet`1.Node.get_Is4Node() in /_/src/libraries/System.Collections/src/System/Collections/Generic/SortedSet.cs:line 1622
   at System.Collections.Generic.SortedSet`1.AddIfNotPresent(T item) in /_/src/libraries/System.Collections/src/System/Collections/Generic/SortedSet.cs:line 339
   at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item) in /_/src/libraries/System.Collections/src/System/Collections/Generic/SortedDictionary.cs:line 901
   at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.get_ProviderNameToGuid()
   at Microsoft.Diagnostics.Tracing.Session.TraceEventProviders.GetProviderGuidByName(String name)
   at Microsoft.Diagnostics.Tracing.Session.TraceEventSession.EnableProvider(String providerName, TraceEventLevel providerLevel, UInt64 matchAnyKeywords, TraceEventProviderOptions options)
   at BasicEventSourceTests.TestsManifestGeneration.<Test_EventSource_EtwManifestGenerationRollover>b__3_0() in /_/src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestsManifestGeneration.Etw.cs:line 98

I first assumed the first issue is simply exposing the second. But these are happening in remote exec processes. Only one thing should be happening in the process. Not clear what is going on. But whatever is causing this to be exposed -- is it a bug that should be fixed in TraceEventSession? (in dotnet/perfview repo -- @brianrob )?

https://[helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-c3a451dbf00f46df93/System.Diagnostics.Tracing.Tests/1/console.a62658f8.log?helixlogtype=result](https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-c3a451dbf00f46df93/System.Diagnostics.Tracing.Tests/1/console.a62658f8.log?helixlogtype=result)

@noahfalk @tarekgh

tarekgh commented 1 year ago

CC @davmason as it is EventSource issue.

brianrob commented 1 year ago

My best guess is that the NullReferenceException is due to unprotected concurrent calls to TraceEventSession.ProviderNameToGuid. Once this fails, the session likely doesn't get fully created and calling Flush results in further failures because the underlying session wasn't actually created.

From what I can see, the fix is to address the unexpected concurrency. TraceEventSession, especially with the same session name and file, is not designed to be used concurrently and can result in all sorts of unexpected behaviors.

danmoseley commented 1 year ago

TraceEventSession, especially with the same session name and file, is not designed to be used concurrently and can result in all sorts of unexpected behaviors.

@brianrob is constructing two TraceEventSession concurrently (with different file and name parameters) expected to be unsafe?

brianrob commented 1 year ago

TraceEventSession, especially with the same session name and file, is not designed to be used concurrently and can result in all sorts of unexpected behaviors.

@brianrob is constructing two TraceEventSession concurrently (with different file and name parameters) expected to be unsafe?

This should be fine. The key is that the name and file parameters are used to create machine-wide constructs, and that's why it's a problem if they are the same (unless you're explicitly attaching to an existing session, but that's another story and not what you're doing here). If you wanted to run tests in parallel, you could change the session names and ETL file paths and that should be OK. You just don't want too much parallelism because each machine has a maximum number of concurrent sessions, and it's not difficult to hit that, especially for Microsoft machines.

danmoseley commented 1 year ago

@brianrob the reason I'm asking is that it appears that is not safe eg., the static SortedDictionary can get written by two concurrent threads.

brianrob commented 1 year ago

I don't think there is a strong statement around these APIs as being thread-safe, but many of them are. It feels to me like you should be able to call EnableProvider on two different sessions concurrently. I'll make a fix for this.

brianrob commented 1 year ago

Posted https://github.com/microsoft/perfview/pull/1934 to address this.

danmoseley commented 1 year ago

OK, this issue is now tracking why our elaborate protections against any test concurrency in System.Diagnostics.Tracing.Tests are not doing the job...

davmason commented 1 year ago

@danmoseley is there a better person to assign this to since we're now tracking test concurrency issues? I don't believe I'm the right person to investigate issues in the test runner