OPCFoundation / UA-.NETStandard

OPC Unified Architecture .NET Standard

OPC UA Server becomes unresponsive after a day or two #1635

Closed vinaybr closed 2 years ago

vinaybr commented 2 years ago

OPC UA Server becomes unresponsive after a day or two. In particular, I have observed this frequently with the COMIop sample server. Looking at the performance counters of the process, I see that the thread count is very high. Based on this, I took a dump file of the process, checked the call stacks of all threads using WinDbg, and found that all the threads are stuck in this call stack:

        Child SP               IP Call Site
000000B59187E6D8 00007ffb1de4d8c4 [GCFrame: 000000b59187e6d8] 
000000B59187E848 00007ffb1de4d8c4 [GCFrame: 000000b59187e848] 
000000B59187E898 00007ffb1de4d8c4 [HelperMethodFrame_1OBJ: 000000b59187e898] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000B59187E9F0 00007ffa64f4d832 Opc.Ua.Server.ServerInternalData.get_IsRunning()

I see around 1,400 threads stuck in this property. Please let me know if anyone else has faced a similar issue.

I have configured a Kepware simulation server behind the COMIop sample with two endpoints, one for the DA profile and one for HDA. I subscribed to a bunch of tags from the DA endpoint and a few from the HDA endpoint, left it running, and the data stops updating on the client (UA Expert) anywhere between 10 hours and 2 days later. All components are running on a single Windows 10 machine. The target runtime is .NET Core 3.1.

vinaybr commented 2 years ago

Hi @mregen,

What would be the next course of action here? We are currently stuck on this issue and have not been able to find a workaround.

mrsuciu commented 2 years ago

@vinaybr The IsRunning property contains a lock and, from what I can see, I would speculate that all the threads are stuck trying to acquire it. Did you check whether this lock is acquired on another stack and never released, or whether it is deadlocked against another lock?
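
For illustration, this is the general shape of such a property; a minimal sketch with hypothetical names, not the actual ServerInternalData code:

    // Minimal sketch (hypothetical names) of a lock-guarded status property.
    // If any thread holds m_dataLock for a long time (or forever, in a deadlock),
    // every caller of IsRunning blocks in Monitor.Enter, which shows up as the
    // Monitor.ReliableEnter frame at the top of the stuck stacks.
    public class ServerStateSketch
    {
        private readonly object m_dataLock = new object();
        private bool m_isRunning;

        public bool IsRunning
        {
            get
            {
                lock (m_dataLock) // blocks here while another thread holds m_dataLock
                {
                    return m_isRunning;
                }
            }
        }
    }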

vinaybr commented 2 years ago

@mrsuciu There are thousands of threads stuck. I checked a few of them at random and found that all of them were stuck in the getter of IsRunning.

mrsuciu commented 2 years ago

@vinaybr Can you share the dump file you've taken that captures this situation?

vinaybr commented 2 years ago

@mrsuciu you can download the dump file from here - https://www.sendthisfile.com/TygRfrMpWYtdCPffIJhFfgIG

mrsuciu commented 2 years ago

@vinaybr I had a quick look and from what I can see the following is happening (the stacks are hopefully at least partially correct, since this is probably release code and I do not have access to your PDBs):

Apparent problem:

OS Thread Id: 0xa164 (15)
        Child SP               IP Call Site
000000B58DC7DEC8 00007ffb1de4d8c4 [GCFrame: 000000b58dc7dec8] 
000000B58DC7E038 00007ffb1de4d8c4 [GCFrame: 000000b58dc7e038] 
000000B58DC7E088 00007ffb1de4d8c4 [HelperMethodFrame_1OBJ: 000000b58dc7e088] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000B58DC7E1E0 00007ffa64f657b0 Opc.Ua.ServerStatusValue.OnRead_State(Opc.Ua.ISystemContext, Opc.Ua.NodeState, Opc.Ua.NumericRange, Opc.Ua.QualifiedName, System.Object ByRef, Opc.Ua.StatusCode ByRef, System.DateTime ByRef)
..........

Is deadlocking with thread:

OS Thread Id: 0x9fbc (22)
        Child SP               IP Call Site
000000B58D37E7D8 00007ffb1de4d8c4 [GCFrame: 000000b58d37e7d8] 
000000B58D37E948 00007ffb1de4d8c4 [GCFrame: 000000b58d37e948] 
000000B58D37E998 00007ffb1de4d8c4 [HelperMethodFrame_1OBJ: 000000b58d37e998] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000B58D37EAF0 00007ffa64f67631 Opc.Ua.Server.CustomNodeManager2.Find(Opc.Ua.NodeId)
000000B58D37EB60 00007ffa64ece404 Opc.Ua.Com.Client.ComClientNodeManager.DoMetadataUpdate(System.Object)
....

Because:

The Sync blocks:

Index SyncBlock        MonitorHeld Recursion Owning Thread Info       SyncBlock Owner
55    000001FAA8EFA788 2925        1         000001FA8F460C50 9fbc 22 000001fa8f9da558 System.Object
57    000001FAA8EFA828 17          1         000001FAAB644000 a164 15 000001fa8fa2a1c0 System.Object

The sync objects each thread is trying to acquire:

OS Thread Id: 0x9fbc (22)
RSP/REG          Object           Name
000000B58D37E7A8 000001fa8fa2a1c0 System.Object <=== trying to acquire

OS Thread Id: 0xa164 (15)
RSP/REG          Object           Name
000000B58DC7DE40 000001fa8f6207d0 System.DateTime
000000B58DC7DE98 000001fa8f9da558 System.Object <=== trying to acquire

Probable cause:

The code in (or around) the following method:

000000B58D37EB60 00007ffa64ece404 Opc.Ua.Com.Client.ComClientNodeManager.DoMetadataUpdate(System.Object)

has already acquired the lock that the following method is trying to acquire:

000000B58DC7E1E0 00007ffa64f657b0 Opc.Ua.ServerStatusValue.OnRead_State(Opc.Ua.ISystemContext, Opc.Ua.NodeState, Opc.Ua.NumericRange, Opc.Ua.QualifiedName, System.Object ByRef, Opc.Ua.StatusCode ByRef, System.DateTime ByRef)

If this is the case, avoiding the lock in the first method should fix the issue. I have to emphasize that the stacks might not be 100% correct due to the lack of PDBs in my analysis.
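
As a rough illustration of the pattern suggested by the sync blocks above (hypothetical names, not the actual library code), each thread owns one lock and waits for the other's, so neither can proceed and every later caller of either lock queues up behind them:

    // Sketch of a classic lock-order inversion, with hypothetical names.
    // MetadataUpdatePath mirrors thread 0x9fbc (22): it holds one lock and then
    // calls into the node manager, which needs the second lock. ReadStatePath
    // mirrors thread 0xa164 (15): it already holds the second lock (taken
    // earlier in the call chain) and then needs the first one.
    class DeadlockSketch
    {
        private readonly object _lockHeldByThread22 = new object(); // e.g. object 000001fa8f9da558
        private readonly object _lockHeldByThread15 = new object(); // e.g. object 000001fa8fa2a1c0

        public void MetadataUpdatePath() // ~ thread 0x9fbc (22): DoMetadataUpdate -> Find
        {
            lock (_lockHeldByThread22)
            {
                lock (_lockHeldByThread15) { /* Find(nodeId) */ }
            }
        }

        public void ReadStatePath() // ~ thread 0xa164 (15): OnRead_State
        {
            lock (_lockHeldByThread15)
            {
                lock (_lockHeldByThread22) { /* OnRead_State body */ }
            }
        }
    }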

vinaybr commented 2 years ago

@mrsuciu Thank you for your response. The DoMetadataUpdate method actually calls the Server.IsRunning property from the Opc.Ua.Server assembly, which acquires the lock. The same lock seems to be used elsewhere internally. I don't believe this is an issue with the calling code, but rather something internal to the Opc.Ua.Server assembly. Your thoughts?

mrsuciu commented 2 years ago

@vinaybr I had a look at the code:

        private void DoMetadataUpdate(object state)
        {
            try
            {
                if (!Server.IsRunning)
                {
                    return;
                }

                ComClientManager system = (ComClientManager)SystemContext.SystemHandle;
                ComClient client = (ComClient)system.SelectClient(SystemContext, true);

                int[] availableLocales = client.QueryAvailableLocales();

                if (availableLocales != null)
                {
                    lock (Server.DiagnosticsLock)     <=== The lock acquired by thread 0x9fbc (22) 
                    {
                        // check if the server is running.
                        if (!Server.IsRunning)
                        {
                            return;
                        }

                        // get the LocaleIdArray property.
                        BaseVariableState localeArray = Server.DiagnosticsNodeManager.Find(Opc.Ua.VariableIds.Server_ServerCapabilities_LocaleIdArray) as BaseVariableState;
                        ............
        }

In my opinion Server.DiagnosticsLock is the lock that has been acquired by thread 0x9fbc (22) and that thread 0xa164 (15) is trying to acquire, since thread 0x9fbc (22) is actually trying to execute the Opc.Ua.Server.CustomNodeManager2.Find(Opc.Ua.NodeId) method, which means that the Server.IsRunning statements have already been executed and have released their internal locks.

A short look through the code shows how Server.DiagnosticsLock reaches this lock:

Server.DiagnosticsLock points to ServerInternalData.m_dataLock, which can be passed into ServerStatusValue(ServerStatusState variable, ServerStatusDataType value, object dataLock) : base(dataLock), which sets the BaseVariableValue.m_lock value.

The OnRead_State method that thread 0xa164 (15) is trying to execute:

private ServiceResult OnRead_State(
            ISystemContext context,
            NodeState node,
            NumericRange indexRange,
            QualifiedName dataEncoding,
            ref object value,
            ref StatusCode statusCode,
            ref DateTime timestamp)
        {
            lock (Lock) <=== BaseVariableValue.m_lock
            {
                DoBeforeReadProcessing(context, node);

                if (m_value != null)
                {
                    value = m_value.State;
                }

                return Read(context, node, indexRange, dataEncoding, ref value, ref statusCode, ref timestamp);
            }
        }

tries to acquire the same lock.

In short, if I am right, the 'lock (Server.DiagnosticsLock)' statement in DoMetadataUpdate(object state) is contributing to the deadlock, and the thousands of stuck threads are just a consequence of the internal lock inside the Server.IsRunning property never being released because of that deadlock.
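
One possible shape of such a change (a sketch only, written against the members used in the quoted method, and not necessarily what the forked branch does) is to perform the node-manager lookup before taking the diagnostics lock, so that Server.DiagnosticsLock is never held while waiting for the node-manager lock:

    // Sketch: do the Find() lookup outside the diagnostics lock and only take
    // the lock for the final update. This removes the lock-order inversion
    // described above. Fragment in the style of DoMetadataUpdate; not a
    // drop-in patch.
    if (!Server.IsRunning)
    {
        return;
    }

    // Look up the node first, without holding Server.DiagnosticsLock.
    BaseVariableState localeArray = Server.DiagnosticsNodeManager.Find(
        Opc.Ua.VariableIds.Server_ServerCapabilities_LocaleIdArray) as BaseVariableState;

    if (localeArray == null)
    {
        return;
    }

    lock (Server.DiagnosticsLock)
    {
        // check if the server is running.
        if (!Server.IsRunning)
        {
            return;
        }

        // ... update localeArray from the available locales here ...
    }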

mrsuciu commented 2 years ago

@vinaybr I searched and found the address of the actual Server.DiagnosticsLock, which is exactly the one owned by thread 0x9fbc (22):

00007ffa639c0af0 40000e7 98 System.Object 0 instance 000001fa8f9da558 m_dataLock

This confirms what I stated above.

Can you check whether this forked version solves the deadlock issue? https://github.com/mrsuciu/UA-.NETStandard-Samples/tree/git1635

vinaybr commented 2 years ago

@mrsuciu I will have a look and get back to you. I will need to run it for a few days to see how it behaves over a period of time. Thanks for your reply.

vinaybr commented 2 years ago

@mrsuciu We still have the same issue, but now the server runs for a couple of days before it fails. The call stack from the dump now shows the issue to be in the SubscribeRequestManager. We tried this a few times and the issue appears to be the same. The call stack is:

000000E2D7F7EAE8 00007ff9a616d8c4 [GCFrame: 000000e2d7f7eae8] 
000000E2D7F7EC58 00007ff9a616d8c4 [GCFrame: 000000e2d7f7ec58] 
000000E2D7F7ECA8 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2d7f7eca8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000E2D7F7EE00 00007ff932812220 Opc.Ua.Com.Client.SubscribeRequestManager.Add(Opc.Ua.NodeState, Opc.Ua.Server.MonitoredItem)
000000E2D7F7EE60 00007ff9327faedc System.Threading.Tasks.Parallel+c__DisplayClass19_0`1[[System.__Canon, System.Private.CoreLib]].b__1(System.Threading.Tasks.RangeWorker ByRef, Int32, Boolean ByRef) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1213]
000000E2D7F7EF00 00007ff9327fa854 System.Threading.Tasks.TaskReplicator+Replica.Execute() [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs @ 98]
000000E2D7F7EF80 00007ff9327e0201 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 315]
000000E2D7F7EFD0 00007ff9327f96b3 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2421]
000000E2D7F7F080 00007ff9327a6711 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 699]
000000E2D7F7F500 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2d7f7f500] 

It looks like the system gets into this state during reconnects. I say this because there was another thread which had been running for some time and was stuck with this call stack:

000000E2BC3BEBB8 00007ff9a616d8c4 [GCFrame: 000000e2bc3bebb8] 
000000E2BC3BED28 00007ff9a616d8c4 [GCFrame: 000000e2bc3bed28] 
000000E2BC3BED78 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2bc3bed78] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000E2BC3BEED0 00007ff9328caa8e Opc.Ua.Com.Client.SubscribeRequestManager.RecreateItems()
000000E2BC3BEF30 00007ff9328ca8fe Opc.Ua.Com.Client.ComDaClientNodeManager.OnServerReconnected(System.Object)
000000E2BC3BEFA0 00007ff9321ee843 Opc.Ua.Com.Client.ComClientManager.OnStatusTimerExpired(System.Object)
000000E2BC3BF000 00007ff9327e0201 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 315]
000000E2BC3BF050 00007ff93272e587 System.Threading.TimerQueueTimer.CallCallback(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 649]
000000E2BC3BF0B0 00007ff93272e381 System.Threading.TimerQueueTimer.Fire(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 610]
000000E2BC3BF120 00007ff9327a6728 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 699]
000000E2BC3BF5A0 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2bc3bf5a0]

Please let us know if you have any suggestions.

mrsuciu commented 2 years ago

@vinaybr The situation you encountered could be caused either by my changes not having completely removed the previous deadlock, or by another deadlock/issue that surfaced once the previous deadlock was fixed; I would lean towards the latter. Since both of the stacks you provided look like they are trying to acquire a synchronization resource, this would indicate a possible deadlock again (not necessarily between the two new stacks you mentioned above). Did the running time improve until this situation occurred? Can you check whether this is another deadlock state and identify the cause, or can you provide the dump file with the new situation so I can try to identify it?

vinaybr commented 2 years ago

@mrsuciu The issue occurs randomly and it's hard to say whether the running time has improved. We ran it on multiple systems and saw it fail anywhere between 1 day and over 2 weeks. I am currently looking at the dump file to see if the locks can be handled better.

You can download the dump file from (download) in case you want to have a look. Any help you can offer would be welcome.

Thank you again for your continued help on this issue.

mrsuciu commented 2 years ago

@vinaybr I've had a look into the new dump and I can say the following:

Analysis of the dump file

There seems to be no obvious deadlock in this case, therefore the previous problem seems to have been fixed.

The lock object on which many other threads are waiting:

Index SyncBlock        MonitorHeld Recursion Owning Thread Info       SyncBlock Owner
1847  0000027AA1D38DE8 2533        1         0000027AE478D890 6f38 48 0000027a88e70bc0 System.Object

The following thread is owning the lock on which many other threads are waiting:

OS Thread Id: 0x6f38 (48)
        Child SP               IP Call Site
000000E2BE7BEB00 00007ff9a616d8c4 [GCFrame: 000000e2be7beb00] 
000000E2BE7BEC48 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2be7bec48] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
000000E2BE7BED70 00007ff93280b71a System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 626]
000000E2BE7BEE10 00007ff9327fe75b System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2941]
000000E2BE7BEE80 00007ff9327fe529 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2827]
000000E2BE7BEF00 00007ff9327fe31f System.Threading.Tasks.Task.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2794]
000000E2BE7BEF40 00007ff9327fcc23 System.Threading.Tasks.TaskReplicator.Run[[System.Threading.Tasks.RangeWorker, System.Threading.Tasks.Parallel]](ReplicatableUserAction`1<System.Threading.Tasks.RangeWorker>, System.Threading.Tasks.ParallelOptions, Boolean) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs @ 135]
000000E2BE7BEFA0 00007ff9327fc79c System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1259]
000000E2BE7BF060 00007ff9327fc35e System.Threading.Tasks.Parallel.ForEachWorker[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Threading.Tasks.ParallelOptions, System.Action`1<System.__Canon>, System.Action`2<System.__Canon,System.Threading.Tasks.ParallelLoopState>, System.Action`3<System.__Canon,System.Threading.Tasks.ParallelLoopState,Int64>, System.Func`4<System.__Canon,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`5<System.__Canon,System.Threading.Tasks.ParallelLoopState,Int64,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 2139]
000000E2BE7BF130 00007ff9327fbd1f System.Threading.Tasks.Parallel.ForEach[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1549]
000000E2BE7BF1C0 00007ff9327e4138 Opc.Ua.Com.Client.SubscribeRequestManager.CreateItems(Opc.Ua.Server.ServerSystemContext, System.Collections.Generic.IList`1<Opc.Ua.Server.IMonitoredItem>)
000000E2BE7BF250 00007ff9328cab78 Opc.Ua.Com.Client.SubscribeRequestManager.RecreateItems()
000000E2BE7BF2B0 00007ff9328ca8fe Opc.Ua.Com.Client.ComDaClientNodeManager.OnServerReconnected(System.Object)
000000E2BE7BF320 00007ff9321ee843 Opc.Ua.Com.Client.ComClientManager.OnStatusTimerExpired(System.Object)
000000E2BE7BF380 00007ff93272ab0c System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
000000E2BE7BF3F0 00007ff93272e5a8 System.Threading.TimerQueueTimer.CallCallback(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 649]
000000E2BE7BF450 00007ff93272e381 System.Threading.TimerQueueTimer.Fire(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 610]
000000E2BE7BF4C0 00007ff93272f22a System.Threading.TimerQueue.FireNextTimers() [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 286]
000000E2BE7BF910 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2be7bf910]

Interestingly this thread had 0 user mode process time:

48:6f38 0 days 0:00:00.000

There are indeed 1155 threads with the following call stack, waiting for the lock owned by OS Thread Id: 0x6f38 (48):

000000E2E997E148 00007ff9a616d8c4 [GCFrame: 000000e2e997e148] 
000000E2E997E2B8 00007ff9a616d8c4 [GCFrame: 000000e2e997e2b8] 
000000E2E997E308 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2e997e308] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000E2E997E460 00007ff932812220 Opc.Ua.Com.Client.SubscribeRequestManager.Add(Opc.Ua.NodeState, Opc.Ua.Server.MonitoredItem)
000000E2E997E4C0 00007ff9327faedc System.Threading.Tasks.Parallel+c__DisplayClass19_0`1[[System.__Canon, System.Private.CoreLib]].b__1(System.Threading.Tasks.RangeWorker ByRef, Int32, Boolean ByRef) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1213]
000000E2E997E560 00007ff9327fa854 System.Threading.Tasks.TaskReplicator+Replica.Execute() [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs @ 98]
000000E2E997E5E0 00007ff9327e0201 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 315]
000000E2E997E630 00007ff9327f96b3 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2421]
000000E2E997E6E0 00007ff9327a6711 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 699]
000000E2E997EB60 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2e997eb60]

There are 103 threads with the following call stack (similar to thread 48) which are also waiting for the above-mentioned lock object owned by OS Thread Id: 0x6f38 (48):

000000E2CADBF208 00007ff9a616d8c4 [GCFrame: 000000e2cadbf208] 
000000E2CADBF378 00007ff9a616d8c4 [GCFrame: 000000e2cadbf378] 
000000E2CADBF3C8 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2cadbf3c8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000E2CADBF520 00007ff9328caa8e Opc.Ua.Com.Client.SubscribeRequestManager.RecreateItems()
000000E2CADBF580 00007ff9328ca8fe Opc.Ua.Com.Client.ComDaClientNodeManager.OnServerReconnected(System.Object)
000000E2CADBF5F0 00007ff9321ee843 Opc.Ua.Com.Client.ComClientManager.OnStatusTimerExpired(System.Object)
000000E2CADBF650 00007ff9327e0201 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 315]
000000E2CADBF6A0 00007ff93272e587 System.Threading.TimerQueueTimer.CallCallback(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 649]
000000E2CADBF700 00007ff93272e381 System.Threading.TimerQueueTimer.Fire(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Timer.cs @ 610]
000000E2CADBF770 00007ff9327a6728 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 699]
000000E2CADBFBF0 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2cadbfbf0]

OS Thread Id: 0x69b0 (13) and OS Thread Id: 0x6960 (11) are processing the request queues and also end up calling into the same instance of Opc.Ua.Com.Client.SubscribeRequestManager as thread 48 does. Their stacks are the same (here is the stack for thread 11):

OS Thread Id: 0x6960 (11)
        Child SP               IP Call Site
000000E2BE1BD8B8 00007ff9a616d8c4 [GCFrame: 000000e2be1bd8b8] 
000000E2BE1BDA28 00007ff9a616d8c4 [GCFrame: 000000e2be1bda28] 
000000E2BE1BDA78 00007ff9a616d8c4 [HelperMethodFrame_1OBJ: 000000e2be1bda78] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000E2BE1BDBD0 00007ff932812220 Opc.Ua.Com.Client.SubscribeRequestManager.Add(Opc.Ua.NodeState, Opc.Ua.Server.MonitoredItem)
000000E2BE1BDC30 00007ff9327faedc System.Threading.Tasks.Parallel+c__DisplayClass19_0`1[[System.__Canon, System.Private.CoreLib]].b__1(System.Threading.Tasks.RangeWorker ByRef, Int32, Boolean ByRef) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1213]
000000E2BE1BDCD0 00007ff9327fa854 System.Threading.Tasks.TaskReplicator+Replica.Execute() [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs @ 98]
000000E2BE1BDD50 00007ff93272ab0c System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
000000E2BE1BDDC0 00007ff9327f9694 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2421]
000000E2BE1BDE70 00007ff9327fb548 System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/ThreadPoolTaskScheduler.cs @ 85]
000000E2BE1BDEC0 00007ff9327fb466 System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/TaskScheduler.cs @ 177]
000000E2BE1BDF00 00007ff9327fce70 System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 1051]
000000E2BE1BDF50 00007ff9327fcc0b System.Threading.Tasks.TaskReplicator.Run[[System.Threading.Tasks.RangeWorker, System.Threading.Tasks.Parallel]](ReplicatableUserAction`1<System.Threading.Tasks.RangeWorker>, System.Threading.Tasks.ParallelOptions, Boolean) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs @ 135]
000000E2BE1BDFB0 00007ff9327fc79c System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1259]
000000E2BE1BE070 00007ff9327fc35e System.Threading.Tasks.Parallel.ForEachWorker[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Threading.Tasks.ParallelOptions, System.Action`1<System.__Canon>, System.Action`2<System.__Canon,System.Threading.Tasks.ParallelLoopState>, System.Action`3<System.__Canon,System.Threading.Tasks.ParallelLoopState,Int64>, System.Func`4<System.__Canon,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`5<System.__Canon,System.Threading.Tasks.ParallelLoopState,Int64,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 2139]
000000E2BE1BE140 00007ff9327fbd1f System.Threading.Tasks.Parallel.ForEach[[System.__Canon, System.Private.CoreLib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.Action`1<System.__Canon>) [/_/src/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs @ 1549]
000000E2BE1BE1D0 00007ff9327e4138 Opc.Ua.Com.Client.SubscribeRequestManager.CreateItems(Opc.Ua.Server.ServerSystemContext, System.Collections.Generic.IList`1<Opc.Ua.Server.IMonitoredItem>)
000000E2BE1BE260 00007ff9327e3f91 Opc.Ua.Com.Client.ComDaClientNodeManager.OnCreateMonitoredItemsComplete(Opc.Ua.Server.ServerSystemContext, System.Collections.Generic.IList`1<Opc.Ua.Server.IMonitoredItem>)
000000E2BE1BE2D0 00007ff9327972cd Opc.Ua.Server.CustomNodeManager2.CreateMonitoredItems(Opc.Ua.Server.OperationContext, UInt32, Double, Opc.Ua.TimestampsToReturn, System.Collections.Generic.IList`1<Opc.Ua.MonitoredItemCreateRequest>, System.Collections.Generic.IList`1<Opc.Ua.ServiceResult>, System.Collections.Generic.IList`1<Opc.Ua.MonitoringFilterResult>, System.Collections.Generic.IList`1<Opc.Ua.Server.IMonitoredItem>, Int64 ByRef)
000000E2BE1BE3A0 00007ff93278deef Opc.Ua.Server.MasterNodeManager.CreateMonitoredItems(Opc.Ua.Server.OperationContext, UInt32, Double, Opc.Ua.TimestampsToReturn, System.Collections.Generic.IList`1<Opc.Ua.MonitoredItemCreateRequest>, System.Collections.Generic.IList`1<Opc.Ua.ServiceResult>, System.Collections.Generic.IList`1<Opc.Ua.MonitoringFilterResult>, System.Collections.Generic.IList`1<Opc.Ua.Server.IMonitoredItem>)
000000E2BE1BE480 00007ff93278d446 Opc.Ua.Server.Subscription.CreateMonitoredItems(Opc.Ua.Server.OperationContext, Opc.Ua.TimestampsToReturn, Opc.Ua.MonitoredItemCreateRequestCollection, Opc.Ua.MonitoredItemCreateResultCollection ByRef, Opc.Ua.DiagnosticInfoCollection ByRef)
000000E2BE1BE5C0 00007ff93278cff6 Opc.Ua.Server.SubscriptionManager.CreateMonitoredItems(Opc.Ua.Server.OperationContext, UInt32, Opc.Ua.TimestampsToReturn, Opc.Ua.MonitoredItemCreateRequestCollection, Opc.Ua.MonitoredItemCreateResultCollection ByRef, Opc.Ua.DiagnosticInfoCollection ByRef)
000000E2BE1BE650 00007ff93278ccc9 Opc.Ua.Server.StandardServer.CreateMonitoredItems(Opc.Ua.RequestHeader, UInt32, Opc.Ua.TimestampsToReturn, Opc.Ua.MonitoredItemCreateRequestCollection, Opc.Ua.MonitoredItemCreateResultCollection ByRef, Opc.Ua.DiagnosticInfoCollection ByRef)
000000E2BE1BE720 00007ff93278ca67 Opc.Ua.SessionEndpoint.CreateMonitoredItems(Opc.Ua.IServiceRequest)
000000E2BE1BE7D0 00007ff9327dae16 Opc.Ua.EndpointBase+ProcessRequestAsyncResult.OnProcessRequest(System.Object)
000000E2BE1BE810 00007ff93274313a Opc.Ua.ServerBase+RequestQueue.OnProcessRequestQueue(System.Object)
000000E2BE1BE880 00007ff990741efd System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 48]
000000E2BE1BE8B0 00007ff93272ab0c System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
000000E2BE1BE920 00007ff990741e8c System.Threading.ThreadHelper.ThreadStart(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 76]
000000E2BE1BEB38 00007ff990db6c93 [GCFrame: 000000e2be1beb38] 
000000E2BE1BEDA0 00007ff990db6c93 [DebuggerU2MCatchHandlerFrame: 000000e2be1beda0] 

Their user mode processor times indicate they are fairly active:

11:6960     0 days 0:20:43.859
13:69b0     0 days 0:14:55.140

My guess so far is that the server component threads OS Thread Id: 0x69b0 (13) and OS Thread Id: 0x6960 (11) are busy creating monitored items, a process which ends up calling the same Opc.Ua.Com.Client.SubscribeRequestManager.CreateItems that OS Thread Id: 0x6f38 (48) tries to call, but thread 48 never actually manages to execute because it cannot acquire the m_monitoredItems dictionary resource of Opc.Ua.Com.Client.SubscribeRequestManager and operate on it.
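
For reference, the contention pattern described here looks roughly like the following (hypothetical names, not the actual SubscribeRequestManager code): Parallel.ForEach fans the items out to many worker threads, but every worker funnels through one lock protecting the shared dictionary, so under load a competing caller such as RecreateItems can wait on that lock for a very long time:

    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Sketch of the contention pattern: many parallel Add() calls serialize
    // on a single lock guarding a shared dictionary, starving other callers.
    class SubscribeSketch
    {
        private readonly object _lock = new object();
        private readonly Dictionary<int, string> _monitoredItems = new Dictionary<int, string>();

        public void CreateItems(IList<int> ids)
        {
            // Many worker threads, all funneled through one lock in Add().
            Parallel.ForEach(ids, id => Add(id, "item" + id));
        }

        private void Add(int id, string item)
        {
            lock (_lock) // all parallel workers queue up here
            {
                _monitoredItems[id] = item;
            }
        }

        public void RecreateItems()
        {
            lock (_lock) // competes with every Add() above
            {
                _monitoredItems.Clear();
            }
        }
    }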

It could also be that there is a race condition which somehow invalidates the m_monitoredItems dictionary of Opc.Ua.Com.Client.SubscribeRequestManager and leads to an apparent deadlock (for example, an infinite loop while operating on it, or some other invalid, undefined state).

In any case, the triggering of the OnStatusTimerExpired callback on thread 48 and on another 103 threads indicates that the server is in trouble processing many requests and is not able to respond in a timely manner. The OnStatusTimerExpired method triggers the RecreateItems call when the following condition is true:

`if (m_lastStatusUpdate.AddMilliseconds(m_statusUpdateInterval*1.1) < DateTime.UtcNow)`

Since m_statusUpdateInterval is set to 10000 ms, the calculated threshold is 11000 ms, so if the timer fires more than 11 seconds after the last status update, the reconnect process is triggered; from the above it looks like it has been triggered 103 times + 1 (thread 48).

Possible race condition ?

After a brief look on the OnStatusTimerExpired code, the following looks a bit troublesome to me:

ComClientManager:line 321:

         // dispose existing client in the background in case it blocks.
                ThreadPool.QueueUserWorkItem(OnDisposeClient, m_defaultClient);

ComClientManager:line 335:

                // create a new client.
                m_defaultClient = CreateClient();
                m_defaultClient.CreateInstance();   

Correct me if I am wrong, but as I see it this might lead to a disposed m_defaultClient being used immediately after (or during) the execution of the statements at line 335, since there is no guarantee that the statement at line 321 has finished before the ones at line 335 run.
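
A sketch of one way to narrow that window (hypothetical types and member names, not the actual ComClientManager code): capture the old client in a local variable, publish the replacement first, and only then queue the old instance for background disposal, so the shared field never points at an object that is about to be disposed:

    using System;
    using System.Threading;

    class ClientManagerSketch
    {
        private IDisposable m_defaultClient;

        private IDisposable CreateClient() => new DummyClient();

        private void OnDisposeClient(object state) => ((IDisposable)state)?.Dispose();

        public void RecreateClient()
        {
            IDisposable oldClient = m_defaultClient;

            // create and publish the replacement before anything is disposed.
            m_defaultClient = CreateClient();

            // dispose the previous client in the background in case it blocks.
            if (oldClient != null)
            {
                ThreadPool.QueueUserWorkItem(OnDisposeClient, oldClient);
            }
        }

        private sealed class DummyClient : IDisposable
        {
            public void Dispose() { /* release the COM resources here */ }
        }
    }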

Possible solution to this situation

If the server is indeed bogged down by processing many requests (for example, creating monitored items), cannot handle them in a timely manner, and as a result cannot update the status at the 10000 ms interval because of lock contention under high stress, I would try increasing the m_statusUpdateInterval value or the 1.1 constant multiplier and see whether that improves the runtime, or try to lower the pace at which requests are generated.
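
For reference, the check being tuned is just a staleness test; a small sketch of the arithmetic (field names copied from the condition above, otherwise hypothetical):

    using System;

    // Sketch of the staleness check driving the reconnect decision. With
    // m_statusUpdateInterval = 10000 ms and a 1.1 tolerance factor, the
    // reconnect path fires whenever more than 11000 ms have passed since the
    // last status update. Raising either value widens the window before
    // RecreateItems is triggered.
    class StatusWatchdogSketch
    {
        private DateTime m_lastStatusUpdate = DateTime.UtcNow;
        private int m_statusUpdateInterval = 10000; // milliseconds
        private double m_tolerance = 1.1;           // 10000 * 1.1 = 11000 ms

        public bool ReconnectRequired()
        {
            return m_lastStatusUpdate.AddMilliseconds(m_statusUpdateInterval * m_tolerance) < DateTime.UtcNow;
        }
    }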

I would also look for possible race conditions which might lead to undefined behavior that resembles a deadlock.

vinaybr commented 2 years ago

@mrsuciu Thank you for the detailed response. I have been working on something else for the last few days. I will share my updates ASAP.

mregen commented 2 years ago

@vinaybr Please close if the issue is resolved.