Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.
MIT License
29 stars 11 forks source link

Microsoft Ticket 120100224002876: Communication issue between head and compute nodes #11

Open webstc2 opened 3 years ago

webstc2 commented 3 years ago

Problem Description

Raising ticket here per request from Sudarshan Govindaraghavan sugovi@microsoft.com of Microsoft Support.

Per Sudarshan's question in the most recent communication below. The LastLogRefresh is a scheduled task that is created via group policy and is running a command from the KiXtart application. I have replaced the actual path with "path_to_file" for berevity. "C:\windows\wkix32.exe" "path_to_file\LastLog.kix" /i /refresh

Most recent response from Sudarshan regarding this issue:


Hey Cody,

Thank you for your patience. I am sharing the current status on this situation. At this point, I have completed the analysis from my side and I’m waiting for HPC Engineering teams to comment on this behavior within your cluster. The snippets, details & the logs uploaded by you have been shared with the HPC dev team. Once they have some insights/actionable info I will reply back on this thread and update you. For Job 514, we are seeing a disconnect between the logging at CN and HN when it comes to the task that failed – Job 514, Task 14605 (See end of the analysis).

I wanted to check if this event - LastLogRefresh highlighted below is something that’s logged by your job or some other component within your infrastructure. I checked from HPC Pack perspective and this doesn’t look like something that’s logged by us.

UNCWHPCCD009 (CN which disconnected), UNCWHPCHD001 (Head node) Failed stage out: 3/11/2021 2:05:43 PM CST - 3/11/2021 2:05:43 PM CST => 3/11/2021 6:05:43 PM UTC - 3/11/2021 6:05:43 PM UTC

UNCWHPCCD009 Last 5 System Shutdowns –

2/12/2021 11:01:38 PM 1074 Information The process C:\windows\System32\svchost.exe (UNCWHPCCD009) has initiated the shutdown of computer UNCWHPCCD009 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Other (Planned) Reason Code: 0x80000000 Shutdown Type: shutdown Comment: C

2/12/2021 10:47:19 PM 1074 Information The process C:\windows\System32\svchost.exe (UNCWHPCCD009) has initiated the shutdown of computer UNCWHPCCD009 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Other (Planned) Reason Code: 0x80000000 Shutdown Type: shutdown Comment: C

2/12/2021 10:44:42 PM 1074 Information The process C:\windows\System32\svchost.exe (UNCWHPCCD009) has initiated the shutdown of computer UNCWHPCCD009 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Other (Planned) Reason Code: 0x80000000 Shutdown Type: shutdown Comment: C

2/12/2021 10:41:45 PM 1074 Information The process C:\windows\System32\svchost.exe (UNCWHPCCD009) has initiated the shutdown of computer UNCWHPCCD009 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Other (Planned) Reason Code: 0x80000000 Shutdown Type: shutdown Comment: C

2/12/2021 7:00:19 PM 1074 Information The process C:\windows\System32\svchost.exe (UNCWHPCCD009) has initiated the shutdown of computer UNCWHPCCD009 on behalf of user NT AUTHORITY\SYSTEM for the following reason: Other (Planned) Reason Code: 0x80000000 Shutdown Type: shutdown Comment: C

UNCWHPCCD009 logs – ..\logs_20210311\UNCWHPCCP009\LogFiles\Scheduler\HpcNodeManager_000000.log

D09, HpcNodeManager – Exits

   Line 54: i,02/10/2021 18:57:01.653, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="5432" Tid="5456" TS="0x01d6ffde8401bdec" String1="Cosmos Logger is being closed" 
   Line 101: i,02/10/2021 19:15:46.091, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2260" Tid="3452" TS="0x01d6ffe122394b49" String1="Cosmos Logger is being closed" 
   Line 135: i,02/10/2021 20:37:00.044, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2440" Tid="3196" TS="0x01d6ffec7b53864e" String1="Cosmos Logger is being closed" 
   Line 162: i,02/10/2021 20:37:31.883, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="1324" Tid="2108" TS="0x01d6ffec8e4dce01" String1="Cosmos Logger is being closed" 
   Line 1936: i,02/12/2021 12:59:09.096, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2836" Tid="2656" TS="0x01d7013eda30f886" String1="Cosmos Logger is being closed" 
   Line 2095: i,02/12/2021 18:45:46.506, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2692" Tid="3840" TS="0x01d7016f4669ff9e" String1="Cosmos Logger is being closed" 
   Line 2122: i,02/12/2021 18:49:19.951, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2408" Tid="3480" TS="0x01d7016fc5a2ff5c" String1="Cosmos Logger is being closed" 
   Line 2179: i,02/12/2021 19:00:19.976, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2428" Tid="3516" TS="0x01d701714f0af712" String1="Cosmos Logger is being closed" 
   Line 2244: i,02/12/2021 22:42:54.156, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2356" Tid="3564" TS="0x01d7019066c10cbd" String1="Cosmos Logger is being closed" 
   Line 2271: i,02/12/2021 22:46:34.287, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2564" Tid="3140" TS="0x01d70190e9f66ff1" String1="Cosmos Logger is being closed" 
   Line 2298: i,02/12/2021 22:50:11.492, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2496" Tid="3512" TS="0x01d701916b6d349a" String1="Cosmos Logger is being closed" 
   Line 2355: i,02/12/2021 23:01:38.942, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="2412" Tid="3476" TS="0x01d70193052dc4ec" String1="Cosmos Logger is being closed"

D09, HpcNodeManager – Job start and ends –

   Line 2380: i,03/11/2021 19:28:08.977, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="3972" TS="0x01d716acaaff7c4b" String1="Start Task JobId 514, TaskId 14438." 
   Line 2397: i,03/11/2021 19:28:49.545, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="4544" TS="0x01d716acc32da96f" String1="Start Task JobId 514, TaskId 14293." 
   Line 2440: i,03/11/2021 19:52:44.286, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="2140" TS="0x01d716b01a59b5e0" String1="Start Task JobId 514, TaskId 14294."

   Line 2395: i,03/11/2021 19:28:49.451, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="5840" TS="0x01d716acc31f5b47" String1="Job 514, Task 14438, returned Exit Code 0" 
   Line 2438: i,03/11/2021 19:52:44.145, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="876" TS="0x01d716b01a444112" String1="Job 514, Task 14293, returned Exit Code 1" 
   Line 2468: i,03/11/2021 20:05:24.540, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="4880" TS="0x01d716b1df7f55b3" String1="Job 514, Task 14294, returned Exit Code 0"

Here is a execution time plot for lines between 19:28 and 20:05 (Line 2379 to Line 2469) from HpcNodeManager log on the compute node.

HN Logs for these events -

Node D09 for Task 14438 (..\logs_20210311\UNCWHPCHD001\HpcScheduler_AA_000104.log) –

Line 329502: i,03/11/2021 19:28:07.496, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acaa1d86d5" String1="[RC] Job 514, Task 14438, Resource 2393, Node UNCWHPCCD009, Task 514.14438 will start on node UNCWHPCCD009 with taskRequeueCount 0. JobDispatched = False" Line 329503: i,03/11/2021 19:28:07.496, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acaa1d86d5" String1="[RC] Resource 2393, Node UNCWHPCCD009, [StartDispatchTask] Node state changed to JobTaskDispatched." Line 330434: i,03/11/2021 19:28:09.291, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acab2f69b2" String1="[RC] Resource 2393, Node UNCWHPCCD009, [TaskRunning] Node state changed to TaskRunning." Line 330447: d,03/11/2021 19:28:09.322, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acab3424bc" String1="[RC] Job 514, Task 14438, Resource 2393, Node UNCWHPCCD009, [JobMonitor.OnJobTaskStartedOnNode] JobTask succeeded to start on node UNCWHPCCD009" Line 331711: d,03/11/2021 19:28:12.732, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acad3c7f88" String1="[RC] Job 514, Task 14438, Resource 0, [40] MonitorThread successfully run task monitor."
Line 690644: i,03/11/2021 19:28:49.456, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716acc3200d3c" String1="[RC] Resource 2393, Node UNCWHPCCD009, [TaskClosed] Node state changed to ReadyForTask."

Node D09 for Task 14293 –

..\logs_20210311\UNCWHPCHD001\HpcScheduler_AA_000104.log –

Line 691273: i,03/11/2021 19:28:49.541, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716acc32d0560" String1="[RemotingCommunicator] Job 514, Task 14293, Resource 0, Node UNCWHPCCD009, NodeController.StartTask: Begin to start task 514.14293 on node UNCWHPCCD009" Line 691587: i,03/11/2021 19:28:49.561, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="14580" TS="0x01d716acc33012f1" String1="[RemotingCommunicator] Resource 0, Node UNCWHPCCD009, NodeController.EndCall: Start task 514.14293 succeeded" Line 691594: d,03/11/2021 19:28:49.562, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716acc33039b0" String1="[RC] Job 514, Task 14293, Resource 0, [TaskMonitor.UpdateTaskState] Task 514.14293 State Running previous State Dispatching" Line 691595: i,03/11/2021 19:28:49.562, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716acc33039b0" String1="[RC] Job 514, Task 14293, Resource 0, Task 514.14293 state Running onTaskStarted" Line 691596: d,03/11/2021 19:28:49.562, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716acc33039b0" String1="[RC] Job 514, Task 14293, Resource 0, [37] MonitorThread successfully run task monitor."

..\logs_20210311\UNCWHPCHD001\HpcScheduler_AA_000105.log –

Line 928209: i,03/11/2021 19:52:44.149, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b01a44e44d" String1="[RC] Job 514, Task 14293, Resource 0, Task 514.14293' state is changed to Finished from Finishing. Requeue = False" Line 928210: d,03/11/2021 19:52:44.149, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b01a44e44d" String1="[RC] Job 514, Task 14293, Resource 0, [TaskMonitor.UpdateTaskState] Task 514.14293 State Finished previous State Finishing" Line 928193: i,03/11/2021 19:52:44.148, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="3096" TS="0x01d716b01a44bd0a" String1="[RemotingCommunicator] Job 514, Task 14293, Resource 0, Node UNCWHPCCD009, NodeListener.OnTaskCompletion: Node UNCWHPCCD009 report task 514.14293 completed with code 1" Line 928195: i,03/11/2021 19:52:44.148, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="3096" TS="0x01d716b01a44bd0a" String1="[RemotingCommunicator] Job 514, Task 14293, Resource 0, Node UNCWHPCCD009, RemotingNodeCommunicator.OnComputeNodeTaskCompleted" Line 928254: d,03/11/2021 19:52:44.187, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b01a4ab095" String1="[RC] Job 514, Task 14293, Resource 0, [JobMonitor.OnTaskFinished] TaskFinished event received. Task state is Finished"

Node D09 for Task 14294 –

            ..\logs_20210311\UNCWHPCHD001\HpcScheduler_AA_000105.log

Line 928329: i,03/11/2021 19:52:44.239, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b01a52a3f2" String1="[RemotingCommunicator] Job 514, Task 14294, Resource 0, Node UNCWHPCCD009, NodeController.StartTask: Begin to start task 514.14294 on node UNCWHPCCD009" Line 928353: d,03/11/2021 19:52:44.328, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b01a603a45" String1="[RC] Job 514, Task 14294, Resource 0, Task 514.14294 processevent TaskStartedOnNode" Line 928357: i,03/11/2021 19:52:44.328, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b01a603a45" String1="[RC] Job 514, Task 14294, Resource 0, Task 514.14294 state Running onTaskStarted"

..\logs_20210311\UNCWHPCHD001\HpcScheduler_AA_000106.log Line 44241: i,03/11/2021 20:05:24.538, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="4532" TS="0x01d716b1df7efa82" String1="[RemotingCommunicator] Job 514, Task 14294, Resource 0, Node UNCWHPCCD009, NodeListener.OnTaskCompletion: Node UNCWHPCCD009 report task 514.14294 completed with code 0" Line 44255: d,03/11/2021 20:05:24.539, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b1df7f21b3" String1="[RC] Job 514, Task 14294, Resource 0, [TaskMonitor.UpdateTaskState] Task 514.14294 State Finished previous State Finishing" Line 44279: d,03/11/2021 20:05:24.573, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8924" TS="0x01d716b1df8451b8" String1="[RC] Job 514, Task 14294, Resource 0, [37] Task 514.14294 removed from global list"

Exception recorded on the Head node for Job 514. There is no mention of this task (14605] on the node manager log on D09. –

i,03/11/2021 20:05:43.558, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead5362c" String1="[RC] Job 514, Task 14605, Resource 2393, Node UNCWHPCCD009, [TaskMonitor.OnTaskStartedOnNode] Task monitor 514.14605 received an exception on TaskStartedOnNode from node UNCWHPCCD009 Exception System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ClientFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state.....Server stack trace: .. at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen().. at System.ServiceModel.Channels.OutputChannel.BeginSend(Message message, TimeSpan timeout, AsyncCallback callback, Object state).. at System.ServiceModel.Dispatcher.DuplexChannelBinder.BeginRequest(Message message, TimeSpan timeout, AsyncCallback callback, Object state).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartSend(Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureOpen(IAsyncResult result, Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureOpen(Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureInteractiveInit().. at System.ServiceModel.Channels.ServiceChannel.BeginCall(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, TimeSpan timeout, AsyncCallback callback, Object asyncState).. at System.ServiceModel.Channels.ServiceChannel.BeginCall(ServiceChannel channel, ProxyOperationRuntime operation, Object[] ins, AsyncCallback callback, Object asyncState).. at System.Threading.Tasks.TaskFactory1.FromAsyncImpl[TArg1,TArg2,TArg3](Func6 beginMethod, Func2 endFunction, Action1 endAction, TArg1 arg1, TArg2 arg2, TArg3 arg3, Object state, TaskCreationOptions creationOptions).. at System.ServiceModel.Channels.ServiceChannelProxy.TaskCreator.CreateTask(ServiceChannel channel, ProxyOperationRuntime operation, Object[] inputParameters).. at System.ServiceModel.Channels.ServiceChannelProxy.InvokeTaskService(IMethodCallMessage methodCall, ProxyOperationRuntime operation).. at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)....Exception rethrown at [0]: .. at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg).. at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type).. at Microsoft.Hpc.Activation.INodeManagerServiceAsync.StartTaskAsync(Int32 jobId, Int32 taskId, ProcessStartInfo startInfo).. at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeManagerServiceProxy.<>c__DisplayClass9_0.<b0>d.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.WcfReliableClient`1.d9.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at Microsoft.Hpc.WcfReliableClient1.<InvokeOperationWithRetryAsync>d__9.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.WcfReliableClient1.d8.MoveNext()..--- End of stack trace from previous location where exception was thrown ---.. at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw().. at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task).. at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeManagerServiceProxy.d9.MoveNext()" d,03/11/2021 20:05:43.559, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead55b5d" String1="[RC] Job 514, Task 14605, Resource 0, [TaskMonitor.UpdateTaskState] Task 514.14605 State Finishing previous State Dispatching" .. d,03/11/2021 20:05:43.596, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b1eadb00cf" String1="[RC] Job 514, Task 14605, Resource 0, [JobMonitor.OnTaskFinished] TaskFinished event received. Task state is Failed"

The machine is online at this time. I can see events in the system event log, Application Log and also within the Azure Diagnostics heart beat log. –

I don’t see any crashes related to Node Manager within application events. But there is a break between 7:47:06 PM and 8:13:26 PM on 3/11/2021

Log Name: Application Source: LastLogRefresh Date: 3/11/2021 7:47:06 PM Event ID: 2003 Task Category: None Level: Information Keywords: Classic User: N/A Computer: http://UNCWHPCCD009.mw.na.cat.com Description: The description for Event ID 2003 from source LastLogRefresh cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer. If the event originated on another computer, the display information had to be saved with the event. The following information was included with the event: LastLog completed successfully by UNCWHPCCD009\UNCWHPCCD009$

There was no mention of that task on the CN’s Node manager –

(Lines that ref Job 514 on HpcNodeManager from CN) Line 2381: i,03/11/2021 19:28:08.993, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="2492" Tid="3972" TS="0x01d716acab01dee7" String1="[NM] Check if job 514 task 14438 needs restart." Line 2395: i,03/11/2021 19:28:49.451, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="5840" TS="0x01d716acc31f5b47" String1="Job 514, Task 14438, returned Exit Code 0" Line 2398: i,03/11/2021 19:28:49.545, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="2492" Tid="4544" TS="0x01d716acc32da96f" String1="[NM] Check if job 514 task 14293 needs restart." Line 2438: i,03/11/2021 19:52:44.145, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="876" TS="0x01d716b01a444112" String1="Job 514, Task 14293, returned Exit Code 1" Line 2441: i,03/11/2021 19:52:44.286, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="2492" Tid="2140" TS="0x01d716b01a59b5e0" String1="[NM] Check if job 514 task 14294 needs restart." Line 2468: i,03/11/2021 20:05:24.540, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="4880" TS="0x01d716b1df7f55b3" String1="Job 514, Task 14294, returned Exit Code 0" Line 2471: i,03/11/2021 20:05:43.634, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="2492" Tid="3472" TS="0x01d716b1eae0e5ae" String1="The head node requested a cancel of job 514"

However, on the HN we see this –

Line 44336: d,03/11/2021 20:05:24.637, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b1df8e15f2" String1="[RC] Job 514, Task 14605, Resource 0, About to start monitor task." Line 44337: i,03/11/2021 20:05:24.637, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b1df8e15f2" String1="[RC] Job 514, Task 14605, Resource 0, Start monitor task on thread 38" Line 44340: d,03/11/2021 20:05:24.637, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b1df8e15f2" String1="[RC] Job 514, Task 14605, Resource 0, Task Monitor for 514.14605 added to runningtaskmonitors" Line 44350: i,03/11/2021 20:05:24.645, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1df8f4e5d" String1="[RC] Job 514, Task 14605, Resource 2393, Node UNCWHPCCD009, Task 514.14605 will start on node UNCWHPCCD009 with taskRequeueCount 0. JobDispatched = True" Line 44352: d,03/11/2021 20:05:24.645, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1df8f4e5d" String1="[RC] Job 514, Task 14605, Resource 0, [38] MonitorThread successfully run task monitor." Line 44353: i,03/11/2021 20:05:24.645, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1df8f4e5d" String1="[RemotingCommunicator] Job 514, Task 14605, Resource 0, Node UNCWHPCCD009, RemotingNodeCommunicator.StartTask" Line 44354: i,03/11/2021 20:05:24.645, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1df8f4e5d" String1="[RemotingCommunicator] Job 514, Task 14605, Resource 0, Node UNCWHPCCD009, NodeController.StartTask: Begin to start task 514.14605 on node UNCWHPCCD009"

Even the monitor thread starts off on HN side. We get the failure ~20s later on HN side but there is no mention of it on CN’s logs.

Line 49780: d,03/11/2021 20:05:43.558, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead5362c" String1="[RC] Job 514, Task 14605, Resource 0, Task 514.14605 processevent TaskStartedOnNode" Line 49781: i,03/11/2021 20:05:43.558, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead5362c" String1="[RC] Job 514, Task 14605, Resource 2393, Node UNCWHPCCD009, [TaskMonitor.OnTaskStartedOnNode] Task monitor 514.14605 received an exception on TaskStartedOnNode from node UNCWHPCCD009 Exception System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ClientFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state.....Server stack trace: .. at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen().. at System.ServiceModel.Channels.OutputChannel.BeginSend(Message message, TimeSpan timeout, AsyncCallback callback, Object state).. at System.ServiceModel.Dispatcher.DuplexChannelBinder.BeginRequest(Message message, TimeSpan timeout, AsyncCallback callback, Object state).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartSend(Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureOpen(IAsyncResult result, Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureOpen(Boolean completedSynchronously).. at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureInteractiveInit().. at System.ServiceModel.Channels.ServiceChannel.BeginCall(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, TimeSpan timeout, AsyncCallback callback, Object asyncState).. at System.ServiceModel.Channels.ServiceChannel.BeginCall(ServiceChannel channel, ProxyOperationRuntime operation, Object[] ins, AsyncCallback callback, Object asyncState).. at System.Threading.Tasks.TaskFactory1.FromAsyncImpl[TArg1,TArg2,TArg3](Func6 beginMethod, Func2 endFunction, Action1 endAction, TArg1 arg1, TArg2 arg2, TArg3 arg3, Object state, TaskCreationOptions creationOptions).. at System.ServiceModel.Channels.ServiceChannelP ...

Line 49789: d,03/11/2021 20:05:43.559, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead55b5d" String1="[RC] Job 514, Task 14605, Resource 0, [TaskMonitor.UpdateTaskState] Task 514.14605 State Failed previous State Finishing" Line 49790: i,03/11/2021 20:05:43.559, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead55b5d" String1="[RC] Job 514, Task 14605, Resource 0, Task 514.14605 state Failed onTaskEnded" Line 49791: d,03/11/2021 20:05:43.559, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1ead55b5d" String1="[RC] Job 514, Task 14605, Resource 0, [38] MonitorThread successfully run task monitor." Line 49795: d,03/11/2021 20:05:43.595, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1eadad9e9" String1="[RC] Job 514, Task 14605, Resource 0, About to stop monitor task." Line 49796: d,03/11/2021 20:05:43.595, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="8932" TS="0x01d716b1eadad9e9" String1="[RC] Job 514, Task 14605, Resource 0, [38] Task 514.14605 removed from global list" Line 49803: d,03/11/2021 20:05:43.596, SrcFile="HpcScheduler" SrcFunc="" SrcLine="0" Pid="3544" Tid="9112" TS="0x01d716b1eadb00cf" String1="[RC] Job 514, Task 14605, Resource 0, [JobMonitor.OnTaskFinished] TaskFinished event received. Task state is Failed" Other 2 instances – UNCWHPCCD013 (CN which disconnected): Failed stage out: 3/11/2021 2:01:00 PM CST - 3/11/2021 2:01:19 PM CST => 3/11/2021 6:01:00 PM UTC - 3/11/2021 6:01:19 PM UTC UNCWHPCCD014 (CN which disconnected): Failed stage out: 3/11/2021 2:00:54 PM CST - 3/11/2021 2:01:13 PM CST => 3/11/2021 6:00:54 PM UTC - 3/11/2021 6:01:13 PM UTC -- Regards, Sudarshan Support Escalation Engineer | Devices and Deployment | Microsoft Work Hours: 06:00 AM – 03:00 PM ET (Mon-Fri) | Office (US): +1 855 425 1603 x 81320
YutongSun commented 3 years ago

The HpcNodeManager service was hung for unknown reason.

webstc2 commented 3 years ago

Is there any effort being made to determine why it is getting hung? This is an important issue and is affecting the usability of the cluster.

jenrjones commented 2 years ago

I see that this issue is still open, and I am also running into the same issue. If I submit a job that contains 300 tasks, distributed across 20 compute nodes, somewhere between 5 and 10 percent of the tasks will fail, across all of the nodes without discrimination, but other tasks on the same nodes will finish without error. We are not seeing issues with the HpcNodeManager service. We are running HPC Pack 2019 version 6.0.7205.0.

Here is the error from Cluster Manager on one of the failed tasks (and it is always this same error):

System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ClientFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state.

Server stack trace: at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen() at System.ServiceModel.Channels.OutputChannel.BeginSend(Message message, TimeSpan timeout, AsyncCallback callback, Object state) at System.ServiceModel.Dispatcher.DuplexChannelBinder.BeginRequest(Message message, TimeSpan timeout, AsyncCallback callback, Object state) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartSend(Boolean completedSynchronously) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureOpen(IAsyncResult result, Boolean completedSynchronously) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureOpen(Boolean completedSynchronously) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureInteractiveInit(IAsyncResult result, Boolean completedSynchronously) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureInteractiveInit() at System.ServiceModel.Channels.ServiceChannel.BeginCall(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, TimeSpan timeout, AsyncCallback callback, Object asyncState) at System.ServiceModel.Channels.ServiceChannel.BeginCall(ServiceChannel channel, ProxyOperationRuntime operation, Object[] ins, AsyncCallback callback, Object asyncState) at System.Threading.Tasks.TaskFactory1.FromAsyncImpl[TArg1,TArg2,TArg3](Func6 beginMethod, Func2 endFunction, Action1 endAction, TArg1 arg1, TArg2 arg2, TArg3 arg3, Object state, TaskCreationOptions creationOptions) at System.ServiceModel.Channels.ServiceChannelProxy.TaskCreator.CreateTask(ServiceChannel channel, ProxyOperationRuntime operation, Object[] inputParameters) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]: at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) at Microsoft.Hpc.Activation.INodeManagerServiceAsync.StartJobAndTaskAsync(Int32 jobId, String userAccount, Byte[] cipherText, Byte[] iv, Int32 taskId, ProcessStartInfo startInfo) at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeManagerServiceProxy.<>c__DisplayClass7_0.<b0>d.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at Microsoft.Hpc.WcfReliableClient`1.d9.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at Microsoft.Hpc.WcfReliableClient1.<InvokeOperationWithRetryAsync>d__9.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.Hpc.WcfReliableClient1.d8.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeManagerServiceProxy.d7.MoveNext()

YutongSun commented 2 years ago

@jenrjones , this looks network connectivity issue. Could you collect the following logs and send the zip to hpcpack@microsoft.com to check? Please also indicate the failed job/task IDs.

jenrjones commented 2 years ago

Hello,

We opened a ticket with Microsoft Support to troubleshoot the issue. As we awaited their analysis, we continued our own troubleshooting, and discovered that the Windows image we were using, with more recent CIS benchmark settings than the previous image, was the culprit. We have not narrowed down which setting broke HPC communication, but when we launched the cluster on a slightly older image, the error went away. If we figure out which setting broke HPC communication, I will post it here.