Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.
MIT License
29 stars 11 forks source link

Various job failures with Ubuntu 18.04 compute nodes shortly after booting up #19

Closed dterziev closed 3 years ago

dterziev commented 3 years ago

Problem Description

HPC Pack 2019 (6.0.7214) with Ubuntu 18.04 compute nodes repeatedly kills job tasks, fails the jobs or runs many instances of the same task shortly after compute nodes boot up. Same issue is observed with HPC Pack 2016 Update 2 with Ubuntu compute nodes.

Steps to Reproduce

Install HPC Pack 2019 refresh 6.0.7214 on a Windows 2012 R2 (using SQL Express 2017) and attach one or two Ubuntu 18.04 compute nodes and bring them Online, then shutdown all nodes.

  1. Start the headnode.
  2. Start 20 jobs with a single parametric sweep of 50 to 70 tasks each. Each task takes about 3-4 minutes to complete. The jobs will stay Queued because the linux compute nodes are not yet running.
  3. Start the linux compute nodes

Expected Results

All tasks from all jobs to successfully run once without failing.

Actual Results

Problem 1: The HPC node agent repeatedly starts and kills the tasks shortly after the compute node boots up (see the log below). After a few attempts some of the jobs continue running and complete successfully. Problem 2: Some of the jobs (Job 281 in the log below) fail with "Job failed to start on some nodes or some nodes became unreachable". Problem 3: Multiple instances of the same task are running at the same time on the same node (see job 281). This can lead to corrupting any state that this task may change while it is running. Problem 4: Some of the jobs (Job 280) fail because one of the tasks in the parametric sweep is reported to be running in HPC Job Manager long after it has completed. After the maximum run time for the task has elapsed, HPC cancels the task and that fails the job.

All these issues seem to happen shortly after a compute node boots up and are observed only with Linux compute nodes. It seems that the common theme is that in the first few minutes after the agent starts up, the node may become unreachable which forces the headnode to kill all tasks from the jobs currently running on that node.

Additional Logs

[08/23 09:41:52.579] 1574 info: Version: 2.5.0.0
[08/23 09:41:52.583] 1574 info: Cleaning up zombie processes
[08/23 09:41:52.601] 1574 info: Cleanup zombie result: 
Cleaning up tasks in CGroup...

[08/23 09:41:52.601] 1574 info: Check nvidia-smi and enable persistence mode for GPU.
[08/23 09:41:52.602] 1574 warning: Executing nvidia-smi -pm 1 2>/dev/null, error code 127
[08/23 09:41:52.604] 1574 warning: GPU metrics will not be collected.
[08/23 09:41:52.606] 1643 info: Monitoring thread created. Interval 1
[08/23 09:41:52.606] 1574 info: HostsFetchInterval not specified or invalid, use the default interval 300 seconds.
[08/23 09:41:52.609] 1574 info: Main: entering sleep loop.
[08/23 09:41:52.614] 1606 info: Opening at https://0.0.0.0:40002, result opened.
[08/23 09:41:52.633] 1643 warning: Failed to query Azure node metadata. Status code: 404. Remaining retry count: 4.
[08/23 09:41:52.639] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:41:52.639] 1645 info: Update tasks' statistics.
[08/23 09:41:53.646] 1608 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.646] 1607 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.649] 1603 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.652] 1611 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.653] 1604 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.720] 1629 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.722] 1632 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.724] 1633 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.727] 1638 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.733] 1630 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.758] 1605 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.778] 1614 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.796] 1604 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.815] 1626 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.821] 1616 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.826] 1613 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.833] 1609 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.888] 1607 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.890] 1637 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:53.891] 1611 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:54.021] 1633 info: Job 275, Task 9341.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.021] 1619 info: Job 267, Task 9291.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.021] 1638 info: Job 266, Task 9337.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.021] 1634 info: Job 274, Task 9292.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.021] 1622 info: Job 272, Task 9290.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.030] 1635 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.030] 1635 info: Job 275, Task 9341.0: Change requeue count from 0 to 0, processKey 9341
[08/23 09:41:54.030] 1640 info: Replied with content null
[08/23 09:41:54.034] 1612 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.034] 1612 info: Job 266, Task 9337.0: Change requeue count from 0 to 0, processKey 9337
[08/23 09:41:54.034] 1612 info: Replied with content null
[08/23 09:41:54.039] 1617 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.039] 1617 info: Job 272, Task 9290.0: Change requeue count from 0 to 0, processKey 9290
[08/23 09:41:54.039] 1639 info: Replied with content null
[08/23 09:41:54.043] 1622 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.043] 1622 info: Job 274, Task 9292.0: Change requeue count from 0 to 0, processKey 9292
[08/23 09:41:54.044] 1629 info: Replied with content null
[08/23 09:41:54.048] 1633 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.048] 1633 info: Job 267, Task 9291.0: Change requeue count from 0 to 0, processKey 9291
[08/23 09:41:54.049] 1632 info: Replied with content null
[08/23 09:41:54.083] 1615 info: Job 283, Task 9724.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.083] 1607 info: Job 279, Task 9809.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.083] 1634 info: Job 277, Task 9723.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.083] 1613 info: Job 269, Task 9810.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.083] 1606 info: Job 268, Task 9597.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.088] 1621 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.088] 1621 info: Job 283, Task 9724.0: Change requeue count from 0 to 0, processKey 9724
[08/23 09:41:54.089] 1611 info: Replied with content null
[08/23 09:41:54.093] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.093] 1610 info: Job 277, Task 9723.0: Change requeue count from 0 to 0, processKey 9723
[08/23 09:41:54.093] 1632 info: Replied with content null
[08/23 09:41:54.098] 1608 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.098] 1608 info: Job 268, Task 9597.0: Change requeue count from 0 to 0, processKey 9597
[08/23 09:41:54.098] 1607 info: Replied with content null
[08/23 09:41:54.103] 1637 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.103] 1637 info: Job 279, Task 9809.0: Change requeue count from 0 to 0, processKey 9809
[08/23 09:41:54.104] 1616 info: Replied with content null
[08/23 09:41:54.109] 1618 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.109] 1618 info: Job 269, Task 9810.0: Change requeue count from 0 to 0, processKey 9810
[08/23 09:41:54.110] 1635 info: Replied with content null
[08/23 09:41:54.113] 2193 info: Started reading pipe thread
[08/23 09:41:54.114] 2200 info: Started reading pipe thread
[08/23 09:41:54.116] 2211 info: Started reading pipe thread
[08/23 09:41:54.116] 2214 info: Started reading pipe thread
[08/23 09:41:54.116] 2216 info: Started reading pipe thread
[08/23 09:41:54.130] 1634 info: Job 276, Task 10119.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.130] 1638 info: Job 285, Task 9991.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.134] 2366 info: Started reading pipe thread
[08/23 09:41:54.137] 1604 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.137] 1604 info: Job 276, Task 10119.0: Change requeue count from 0 to 0, processKey 10119
[08/23 09:41:54.138] 1618 info: Replied with content null
[08/23 09:41:54.143] 2427 info: Started reading pipe thread
[08/23 09:41:54.144] 1605 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.144] 1605 info: Job 285, Task 9991.0: Change requeue count from 0 to 0, processKey 9991
[08/23 09:41:54.145] 1626 info: Replied with content null
[08/23 09:41:54.147] 2448 info: Started reading pipe thread
[08/23 09:41:54.157] 2514 info: Started reading pipe thread
[08/23 09:41:54.158] 2525 info: Started reading pipe thread
[08/23 09:41:54.161] 1630 info: Job 273, Task 10033.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.161] 1604 info: Job 271, Task 10166.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.170] 1630 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.170] 1630 info: Job 273, Task 10033.0: Change requeue count from 0 to 0, processKey 10033
[08/23 09:41:54.171] 1639 info: Replied with content null
[08/23 09:41:54.176] 1604 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.176] 1604 info: Job 271, Task 10166.0: Change requeue count from 0 to 0, processKey 10166
[08/23 09:41:54.176] 1603 info: Replied with content null
[08/23 09:41:54.192] 1602 info: Job 278, Task 10323.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.192] 1615 info: Job 282, Task 9658.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.192] 1630 info: Job 281, Task 9731.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.197] 2733 info: Started reading pipe thread
[08/23 09:41:54.197] 2735 info: Started reading pipe thread
[08/23 09:41:54.197] 1616 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.197] 1616 info: Job 278, Task 10323.0: Change requeue count from 0 to 0, processKey 10323
[08/23 09:41:54.198] 1610 info: Replied with content null
[08/23 09:41:54.203] 1637 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.203] 1637 info: Job 281, Task 9731.0: Change requeue count from 0 to 0, processKey 9731
[08/23 09:41:54.204] 1635 info: Replied with content null
[08/23 09:41:54.211] 1611 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.211] 1611 info: Job 282, Task 9658.0: Change requeue count from 0 to 0, processKey 9658
[08/23 09:41:54.211] 1613 info: Replied with content null
[08/23 09:41:54.215] 2843 info: Started reading pipe thread
[08/23 09:41:54.222] 2883 info: Started reading pipe thread
[08/23 09:41:54.245] 3013 info: Started reading pipe thread
[08/23 09:41:54.250] 3028 info: Started reading pipe thread
[08/23 09:41:54.255] 1624 info: Job 280, Task 10238.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.255] 1619 info: Job 270, Task 10404.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.255] 1641 info: Job 284, Task 10443.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:54.258] 3055 info: Started reading pipe thread
[08/23 09:41:54.260] 1624 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.260] 1624 info: Job 284, Task 10443.0: Change requeue count from 0 to 0, processKey 10443
[08/23 09:41:54.260] 1610 info: Replied with content null
[08/23 09:41:54.266] 1630 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.266] 1630 info: Job 270, Task 10404.0: Change requeue count from 0 to 0, processKey 10404
[08/23 09:41:54.267] 1605 info: Replied with content null
[08/23 09:41:54.272] 1602 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:54.272] 1602 info: Job 280, Task 10238.0: Change requeue count from 0 to 0, processKey 10238
[08/23 09:41:54.272] 1614 info: Replied with content null
[08/23 09:41:54.311] 3326 info: Started reading pipe thread
[08/23 09:41:54.313] 3328 info: Started reading pipe thread
[08/23 09:41:54.322] 3363 info: Started reading pipe thread
[08/23 09:41:55.607] 1644 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:41:55.632] 1646 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for MonitoringStatefulService
[08/23 09:41:55.632] 1646 info: UdpReporter, getaddrinfo server IP-AC130210, port 9894
[08/23 09:41:55.632] 1646 info: UdpReporter, connect succeeds.
[08/23 09:41:55.639] 1647 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:41:55.644] 1647 info: Hosts file manager: hosts file update received with update Id 2552c29c-b78b-4be5-921d-7d310decaf45
[08/23 09:41:55.644] 1647 info: Hosts file manager: update local hosts file /etc/hosts
[08/23 09:41:55.704] 1631 info: Request: Uri /api/IP-172-19-3-105/metric
[08/23 09:41:55.704] 1637 info: Request: Uri /api/IP-172-19-3-105/metricconfig
[08/23 09:41:55.705] 1616 info: Replied with content null
[08/23 09:41:55.713] 1603 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:41:55.713] 1617 info: Replied with content null
[08/23 09:41:55.993] 1623 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.000] 1626 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.000] 1617 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.000] 1607 info: Job 282, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.000] 1607 info: Job 282, Task 999.999: EndJob: starting
[08/23 09:41:56.019] 2822 info: Job 282, Task 9658.0: Process 3054: exit code 137
[08/23 09:41:56.021] 1616 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.021] 1635 info: Job 279, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.034] 1621 info: Job 272, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.041] 1607 info: Job 282, Task 999.999: EndJob: ended {"JobId":282,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9658,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26308}]}
[08/23 09:41:56.041] 1607 info: Job 282, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.041] 1607 info: Job 282, Task 999.999: EndJob: 19 jobs associated with the user root
[08/23 09:41:56.041] 1635 info: Job 279, Task 999.999: EndJob: starting
[08/23 09:41:56.042] 1603 info: Replied with content {"JobId":282,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9658,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26308}]}
[08/23 09:41:56.042] 1615 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.042] 1638 info: Job 275, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.042] 1610 info: Job 274, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.048] 1634 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.048] 1630 info: Job 266, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.057] 2111 info: Job 279, Task 9809.0: Process 2512: exit code 137
[08/23 09:41:56.077] 1608 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.077] 1624 info: Job 267, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.080] 1635 info: Job 279, Task 999.999: EndJob: ended {"JobId":279,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":80,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9809,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26504}]}
[08/23 09:41:56.080] 1635 info: Job 279, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.080] 1635 info: Job 279, Task 999.999: EndJob: 18 jobs associated with the user root
[08/23 09:41:56.080] 1621 info: Job 272, Task 999.999: EndJob: starting
[08/23 09:41:56.080] 1635 info: Replied with content {"JobId":279,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":80,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9809,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26504}]}
[08/23 09:41:56.082] 1603 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.082] 1612 info: Job 277, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.089] 1605 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.089] 1614 info: Job 276, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.092] 1607 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.092] 1615 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.096] 1604 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.096] 1625 info: Job 269, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.097] 1905 info: Job 272, Task 9290.0: Process 2213: exit code 137
[08/23 09:41:56.118] 1641 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.118] 1639 info: Job 273, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.120] 1603 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.120] 1617 info: Job 280, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.122] 1619 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.123] 1609 info: Job 268, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.127] 1619 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.127] 1637 info: Job 285, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.154] 1626 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.154] 1636 info: Job 278, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.157] 1608 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.158] 1606 info: Job 271, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.161] 1604 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.162] 1631 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.164] 1621 info: Job 272, Task 999.999: EndJob: ended {"JobId":272,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9290,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26912}]}
[08/23 09:41:56.164] 1621 info: Job 272, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.164] 1621 info: Job 272, Task 999.999: EndJob: 17 jobs associated with the user root
[08/23 09:41:56.164] 1621 info: Replied with content {"JobId":272,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9290,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26912}]}
[08/23 09:41:56.164] 1638 info: Job 275, Task 999.999: EndJob: starting
[08/23 09:41:56.181] 1886 info: Job 275, Task 9341.0: Process 2208: exit code 137
[08/23 09:41:56.184] 1611 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.185] 1632 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.185] 1603 info: Job 270, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.185] 1633 info: Job 284, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.249] 1638 info: Job 275, Task 999.999: EndJob: ended {"JobId":275,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":60,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9341,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":27560}]}
[08/23 09:41:56.249] 1638 info: Job 275, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.249] 1638 info: Job 275, Task 999.999: EndJob: 16 jobs associated with the user root
[08/23 09:41:56.249] 1610 info: Job 274, Task 999.999: EndJob: starting
[08/23 09:41:56.250] 1616 info: Replied with content {"JobId":275,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":60,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9341,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":27560}]}
[08/23 09:41:56.266] 1920 info: Job 274, Task 9292.0: Process 2194: exit code 137
[08/23 09:41:56.289] 1610 info: Job 274, Task 999.999: EndJob: ended {"JobId":274,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9292,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":26312}]}
[08/23 09:41:56.289] 1610 info: Job 274, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.289] 1610 info: Job 274, Task 999.999: EndJob: 15 jobs associated with the user root
[08/23 09:41:56.289] 1630 info: Job 266, Task 999.999: EndJob: starting
[08/23 09:41:56.290] 1613 info: Replied with content {"JobId":274,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9292,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":26312}]}
[08/23 09:41:56.376] 1895 info: Job 266, Task 9337.0: Process 2190: exit code 137
[08/23 09:41:56.396] 1604 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.397] 1605 info: Job 267, Task 9291.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.399] 1630 info: Job 266, Task 999.999: EndJob: ended {"JobId":266,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9337,"TaskRequeueCount":0,"UserProcessorTime":290,"WorkingSet":14928}]}
[08/23 09:41:56.399] 1630 info: Job 266, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.399] 1630 info: Job 266, Task 999.999: EndJob: 14 jobs associated with the user root
[08/23 09:41:56.399] 1624 info: Job 267, Task 999.999: EndJob: starting
[08/23 09:41:56.399] 1630 info: Replied with content {"JobId":266,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9337,"TaskRequeueCount":0,"UserProcessorTime":290,"WorkingSet":14928}]}
[08/23 09:41:56.416] 1938 info: Job 267, Task 9291.0: Process 2198: exit code 137
[08/23 09:41:56.438] 1618 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.438] 1627 info: Job 283, Task 9724.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.442] 1616 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.443] 1630 info: Job 269, Task 9810.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.446] 1620 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.447] 1640 info: Job 285, Task 9991.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.452] 1623 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.453] 1619 info: Job 268, Task 9597.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.473] 1624 info: Job 267, Task 999.999: EndJob: ended {"JobId":267,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9291,"TaskRequeueCount":0,"UserProcessorTime":330,"WorkingSet":14504}]}
[08/23 09:41:56.473] 1624 info: Job 267, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.473] 1624 info: Job 267, Task 999.999: EndJob: 13 jobs associated with the user root
[08/23 09:41:56.473] 1612 info: Job 277, Task 999.999: EndJob: starting
[08/23 09:41:56.474] 1621 info: Replied with content {"JobId":267,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9291,"TaskRequeueCount":0,"UserProcessorTime":330,"WorkingSet":14504}]}
[08/23 09:41:56.474] 1626 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.474] 1626 info: Job 281, Task 9731.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.483] 1620 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.483] 1604 info: Job 277, Task 9723.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.490] 2055 info: Job 277, Task 9723.0: Process 2423: exit code 137
[08/23 09:41:56.513] 1628 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.513] 1608 info: Job 276, Task 10119.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.528] 1621 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.529] 1624 info: Job 280, Task 10238.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.559] 1607 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.559] 1604 info: Job 270, Task 10404.1: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.563] 1628 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.563] 1638 info: Job 267, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.589] 1612 info: Job 277, Task 999.999: EndJob: ended {"JobId":277,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9723,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":27180}]}
[08/23 09:41:56.589] 1612 info: Job 277, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.589] 1612 info: Job 277, Task 999.999: EndJob: 12 jobs associated with the user root
[08/23 09:41:56.589] 1614 info: Job 276, Task 999.999: EndJob: starting
[08/23 09:41:56.590] 1623 info: Replied with content {"JobId":277,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9723,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":27180}]}
[08/23 09:41:56.606] 2398 info: Job 276, Task 10119.0: Process 2732: exit code 137
[08/23 09:41:56.646] 1618 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.647] 1622 info: Job 267, Task 9291.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.668] 1614 info: Job 276, Task 999.999: EndJob: ended {"JobId":276,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10119,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":26204}]}
[08/23 09:41:56.668] 1614 info: Job 276, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.668] 1614 info: Job 276, Task 999.999: EndJob: 11 jobs associated with the user root
[08/23 09:41:56.669] 1614 info: Replied with content {"JobId":276,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10119,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":26204}]}
[08/23 09:41:56.669] 1615 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:56.670] 1626 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.671] 1621 info: Job 277, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.686] 2046 info: Job 283, Task 9724.0: Process 2360: exit code 137
[08/23 09:41:56.708] 1615 info: Job 283, Task 999.999: EndJob: ended {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":70,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26616}]}
[08/23 09:41:56.709] 1615 info: Job 283, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.709] 1615 info: Job 283, Task 999.999: EndJob: 10 jobs associated with the user root
[08/23 09:41:56.709] 1625 info: Job 269, Task 999.999: EndJob: starting
[08/23 09:41:56.709] 1641 info: Replied with content {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":70,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":0,"UserProcessorTime":660,"WorkingSet":26616}]}
[08/23 09:41:56.754] 1629 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.754] 1615 info: Job 276, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.765] 2166 info: Job 269, Task 9810.0: Process 2522: exit code 137
[08/23 09:41:56.767] 1602 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.767] 1614 info: Job 277, Task 9723.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.790] 1625 info: Job 269, Task 999.999: EndJob: ended {"JobId":269,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9810,"TaskRequeueCount":0,"UserProcessorTime":690,"WorkingSet":26468}]}
[08/23 09:41:56.790] 1625 info: Job 269, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.790] 1625 info: Job 269, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:56.790] 1625 info: Replied with content {"JobId":269,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9810,"TaskRequeueCount":0,"UserProcessorTime":690,"WorkingSet":26468}]}
[08/23 09:41:56.790] 1639 info: Job 273, Task 999.999: EndJob: starting
[08/23 09:41:56.795] 1608 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.796] 1618 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.807] 2600 info: Job 273, Task 10033.0: Process 2842: exit code 137
[08/23 09:41:56.832] 1639 info: Job 273, Task 999.999: EndJob: ended {"JobId":273,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10033,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26476}]}
[08/23 09:41:56.832] 1639 info: Job 273, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.832] 1639 info: Job 273, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:56.832] 1617 info: Job 280, Task 999.999: EndJob: starting
[08/23 09:41:56.832] 1629 info: Replied with content {"JobId":273,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10033,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26476}]}
[08/23 09:41:56.840] 1641 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.841] 1607 info: Job 276, Task 10119.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.868] 1629 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.868] 1613 info: Job 269, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:56.888] 3129 info: Job 280, Task 10238.0: Process 3362: exit code 137
[08/23 09:41:56.910] 1617 info: Job 280, Task 999.999: EndJob: ended {"JobId":280,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10238,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":25636}]}
[08/23 09:41:56.910] 1617 info: Job 280, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.910] 1617 info: Job 280, Task 999.999: EndJob: 7 jobs associated with the user root
[08/23 09:41:56.910] 1609 info: Job 268, Task 999.999: EndJob: starting
[08/23 09:41:56.910] 1617 info: Replied with content {"JobId":280,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10238,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":25636}]}
[08/23 09:41:56.912] 1641 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.912] 1604 info: Job 283, Task 9724.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.927] 2065 info: Job 268, Task 9597.0: Process 2442: exit code 137
[08/23 09:41:56.951] 1609 info: Job 268, Task 999.999: EndJob: ended {"JobId":268,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9597,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26084}]}
[08/23 09:41:56.951] 1609 info: Job 268, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:56.951] 1609 info: Job 268, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:56.951] 1639 info: Replied with content {"JobId":268,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9597,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26084}]}
[08/23 09:41:56.951] 1626 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:56.951] 1637 info: Job 285, Task 999.999: EndJob: starting
[08/23 09:41:56.952] 1614 info: Job 269, Task 9810.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:56.986] 1617 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:56.986] 1629 info: Job 280, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.020] 2428 info: Job 285, Task 9991.0: Process 2730: exit code 137
[08/23 09:41:57.034] 1617 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.034] 1612 info: Job 268, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.044] 1637 info: Job 285, Task 999.999: EndJob: ended {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26192}]}
[08/23 09:41:57.044] 1637 info: Job 285, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.044] 1637 info: Job 285, Task 999.999: EndJob: 5 jobs associated with the user root
[08/23 09:41:57.045] 1636 info: Job 278, Task 999.999: EndJob: starting
[08/23 09:41:57.045] 1637 info: Replied with content {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":26192}]}
[08/23 09:41:57.061] 2736 info: Job 278, Task 10323.0: Process 3011: exit code 137
[08/23 09:41:57.081] 1641 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.081] 1639 info: Job 280, Task 10238.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.117] 1607 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.117] 1609 info: Job 285, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.129] 1636 info: Job 278, Task 999.999: EndJob: ended {"JobId":278,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10323,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":25800}]}
[08/23 09:41:57.129] 1636 info: Job 278, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.129] 1636 info: Job 278, Task 999.999: EndJob: 4 jobs associated with the user root
[08/23 09:41:57.129] 1641 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.129] 1606 info: Job 271, Task 999.999: EndJob: starting
[08/23 09:41:57.129] 1636 info: Replied with content {"JobId":278,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":50,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10323,"TaskRequeueCount":0,"UserProcessorTime":680,"WorkingSet":25800}]}
[08/23 09:41:57.131] 1607 info: Job 268, Task 9597.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.147] 2643 info: Job 271, Task 10166.0: Process 2881: exit code 137
[08/23 09:41:57.200] 1632 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.200] 1639 info: Job 285, Task 9991.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.227] 1606 info: Job 271, Task 999.999: EndJob: ended {"JobId":271,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":60,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10166,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":25340}]}
[08/23 09:41:57.227] 1606 info: Job 271, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.227] 1606 info: Job 271, Task 999.999: EndJob: 3 jobs associated with the user root
[08/23 09:41:57.227] 1631 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:57.227] 1606 info: Replied with content {"JobId":271,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":60,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10166,"TaskRequeueCount":0,"UserProcessorTime":670,"WorkingSet":25340}]}
[08/23 09:41:57.244] 2774 info: Job 281, Task 9731.0: Process 3027: exit code 137
[08/23 09:41:57.267] 1631 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":0,"UserProcessorTime":690,"WorkingSet":25192}]}
[08/23 09:41:57.267] 1631 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.268] 1631 info: Job 281, Task 999.999: EndJob: 2 jobs associated with the user root
[08/23 09:41:57.268] 1603 info: Job 270, Task 999.999: EndJob: starting
[08/23 09:41:57.268] 1604 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":0,"UserProcessorTime":690,"WorkingSet":25192}]}
[08/23 09:41:57.342] 1641 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.342] 1636 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.344] 3106 info: Job 270, Task 10404.0: Process 3327: exit code 137
[08/23 09:41:57.367] 1603 info: Job 270, Task 999.999: EndJob: ended {"JobId":270,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10404,"TaskRequeueCount":0,"UserProcessorTime":620,"WorkingSet":18892}]}
[08/23 09:41:57.367] 1603 info: Job 270, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.367] 1603 info: Job 270, Task 999.999: EndJob: 1 jobs associated with the user root
[08/23 09:41:57.367] 1633 info: Job 284, Task 999.999: EndJob: starting
[08/23 09:41:57.367] 1603 info: Replied with content {"JobId":270,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10404,"TaskRequeueCount":0,"UserProcessorTime":620,"WorkingSet":18892}]}
[08/23 09:41:57.384] 3067 info: Job 284, Task 10443.0: Process 3324: exit code 137
[08/23 09:41:57.406] 1633 info: Job 284, Task 999.999: EndJob: ended {"JobId":284,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10443,"TaskRequeueCount":0,"UserProcessorTime":800,"WorkingSet":24936}]}
[08/23 09:41:57.406] 1633 info: Job 284, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.406] 1633 info: Job 284, Task 999.999: EndJob: 0 jobs associated with the user root
[08/23 09:41:57.407] 1603 info: Replied with content {"JobId":284,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10443,"TaskRequeueCount":0,"UserProcessorTime":800,"WorkingSet":24936}]}
[08/23 09:41:57.411] 1605 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.411] 1605 info: Job 267, Task 9291.1: Change requeue count from 0 to 1, processKey 4294976587
[08/23 09:41:57.411] 1604 info: Replied with content null
[08/23 09:41:57.418] 1627 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.418] 1627 info: Job 283, Task 9724.1: Change requeue count from 0 to 1, processKey 4294977020
[08/23 09:41:57.418] 1604 info: Replied with content null
[08/23 09:41:57.423] 1634 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.423] 1634 info: Job 285, Task 9991.1: Change requeue count from 0 to 1, processKey 4294977287
[08/23 09:41:57.424] 1627 info: Replied with content null
[08/23 09:41:57.426] 1637 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.428] 1634 info: Job 281, Task 9731.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.431] 1616 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.431] 1616 info: Job 281, Task 9731.1: Change requeue count from 0 to 1, processKey 4294977027
[08/23 09:41:57.431] 1616 info: Replied with content null
[08/23 09:41:57.437] 1611 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.437] 1611 info: Job 277, Task 9723.1: Change requeue count from 0 to 1, processKey 4294977019
[08/23 09:41:57.437] 1611 info: Replied with content null
[08/23 09:41:57.445] 1620 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.448] 1606 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.449] 1625 info: Job 270, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.451] 1624 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.451] 1624 info: Job 270, Task 10404.1: Change requeue count from 0 to 1, processKey 4294977700
[08/23 09:41:57.452] 1624 info: Replied with content null
[08/23 09:41:57.457] 1622 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.457] 1622 info: Job 267, Task 9291.2: Change requeue count from 1 to 2, processKey 4294976587
[08/23 09:41:57.457] 1622 warning: Job 267, Task 9291.2: The task has started already.
[08/23 09:41:57.457] 1615 info: Job 276, Task 999.999: EndJob: starting
[08/23 09:41:57.457] 1622 info: Replied with content null
[08/23 09:41:57.457] 1615 warning: Job 276, Task 999.999: EndJob: Job is already finished
[08/23 09:41:57.457] 1617 info: Replied with content null
[08/23 09:41:57.464] 1623 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.464] 6631 info: Started reading pipe thread
[08/23 09:41:57.464] 1623 info: Job 277, Task 9723.2: Change requeue count from 1 to 2, processKey 4294977019
[08/23 09:41:57.464] 6637 info: Started reading pipe thread
[08/23 09:41:57.465] 1623 warning: Job 277, Task 9723.2: The task has started already.
[08/23 09:41:57.465] 1618 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:57.465] 1623 info: Replied with content null
[08/23 09:41:57.466] 6654 info: Started reading pipe thread
[08/23 09:41:57.476] 6741 info: Started reading pipe thread
[08/23 09:41:57.480] 6771 info: Started reading pipe thread
[08/23 09:41:57.486] 6393 info: Job 283, Task 9724.1: Process 6627: exit code 137
[08/23 09:41:57.489] 6393 warning: Executing head -c 1500 /tmp/nodemanager_task_9724_1.Ws0NAF/stdout.txt, error code 1
[08/23 09:41:57.492] 6393 warning: Executing head -c 1500 /tmp/nodemanager_task_9724_1.Ws0NAF/stderr.txt, error code 1
[08/23 09:41:57.502] 6890 info: Started reading pipe thread
[08/23 09:41:57.512] 1618 info: Job 283, Task 999.999: EndJob: ended {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":1,"UserProcessorTime":0,"WorkingSet":1560}]}
[08/23 09:41:57.512] 1618 info: Job 283, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.512] 1618 info: Job 283, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:57.512] 1618 info: Replied with content {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":1,"UserProcessorTime":0,"WorkingSet":1560}]}
[08/23 09:41:57.518] 1608 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.518] 1608 info: Job 276, Task 10119.2: Change requeue count from 0 to 2, processKey 8589944711
[08/23 09:41:57.518] 1641 info: Replied with content null
[08/23 09:41:57.523] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.523] 1610 info: Job 283, Task 9724.2: Change requeue count from 0 to 2, processKey 8589944316
[08/23 09:41:57.524] 1629 info: Job 280, Task 999.999: EndJob: starting
[08/23 09:41:57.524] 1629 warning: Job 280, Task 999.999: EndJob: Job is already finished
[08/23 09:41:57.524] 1633 info: Replied with content null
[08/23 09:41:57.524] 1629 info: Job 280, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.524] 1629 info: Job 280, Task 999.999: EndJob: 7 jobs associated with the user root
[08/23 09:41:57.524] 1612 info: Job 268, Task 999.999: EndJob: starting
[08/23 09:41:57.524] 1612 warning: Job 268, Task 999.999: EndJob: Job is already finished
[08/23 09:41:57.524] 1617 info: Replied with content null
[08/23 09:41:57.525] 1612 info: Replied with content null
[08/23 09:41:57.525] 1609 info: Job 285, Task 999.999: EndJob: starting
[08/23 09:41:57.540] 1606 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.540] 1608 info: Job 270, Task 10404.2: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.541] 6406 info: Job 285, Task 9991.1: Process 6652: exit code 137
[08/23 09:41:57.561] 7170 info: Started reading pipe thread
[08/23 09:41:57.566] 7187 info: Started reading pipe thread
[08/23 09:41:57.567] 1609 info: Job 285, Task 999.999: EndJob: ended {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":1,"UserProcessorTime":10,"WorkingSet":4420}]}
[08/23 09:41:57.567] 1609 info: Job 285, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.567] 1609 info: Job 285, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:57.568] 1616 info: Replied with content {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":1,"UserProcessorTime":10,"WorkingSet":4420}]}
[08/23 09:41:57.572] 1628 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.572] 1628 info: Job 268, Task 9597.2: Change requeue count from 0 to 2, processKey 8589944189
[08/23 09:41:57.572] 1628 info: Replied with content null
[08/23 09:41:57.579] 1639 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.579] 1639 info: Job 285, Task 9991.2: Change requeue count from 0 to 2, processKey 8589944583
[08/23 09:41:57.580] 1632 info: Replied with content null
[08/23 09:41:57.587] 1624 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.587] 1630 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.588] 1618 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.593] 1619 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.593] 1619 info: Job 268, Task 9597.1: Change requeue count from 2 to 1, processKey 8589944189
[08/23 09:41:57.593] 1619 warning: Job 268, Task 9597.1: The task has started already.
[08/23 09:41:57.594] 1616 info: Replied with content null
[08/23 09:41:57.599] 1640 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.599] 1640 info: Job 276, Task 10119.1: Change requeue count from 2 to 1, processKey 8589944711
[08/23 09:41:57.599] 1640 warning: Job 276, Task 10119.1: The task has started already.
[08/23 09:41:57.599] 1625 info: Job 270, Task 999.999: EndJob: starting
[08/23 09:41:57.600] 1606 info: Replied with content null
[08/23 09:41:57.616] 6538 info: Job 270, Task 10404.1: Process 6889: exit code 137
[08/23 09:41:57.619] 7420 info: Started reading pipe thread
[08/23 09:41:57.624] 7443 info: Started reading pipe thread
[08/23 09:41:57.642] 1625 info: Job 270, Task 999.999: EndJob: ended {"JobId":270,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10404,"TaskRequeueCount":1,"UserProcessorTime":50,"WorkingSet":7080}]}
[08/23 09:41:57.642] 1625 info: Job 270, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.642] 1625 info: Job 270, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:57.642] 1638 info: Job 267, Task 999.999: EndJob: starting
[08/23 09:41:57.644] 1626 info: Replied with content {"JobId":270,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":10404,"TaskRequeueCount":1,"UserProcessorTime":50,"WorkingSet":7080}]}
[08/23 09:41:57.645] 1605 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.645] 1628 info: Job 283, Task 9724.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.646] 1622 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.647] 1641 info: Job 285, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.661] 6380 info: Job 267, Task 9291.1: Process 6634: exit code 137
[08/23 09:41:57.684] 1638 info: Job 267, Task 999.999: EndJob: ended {"JobId":267,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9291,"TaskRequeueCount":2,"UserProcessorTime":120,"WorkingSet":10092}]}
[08/23 09:41:57.684] 1638 info: Job 267, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.684] 1638 info: Job 267, Task 999.999: EndJob: 7 jobs associated with the user root
[08/23 09:41:57.684] 1611 info: Replied with content {"JobId":267,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9291,"TaskRequeueCount":2,"UserProcessorTime":120,"WorkingSet":10092}]}
[08/23 09:41:57.684] 1621 info: Job 277, Task 999.999: EndJob: starting
[08/23 09:41:57.703] 1603 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.703] 6448 info: Job 277, Task 9723.1: Process 6769: exit code 137
[08/23 09:41:57.704] 1604 info: Job 285, Task 9991.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.711] 1615 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.711] 1623 info: Job 270, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.727] 1621 info: Job 277, Task 999.999: EndJob: ended {"JobId":277,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9723,"TaskRequeueCount":2,"UserProcessorTime":150,"WorkingSet":10916}]}
[08/23 09:41:57.727] 1621 info: Job 277, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.727] 1621 info: Job 277, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:57.727] 1631 info: Replied with content {"JobId":277,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9723,"TaskRequeueCount":2,"UserProcessorTime":150,"WorkingSet":10916}]}
[08/23 09:41:57.727] 1613 info: Job 269, Task 999.999: EndJob: starting
[08/23 09:41:57.727] 1613 warning: Job 269, Task 999.999: EndJob: Job is already finished
[08/23 09:41:57.727] 1613 info: Job 269, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.727] 1613 info: Job 269, Task 999.999: EndJob: 5 jobs associated with the user root
[08/23 09:41:57.727] 1633 info: Replied with content null
[08/23 09:41:57.733] 1614 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.733] 1614 info: Job 269, Task 9810.2: Change requeue count from 0 to 2, processKey 8589944402
[08/23 09:41:57.733] 1640 info: Replied with content null
[08/23 09:41:57.739] 1635 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.739] 1635 info: Job 280, Task 10238.2: Change requeue count from 0 to 2, processKey 8589944830
[08/23 09:41:57.739] 1636 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:57.739] 1624 info: Replied with content null
[08/23 09:41:57.771] 1607 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.771] 1614 info: Job 267, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.783] 1612 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.784] 1610 info: Job 270, Task 10404.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.785] 6425 info: Job 281, Task 9731.1: Process 6739: exit code 137
[08/23 09:41:57.789] 1606 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.789] 1624 info: Job 277, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.804] 8037 info: Started reading pipe thread
[08/23 09:41:57.805] 8040 info: Started reading pipe thread
[08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":1,"UserProcessorTime":230,"WorkingSet":12876}]}
[08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:57.809] 1636 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":1,"UserProcessorTime":230,"WorkingSet":12876}]}
[08/23 09:41:57.809] 1630 info: Job 269, Task 9810.1: Change requeue count from 2 to 1, processKey 8589944402
[08/23 09:41:57.809] 1630 warning: Job 269, Task 9810.1: The task has started already.
[08/23 09:41:57.809] 1618 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:57.810] 1628 info: Replied with content null
[08/23 09:41:57.838] 1638 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.838] 1613 info: Job 267, Task 9291.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.857] 1617 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.857] 1626 info: Job 277, Task 9723.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.882] 1622 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:57.882] 1621 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:57.931] 6970 info: Job 283, Task 9724.2: Process 7185: exit code 137
[08/23 09:41:57.941] 1640 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:57.941] 1609 info: Job 281, Task 9731.3: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:57.956] 1618 info: Job 283, Task 999.999: EndJob: ended {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":2,"UserProcessorTime":190,"WorkingSet":11996}]}
[08/23 09:41:57.956] 1618 info: Job 283, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:57.956] 1618 info: Job 283, Task 999.999: EndJob: 5 jobs associated with the user root
[08/23 09:41:57.957] 1618 info: Replied with content {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":2,"UserProcessorTime":190,"WorkingSet":11996}]}
[08/23 09:41:57.962] 1634 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.962] 1634 info: Job 281, Task 9731.2: Change requeue count from 0 to 2, processKey 8589944323
[08/23 09:41:57.962] 1620 info: Job 280, Task 10238.1: Change requeue count from 2 to 1, processKey 8589944830
[08/23 09:41:57.962] 1620 warning: Job 280, Task 10238.1: The task has started already.
[08/23 09:41:57.962] 1627 info: Replied with content null
[08/23 09:41:57.964] 1620 info: Replied with content null
[08/23 09:41:57.969] 1619 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:57.969] 1641 info: Job 285, Task 999.999: EndJob: starting
[08/23 09:41:57.987] 7261 info: Job 285, Task 9991.2: Process 7441: exit code 137
[08/23 09:41:58.003] 8444 info: Started reading pipe thread
[08/23 09:41:58.011] 1641 info: Job 285, Task 999.999: EndJob: ended {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":30,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":2,"UserProcessorTime":290,"WorkingSet":14208}]}
[08/23 09:41:58.011] 1641 info: Job 285, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.011] 1641 info: Job 285, Task 999.999: EndJob: 6 jobs associated with the user root
[08/23 09:41:58.012] 1630 info: Replied with content {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":30,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":2,"UserProcessorTime":290,"WorkingSet":14208}]}
[08/23 09:41:58.012] 1623 info: Job 270, Task 999.999: EndJob: starting
[08/23 09:41:58.012] 1623 warning: Job 270, Task 999.999: EndJob: Job is already finished
[08/23 09:41:58.012] 1613 info: Replied with content null
[08/23 09:41:58.015] 1603 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.015] 1638 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.017] 1608 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.017] 1608 info: Job 270, Task 10404.2: Change requeue count from 0 to 2, processKey 8589944996
[08/23 09:41:58.017] 1628 info: Replied with content null
[08/23 09:41:58.017] 1614 info: Job 267, Task 999.999: EndJob: starting
[08/23 09:41:58.017] 1614 warning: Job 267, Task 999.999: EndJob: Job is already finished
[08/23 09:41:58.018] 1640 info: Replied with content null
[08/23 09:41:58.024] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.024] 1610 info: Job 270, Task 10404.3: Change requeue count from 2 to 3, processKey 8589944996
[08/23 09:41:58.024] 1610 warning: Job 270, Task 10404.3: The task has started already.
[08/23 09:41:58.024] 1608 info: Replied with content null
[08/23 09:41:58.030] 1616 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.030] 1616 info: Job 267, Task 9291.3: Change requeue count from 0 to 3, processKey 12884911179
[08/23 09:41:58.030] 1621 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:58.030] 1606 info: Replied with content null
[08/23 09:41:58.062] 8650 info: Started reading pipe thread
[08/23 09:41:58.070] 1630 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.070] 1609 info: Job 283, Task 9724.4: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.071] 8667 info: Started reading pipe thread
[08/23 09:41:58.097] 1633 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.097] 1634 info: Job 285, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.151] 1631 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.151] 1603 info: Job 285, Task 9991.4: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.152] 8290 info: Job 281, Task 9731.2: Process 8441: exit code 137
[08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":2,"UserProcessorTime":0,"WorkingSet":3424}]}
[08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: 7 jobs associated with the user root
[08/23 09:41:58.176] 1621 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":2,"UserProcessorTime":0,"WorkingSet":3424}]}
[08/23 09:41:58.181] 1615 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.181] 1615 info: Job 281, Task 9731.3: Change requeue count from 0 to 3, processKey 12884911619
[08/23 09:41:58.181] 1633 info: Replied with content null
[08/23 09:41:58.182] 1619 info: Job 283, Task 9724.3: Change requeue count from 0 to 3, processKey 12884911612
[08/23 09:41:58.183] 1627 info: Replied with content null
[08/23 09:41:58.189] 1625 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.189] 1625 info: Job 285, Task 9991.3: Change requeue count from 0 to 3, processKey 12884911879
[08/23 09:41:58.189] 1638 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:58.189] 1639 info: Replied with content null
[08/23 09:41:58.243] 1637 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.243] 1613 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.247] 9039 info: Started reading pipe thread
[08/23 09:41:58.249] 9042 info: Started reading pipe thread
[08/23 09:41:58.250] 9047 info: Started reading pipe thread
[08/23 09:41:58.252] 1638 info: Job 283, Task 999.999: EndJob: ended {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":3,"UserProcessorTime":0,"WorkingSet":0}]}
[08/23 09:41:58.252] 1638 info: Job 283, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.252] 1638 info: Job 283, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:58.252] 1638 info: Replied with content {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":3,"UserProcessorTime":0,"WorkingSet":0}]}
[08/23 09:41:58.252] 1624 info: Job 277, Task 999.999: EndJob: starting
[08/23 09:41:58.252] 1624 warning: Job 277, Task 999.999: EndJob: Job is already finished
[08/23 09:41:58.252] 1632 info: Replied with content null
[08/23 09:41:58.258] 1626 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.258] 1626 info: Job 277, Task 9723.3: Change requeue count from 0 to 3, processKey 12884911611
[08/23 09:41:58.258] 1626 info: Replied with content null
[08/23 09:41:58.265] 1620 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.265] 1620 info: Job 283, Task 9724.4: Change requeue count from 0 to 4, processKey 17179878908
[08/23 09:41:58.266] 1609 info: Replied with content null
[08/23 09:41:58.272] 1603 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.272] 1603 info: Job 285, Task 9991.4: Change requeue count from 3 to 4, processKey 12884911879
[08/23 09:41:58.272] 1603 warning: Job 285, Task 9991.4: The task has started already.
[08/23 09:41:58.272] 1603 info: Replied with content null
[08/23 09:41:58.272] 1613 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:58.299] 1627 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.299] 1622 info: Job 281, Task 9731.4: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.307] 9234 info: Started reading pipe thread
[08/23 09:41:58.308] 9245 info: Started reading pipe thread
[08/23 09:41:58.321] 1617 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.321] 1635 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.375] 1630 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.375] 1606 info: Job 283, Task 9724.5: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.392] 8825 info: Job 281, Task 9731.3: Process 9040: exit code 137
[08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":3,"UserProcessorTime":10,"WorkingSet":3316}]}
[08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:58.416] 1613 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":3,"UserProcessorTime":10,"WorkingSet":3316}]}
[08/23 09:41:58.416] 1634 info: Job 285, Task 999.999: EndJob: starting
[08/23 09:41:58.435] 8850 info: Job 285, Task 9991.3: Process 9044: exit code 137
[08/23 09:41:58.460] 1634 info: Job 285, Task 999.999: EndJob: ended {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":4,"UserProcessorTime":120,"WorkingSet":10456}]}
[08/23 09:41:58.460] 1634 info: Job 285, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.460] 1634 info: Job 285, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:58.460] 1618 info: Replied with content {"JobId":285,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9991,"TaskRequeueCount":4,"UserProcessorTime":120,"WorkingSet":10456}]}
[08/23 09:41:58.465] 1622 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.465] 1622 info: Job 281, Task 9731.4: Change requeue count from 0 to 4, processKey 17179878915
[08/23 09:41:58.465] 1622 info: Replied with content null
[08/23 09:41:58.472] 1606 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.472] 1606 info: Job 283, Task 9724.5: Change requeue count from 4 to 5, processKey 17179878908
[08/23 09:41:58.472] 1606 warning: Job 283, Task 9724.5: The task has started already.
[08/23 09:41:58.473] 1635 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:58.473] 1636 info: Replied with content null
[08/23 09:41:58.477] 1607 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.477] 1617 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.525] 1625 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.525] 1637 info: Job 285, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.536] 1630 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.536] 1623 info: Job 281, Task 9731.5: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.556] 9641 info: Started reading pipe thread
[08/23 09:41:58.585] 1629 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.585] 1610 info: Job 285, Task 9991.5: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.643] 9112 info: Job 283, Task 9724.4: Process 9243: exit code 137
[08/23 09:41:58.666] 1635 info: Job 283, Task 999.999: EndJob: ended {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":5,"UserProcessorTime":90,"WorkingSet":9984}]}
[08/23 09:41:58.666] 1635 info: Job 283, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.666] 1635 info: Job 283, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:58.666] 1635 info: Replied with content {"JobId":283,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9724,"TaskRequeueCount":5,"UserProcessorTime":90,"WorkingSet":9984}]}
[08/23 09:41:58.666] 1617 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:58.682] 9526 info: Job 281, Task 9731.4: Process 9638: exit code 137
[08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":4,"UserProcessorTime":40,"WorkingSet":6996}]}
[08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: 7 jobs associated with the user root
[08/23 09:41:58.705] 1637 info: Job 285, Task 999.999: EndJob: starting
[08/23 09:41:58.705] 1637 warning: Job 285, Task 999.999: EndJob: Job is already finished
[08/23 09:41:58.705] 1603 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":4,"UserProcessorTime":40,"WorkingSet":6996}]}
[08/23 09:41:58.705] 1637 info: Replied with content null
[08/23 09:41:58.710] 1623 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.710] 1623 info: Job 281, Task 9731.5: Change requeue count from 0 to 5, processKey 21474846211
[08/23 09:41:58.711] 1604 info: Replied with content null
[08/23 09:41:58.716] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:58.716] 1610 info: Job 285, Task 9991.5: Change requeue count from 0 to 5, processKey 21474846471
[08/23 09:41:58.716] 1606 info: Replied with content null
[08/23 09:41:58.733] 1613 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.733] 1608 info: Job 283, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.733] 1608 info: Job 283, Task 999.999: EndJob: starting
[08/23 09:41:58.733] 1608 warning: Job 283, Task 999.999: EndJob: Job is already finished
[08/23 09:41:58.733] 1608 info: Replied with content null
[08/23 09:41:58.771] 1613 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:58.772] 1636 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:58.772] 1636 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:58.792] 1613 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.792] 1612 info: Job 283, Task 9724.6: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:58.795] 10064 info: Started reading pipe thread
[08/23 09:41:58.795] 10069 info: Started reading pipe thread
[08/23 09:41:58.862] 1619 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:58.862] 1623 info: Job 281, Task 9731.6: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.026] 1636 warning: Job 281, Task 9731.5: The task didn't exit within 1s, process Ids 10113 10132 10135 10138 10160 10207 10208 10211 10225 10242 10450 10451
[08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":0,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":12,"PrimaryTask":1,"ProcessIds":"10113,10132,10135,10138,10160,10207,10208,10211,10225,10242,10450,10451","TaskId":9731,"TaskRequeueCount":5,"UserProcessorTime":170,"WorkingSet":11488}]}
[08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: 8 jobs associated with the user root
[08/23 09:41:59.027] 1628 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":0,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":12,"PrimaryTask":1,"ProcessIds":"10113,10132,10135,10138,10160,10207,10208,10211,10225,10242,10450,10451","TaskId":9731,"TaskRequeueCount":5,"UserProcessorTime":170,"WorkingSet":11488}]}
[08/23 09:41:59.031] 1629 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.031] 1629 info: Job 283, Task 9724.6: Change requeue count from 0 to 6, processKey 25769813500
[08/23 09:41:59.031] 1629 info: Replied with content null
[08/23 09:41:59.036] 1623 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.036] 1623 info: Job 281, Task 9731.6: Change requeue count from 0 to 6, processKey 25769813507
[08/23 09:41:59.036] 1623 info: Replied with content null
[08/23 09:41:59.097] 10549 info: Started reading pipe thread
[08/23 09:41:59.098] 10551 info: Started reading pipe thread
[08/23 09:41:59.112] 1609 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:59.112] 1602 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:59.112] 1602 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:59.167] 1609 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.168] 1613 info: Job 281, Task 9731.7: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.231] 10469 info: Job 281, Task 9731.6: Process 10550: exit code 137
[08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":6,"UserProcessorTime":0,"WorkingSet":3404}]}
[08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:59.256] 1629 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":6,"UserProcessorTime":0,"WorkingSet":3404}]}
[08/23 09:41:59.260] 1633 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.260] 1633 info: Job 281, Task 9731.7: Change requeue count from 0 to 7, processKey 30064780803
[08/23 09:41:59.260] 1628 info: Replied with content null
[08/23 09:41:59.328] 10781 info: Started reading pipe thread
[08/23 09:41:59.334] 1637 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:59.334] 1641 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:59.334] 1641 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:59.356] 10691 info: Job 281, Task 9731.7: Process 10780: exit code 137
[08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":7,"UserProcessorTime":0,"WorkingSet":2560}]}
[08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:59.373] 1641 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":7,"UserProcessorTime":0,"WorkingSet":2560}]}
[08/23 09:41:59.422] 1609 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.422] 1615 info: Job 281, Task 9731.8: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.428] 1603 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.428] 1603 info: Job 281, Task 9731.8: Change requeue count from 0 to 8, processKey 34359748099
[08/23 09:41:59.428] 1603 info: Replied with content null
[08/23 09:41:59.433] 1619 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:59.434] 1634 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:59.434] 1634 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:59.489] 11053 info: Started reading pipe thread
[08/23 09:41:59.492] 1606 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.492] 1607 info: Job 281, Task 9731.9: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":8,"UserProcessorTime":0,"WorkingSet":0}]}
[08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:59.494] 1634 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":8,"UserProcessorTime":0,"WorkingSet":0}]}
[08/23 09:41:59.499] 1633 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.499] 1633 info: Job 281, Task 9731.9: Change requeue count from 0 to 9, processKey 38654715395
[08/23 09:41:59.499] 1603 info: Replied with content null
[08/23 09:41:59.538] 11131 info: Started reading pipe thread
[08/23 09:41:59.563] 1641 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:59.563] 1620 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:59.563] 1620 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:59.619] 1606 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.619] 1628 info: Job 281, Task 9731.10: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.681] 11072 info: Job 281, Task 9731.9: Process 11130: exit code 137
[08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":9,"UserProcessorTime":0,"WorkingSet":3404}]}
[08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:59.704] 1620 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":9,"UserProcessorTime":0,"WorkingSet":3404}]}
[08/23 09:41:59.709] 1628 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.709] 1628 info: Job 281, Task 9731.10: Change requeue count from 0 to 10, processKey 42949682691
[08/23 09:41:59.709] 1619 info: Replied with content null
[08/23 09:41:59.772] 11350 info: Started reading pipe thread
[08/23 09:41:59.786] 1604 info: Request: Uri /api/IP-172-19-3-105/endjob
[08/23 09:41:59.786] 1622 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
[08/23 09:41:59.786] 1622 info: Job 281, Task 999.999: EndJob: starting
[08/23 09:41:59.905] 11259 info: Job 281, Task 9731.10: Process 11349: exit code 137
[08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":10,"UserProcessorTime":10,"WorkingSet":3404}]}
[08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: Cleanup user root
[08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
[08/23 09:41:59.927] 1622 info: Replied with content {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":10,"UserProcessorTime":10,"WorkingSet":3404}]}
[08/23 09:41:59.928] 1607 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.929] 1629 info: Job 267, Task 9294.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.934] 1641 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.934] 1641 info: Job 267, Task 9294.0: Change requeue count from 0 to 0, processKey 9294
[08/23 09:41:59.934] 1632 info: Replied with content null
[08/23 09:41:59.935] 1630 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.935] 1620 info: Job 267, Task 9297.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.940] 1611 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.940] 1611 info: Job 267, Task 9297.0: Change requeue count from 0 to 0, processKey 9297
[08/23 09:41:59.942] 1622 info: Replied with content null
[08/23 09:41:59.949] 1635 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.949] 1632 info: Job 267, Task 9300.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.954] 1632 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.954] 1632 info: Job 267, Task 9300.0: Change requeue count from 0 to 0, processKey 9300
[08/23 09:41:59.954] 1632 info: Replied with content null
[08/23 09:41:59.969] 1629 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.969] 1622 info: Job 269, Task 9820.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.972] 1620 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.972] 1614 info: Job 269, Task 9825.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.974] 1613 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.974] 1613 info: Job 269, Task 9820.0: Change requeue count from 0 to 0, processKey 9820
[08/23 09:41:59.975] 1619 info: Replied with content null
[08/23 09:41:59.977] 1618 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:41:59.977] 1606 info: Job 269, Task 9815.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:41:59.979] 1636 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.979] 1636 info: Job 269, Task 9825.0: Change requeue count from 0 to 0, processKey 9825
[08/23 09:41:59.980] 1610 info: Replied with content null
[08/23 09:41:59.984] 1606 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:41:59.984] 1606 info: Job 269, Task 9815.0: Change requeue count from 0 to 0, processKey 9815
[08/23 09:41:59.985] 1613 info: Replied with content null
[08/23 09:42:00.012] 1616 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.012] 1603 info: Job 283, Task 9727.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.018] 1629 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.018] 1629 info: Job 283, Task 9727.0: Change requeue count from 0 to 0, processKey 9727
[08/23 09:42:00.018] 1629 info: Replied with content null
[08/23 09:42:00.031] 11790 info: Started reading pipe thread
[08/23 09:42:00.037] 11827 info: Started reading pipe thread
[08/23 09:42:00.038] 11831 info: Started reading pipe thread
[08/23 09:42:00.039] 11836 info: Started reading pipe thread
[08/23 09:42:00.040] 11845 info: Started reading pipe thread
[08/23 09:42:00.040] 11848 info: Started reading pipe thread
[08/23 09:42:00.059] 1615 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.060] 1605 info: Job 283, Task 9736.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.067] 1605 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.067] 1605 info: Job 283, Task 9736.0: Change requeue count from 0 to 0, processKey 9736
[08/23 09:42:00.067] 1621 info: Replied with content null
[08/23 09:42:00.069] 11995 info: Started reading pipe thread
[08/23 09:42:00.071] 1619 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.071] 1607 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.071] 1620 info: Job 283, Task 9726.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.071] 1614 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.073] 1616 info: Job 268, Task 9599.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.073] 1604 info: Job 285, Task 9994.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.078] 1607 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.078] 1607 info: Job 283, Task 9726.0: Change requeue count from 0 to 0, processKey 9726
[08/23 09:42:00.078] 1629 info: Replied with content null
[08/23 09:42:00.080] 1609 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.081] 1634 info: Job 277, Task 9728.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.086] 1612 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.087] 1622 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.088] 1631 info: Job 277, Task 9725.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.092] 1637 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.092] 1637 info: Job 285, Task 9994.0: Change requeue count from 0 to 0, processKey 9994
[08/23 09:42:00.092] 1637 info: Replied with content null
[08/23 09:42:00.092] 1612 info: Job 268, Task 9599.0: Change requeue count from 0 to 0, processKey 9599
[08/23 09:42:00.094] 1629 info: Replied with content null
[08/23 09:42:00.120] 12169 info: Started reading pipe thread
[08/23 09:42:00.121] 12172 info: Started reading pipe thread
[08/23 09:42:00.128] 1631 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.128] 1631 info: Job 277, Task 9725.0: Change requeue count from 0 to 0, processKey 9725
[08/23 09:42:00.129] 1621 info: Replied with content null
[08/23 09:42:00.134] 1633 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.134] 1615 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.134] 1633 info: Job 285, Task 9993.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.134] 1626 info: Job 285, Task 9992.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.136] 1630 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.136] 1630 info: Job 277, Task 9728.0: Change requeue count from 0 to 0, processKey 9728
[08/23 09:42:00.138] 1625 info: Replied with content null
[08/23 09:42:00.139] 1602 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.140] 1636 info: Job 268, Task 9598.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.142] 1614 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.142] 1610 info: Job 268, Task 9600.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.145] 1622 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.145] 1622 info: Job 285, Task 9992.0: Change requeue count from 0 to 0, processKey 9992
[08/23 09:42:00.146] 1638 info: Replied with content null
[08/23 09:42:00.147] 1619 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.150] 1627 info: Job 277, Task 9729.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.151] 1635 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.151] 1635 info: Job 268, Task 9598.0: Change requeue count from 0 to 0, processKey 9598
[08/23 09:42:00.152] 1639 info: Replied with content null
[08/23 09:42:00.158] 1633 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.158] 1633 info: Job 285, Task 9993.0: Change requeue count from 0 to 0, processKey 9993
[08/23 09:42:00.159] 1623 info: Replied with content null
[08/23 09:42:00.166] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.166] 1610 info: Job 268, Task 9600.0: Change requeue count from 0 to 0, processKey 9600
[08/23 09:42:00.167] 1631 info: Replied with content null
[08/23 09:42:00.171] 12443 info: Started reading pipe thread
[08/23 09:42:00.172] 12447 info: Started reading pipe thread
[08/23 09:42:00.177] 1602 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.177] 12482 info: Started reading pipe thread
[08/23 09:42:00.177] 1602 info: Job 277, Task 9729.0: Change requeue count from 0 to 0, processKey 9729
[08/23 09:42:00.178] 1613 info: Replied with content null
[08/23 09:42:00.183] 1603 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.183] 1618 info: Job 276, Task 10121.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.189] 1604 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.189] 1632 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.190] 1611 info: Job 276, Task 10120.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.190] 1616 info: Job 276, Task 10122.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.192] 1610 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.192] 1610 info: Job 276, Task 10121.0: Change requeue count from 0 to 0, processKey 10121
[08/23 09:42:00.193] 1641 info: Replied with content null
[08/23 09:42:00.198] 12640 info: Started reading pipe thread
[08/23 09:42:00.198] 1631 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.199] 1626 info: Job 280, Task 10240.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.203] 12672 info: Started reading pipe thread
[08/23 09:42:00.203] 1602 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.203] 1602 info: Job 276, Task 10122.0: Change requeue count from 0 to 0, processKey 10122
[08/23 09:42:00.204] 1632 info: Replied with content null
[08/23 09:42:00.211] 12723 info: Started reading pipe thread
[08/23 09:42:00.211] 1626 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.212] 1626 info: Job 280, Task 10240.0: Change requeue count from 0 to 0, processKey 10240
[08/23 09:42:00.212] 1623 info: Replied with content null
[08/23 09:42:00.220] 1624 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.220] 1604 info: Job 280, Task 10239.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.221] 1629 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.223] 12804 info: Started reading pipe thread
[08/23 09:42:00.228] 12846 info: Started reading pipe thread
[08/23 09:42:00.230] 1638 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.230] 1610 info: Job 284, Task 10444.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.234] 1604 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.234] 1604 info: Job 280, Task 10239.0: Change requeue count from 0 to 0, processKey 10239
[08/23 09:42:00.234] 1629 info: Job 276, Task 10120.0: Change requeue count from 0 to 0, processKey 10120
[08/23 09:42:00.234] 1622 info: Replied with content null
[08/23 09:42:00.238] 1621 info: Replied with content null
[08/23 09:42:00.244] 12945 info: Started reading pipe thread
[08/23 09:42:00.244] 1628 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.245] 1604 info: Job 284, Task 10445.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.250] 1636 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.250] 1613 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.252] 1623 info: Job 270, Task 10405.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.255] 13019 info: Started reading pipe thread
[08/23 09:42:00.260] 1604 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.260] 1604 info: Job 284, Task 10445.0: Change requeue count from 0 to 0, processKey 10445
[08/23 09:42:00.261] 1630 info: Replied with content null
[08/23 09:42:00.265] 1626 info: Request: Uri /api/IP-172-19-3-105/startjobandtask
[08/23 09:42:00.266] 1640 info: Job 270, Task 10406.0: filters/OnJobTaskStart.sh not detected, skip
[08/23 09:42:00.270] 1637 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.270] 1613 info: Job 284, Task 10444.0: Change requeue count from 0 to 0, processKey 10444
[08/23 09:42:00.270] 13115 info: Started reading pipe thread
[08/23 09:42:00.273] 1636 info: Replied with content null
[08/23 09:42:00.282] 13184 info: Started reading pipe thread
[08/23 09:42:00.282] 1641 info: File /root/.ssh/id_rsa exist, skip overwriting
[08/23 09:42:00.282] 1637 info: Job 270, Task 10405.0: Change requeue count from 0 to 0, processKey 10405
[08/23 09:42:00.282] 1641 info: Job 270, Task 10406.0: Change requeue count from 0 to 0, processKey 10406
[08/23 09:42:00.283] 1634 info: Replied with content null
[08/23 09:42:00.285] 1624 info: Replied with content null
[08/23 09:42:00.299] 13298 info: Started reading pipe thread
[08/23 09:42:00.300] 13302 info: Started reading pipe thread
[08/23 09:42:00.327] 13444 info: Started reading pipe thread
[08/23 09:42:00.336] 13482 info: Started reading pipe thread
[08/23 09:42:00.345] 13529 info: Started reading pipe thread
[08/23 09:42:00.347] 13543 info: Started reading pipe thread
[08/23 09:42:23.471] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:42:23.471] 1645 info: Update tasks' statistics.
[08/23 09:42:23.934] 1643 warning: Failed to query Azure node metadata. Status code: 404. Remaining retry count: 3.
[08/23 09:42:54.260] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:42:54.260] 1645 info: Update tasks' statistics.
[08/23 09:42:54.568] 1643 warning: Failed to query Azure node metadata. Status code: 404. Remaining retry count: 2.
[08/23 09:43:25.047] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:43:25.047] 1645 info: Update tasks' statistics.
[08/23 09:43:25.210] 1643 warning: Failed to query Azure node metadata. Status code: 404. Remaining retry count: 1.
[08/23 09:43:55.840] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:43:55.840] 1645 info: Update tasks' statistics.
[08/23 09:43:55.867] 1643 warning: Failed to query Azure node metadata. Status code: 404. Remaining retry count: 0.
[08/23 09:44:26.686] 1645 info: ResolveServiceLocation> Resolved serviceLocation IP-AC130210 for SchedulerStatefulService
[08/23 09:44:26.686] 1645 info: Update tasks' statistics.
[08/23 09:44:30.642] 11592 info: Job 269, Task 9815.0: Process 11834: exit code 0
[08/23 09:44:33.365] 12088 info: Job 285, Task 9994.0: Process 12440: exit code 0

Job 281 failed with "Job failed to start on some nodes or some nodes became unreachable" eventhough some of its tasks have completed successfully on the node IP-172-19-3-105. Here is the same log filtered for Job 281. What is unusual here, is that task 9731 was killed quite a few times but retry Task 9731.5 and Task 9731.8 have succeeded. The log also suggests that those two were running in parallel at the same time which for tasks that depend on modifying some state will lead to corruption of that state.

Line 104: [08/23 09:41:54.192] 1630 info: Job 281, Task 9731.0: filters/OnJobTaskStart.sh not detected, skip
Line 111: [08/23 09:41:54.203] 1637 info: Job 281, Task 9731.0: Change requeue count from 0 to 0, processKey 9731
Line 197: [08/23 09:41:56.162] 1631 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 242: [08/23 09:41:56.474] 1626 info: Job 281, Task 9731.1: filters/OnJobTaskStart.sh not detected, skip
Line 341: [08/23 09:41:57.227] 1631 info: Job 281, Task 999.999: EndJob: starting
Line 343: [08/23 09:41:57.244] 2774 info: Job 281, Task 9731.0: Process 3027: exit code 137
Line 344: [08/23 09:41:57.267] 1631 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":40,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":0,"UserProcessorTime":690,"WorkingSet":25192}]}
Line 345: [08/23 09:41:57.267] 1631 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 346: [08/23 09:41:57.268] 1631 info: Job 281, Task 999.999: EndJob: 2 jobs associated with the user root
Line 350: [08/23 09:41:57.342] 1636 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 372: [08/23 09:41:57.428] 1634 info: Job 281, Task 9731.2: filters/OnJobTaskStart.sh not detected, skip
Line 374: [08/23 09:41:57.431] 1616 info: Job 281, Task 9731.1: Change requeue count from 0 to 1, processKey 4294977027
Line 489: [08/23 09:41:57.739] 1636 info: Job 281, Task 999.999: EndJob: starting
Line 495: [08/23 09:41:57.785] 6425 info: Job 281, Task 9731.1: Process 6739: exit code 137
Line 500: [08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":20,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":1,"UserProcessorTime":230,"WorkingSet":12876}]}
Line 501: [08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 502: [08/23 09:41:57.809] 1636 info: Job 281, Task 999.999: EndJob: 6 jobs associated with the user root
Line 513: [08/23 09:41:57.882] 1621 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 516: [08/23 09:41:57.941] 1609 info: Job 281, Task 9731.3: filters/OnJobTaskStart.sh not detected, skip
Line 522: [08/23 09:41:57.962] 1634 info: Job 281, Task 9731.2: Change requeue count from 0 to 2, processKey 8589944323
Line 552: [08/23 09:41:58.030] 1621 info: Job 281, Task 999.999: EndJob: starting
Line 562: [08/23 09:41:58.152] 8290 info: Job 281, Task 9731.2: Process 8441: exit code 137
Line 563: [08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":2,"UserProcessorTime":0,"WorkingSet":3424}]}
Line 564: [08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 565: [08/23 09:41:58.176] 1621 info: Job 281, Task 999.999: EndJob: 7 jobs associated with the user root
Line 568: [08/23 09:41:58.181] 1615 info: Job 281, Task 9731.3: Change requeue count from 0 to 3, processKey 12884911619
Line 577: [08/23 09:41:58.243] 1613 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 598: [08/23 09:41:58.272] 1613 info: Job 281, Task 999.999: EndJob: starting
Line 600: [08/23 09:41:58.299] 1622 info: Job 281, Task 9731.4: filters/OnJobTaskStart.sh not detected, skip
Line 607: [08/23 09:41:58.392] 8825 info: Job 281, Task 9731.3: Process 9040: exit code 137
Line 608: [08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":3,"UserProcessorTime":10,"WorkingSet":3316}]}
Line 609: [08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 610: [08/23 09:41:58.416] 1613 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 619: [08/23 09:41:58.465] 1622 info: Job 281, Task 9731.4: Change requeue count from 0 to 4, processKey 17179878915
Line 627: [08/23 09:41:58.477] 1617 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 631: [08/23 09:41:58.536] 1623 info: Job 281, Task 9731.5: filters/OnJobTaskStart.sh not detected, skip
Line 640: [08/23 09:41:58.666] 1617 info: Job 281, Task 999.999: EndJob: starting
Line 641: [08/23 09:41:58.682] 9526 info: Job 281, Task 9731.4: Process 9638: exit code 137
Line 642: [08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":4,"UserProcessorTime":40,"WorkingSet":6996}]}
Line 643: [08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 644: [08/23 09:41:58.705] 1617 info: Job 281, Task 999.999: EndJob: 7 jobs associated with the user root
Line 650: [08/23 09:41:58.710] 1623 info: Job 281, Task 9731.5: Change requeue count from 0 to 5, processKey 21474846211
Line 661: [08/23 09:41:58.772] 1636 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 662: [08/23 09:41:58.772] 1636 info: Job 281, Task 999.999: EndJob: starting
Line 668: [08/23 09:41:58.862] 1623 info: Job 281, Task 9731.6: filters/OnJobTaskStart.sh not detected, skip
Line 669: [08/23 09:41:59.026] 1636 warning: Job 281, Task 9731.5: The task didn't exit within 1s, process Ids 10113 10132 10135 10138 10160 10207 10208 10211 10225 10242 10450 10451
Line 670: [08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":0,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":12,"PrimaryTask":1,"ProcessIds":"10113,10132,10135,10138,10160,10207,10208,10211,10225,10242,10450,10451","TaskId":9731,"TaskRequeueCount":5,"UserProcessorTime":170,"WorkingSet":11488}]}
Line 671: [08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 672: [08/23 09:41:59.026] 1636 info: Job 281, Task 999.999: EndJob: 8 jobs associated with the user root
Line 678: [08/23 09:41:59.036] 1623 info: Job 281, Task 9731.6: Change requeue count from 0 to 6, processKey 25769813507
Line 683: [08/23 09:41:59.112] 1602 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 684: [08/23 09:41:59.112] 1602 info: Job 281, Task 999.999: EndJob: starting
Line 686: [08/23 09:41:59.168] 1613 info: Job 281, Task 9731.7: filters/OnJobTaskStart.sh not detected, skip
Line 687: [08/23 09:41:59.231] 10469 info: Job 281, Task 9731.6: Process 10550: exit code 137
Line 688: [08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":6,"UserProcessorTime":0,"WorkingSet":3404}]}
Line 689: [08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 690: [08/23 09:41:59.254] 1602 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 693: [08/23 09:41:59.260] 1633 info: Job 281, Task 9731.7: Change requeue count from 0 to 7, processKey 30064780803
Line 697: [08/23 09:41:59.334] 1641 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 698: [08/23 09:41:59.334] 1641 info: Job 281, Task 999.999: EndJob: starting
Line 699: [08/23 09:41:59.356] 10691 info: Job 281, Task 9731.7: Process 10780: exit code 137
Line 700: [08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":7,"UserProcessorTime":0,"WorkingSet":2560}]}
Line 701: [08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 702: [08/23 09:41:59.373] 1641 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 705: [08/23 09:41:59.422] 1615 info: Job 281, Task 9731.8: filters/OnJobTaskStart.sh not detected, skip
Line 707: [08/23 09:41:59.428] 1603 info: Job 281, Task 9731.8: Change requeue count from 0 to 8, processKey 34359748099
Line 710: [08/23 09:41:59.434] 1634 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 711: [08/23 09:41:59.434] 1634 info: Job 281, Task 999.999: EndJob: starting
Line 714: [08/23 09:41:59.492] 1607 info: Job 281, Task 9731.9: filters/OnJobTaskStart.sh not detected, skip
Line 715: [08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":8,"UserProcessorTime":0,"WorkingSet":0}]}
Line 716: [08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 717: [08/23 09:41:59.494] 1634 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 720: [08/23 09:41:59.499] 1633 info: Job 281, Task 9731.9: Change requeue count from 0 to 9, processKey 38654715395
Line 724: [08/23 09:41:59.563] 1620 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 725: [08/23 09:41:59.563] 1620 info: Job 281, Task 999.999: EndJob: starting
Line 727: [08/23 09:41:59.619] 1628 info: Job 281, Task 9731.10: filters/OnJobTaskStart.sh not detected, skip
Line 728: [08/23 09:41:59.681] 11072 info: Job 281, Task 9731.9: Process 11130: exit code 137
Line 729: [08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":10,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":9,"UserProcessorTime":0,"WorkingSet":3404}]}
Line 730: [08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 731: [08/23 09:41:59.704] 1620 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 734: [08/23 09:41:59.709] 1628 info: Job 281, Task 9731.10: Change requeue count from 0 to 10, processKey 42949682691
Line 738: [08/23 09:41:59.786] 1622 info: Job 281, Task 0.0: filters/OnJobEnd.sh not detected, skip
Line 739: [08/23 09:41:59.786] 1622 info: Job 281, Task 999.999: EndJob: starting
Line 740: [08/23 09:41:59.905] 11259 info: Job 281, Task 9731.10: Process 11349: exit code 137
Line 741: [08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: ended {"JobId":281,"Tasks":[{"ExitCode":170,"Exited":1,"KernelProcessorTime":0,"Message":"","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":9731,"TaskRequeueCount":10,"UserProcessorTime":10,"WorkingSet":3404}]}
Line 742: [08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: Cleanup user root
Line 743: [08/23 09:41:59.927] 1622 info: Job 281, Task 999.999: EndJob: 9 jobs associated with the user root
Line 1885: [08/23 09:46:39.433] 9904 info: Job 281, Task 9731.5: Process 10063: exit code 0
Line 1892: [08/23 09:46:39.557] 1618 info: Job 281, Task 9731.5: Callback to https://IP-AC130210:443/HpcLinux/api/IP-172-19-3-105/taskcompleted response code 200
Line 1901: [08/23 09:46:40.763] 10958 info: Job 281, Task 9731.8: Process 11052: exit code 0

Job 280, parametric sweep 1.1 - 1.85, task 1.1 executed successfully and according to the log, the headnode was notified that it ran successfuly, and yet HPC reported it as running for 2 hours before cancelling the task. The compute node logs shows that the task was started, then killed, then started again, however the requeue count is incorrect which leads to incorrect state on the headnode:

Line 120: [08/23 09:41:54.255] 1624 info: Job 280, Task 10238.0: filters/OnJobTaskStart.sh not detected, skip
Line 131: [08/23 09:41:54.272] 1602 info: Job 280, Task 10238.0: Change requeue count from 0 to 0, processKey 10238
Line 249: [08/23 09:41:56.529] 1624 info: Job 280, Task 10238.1: filters/OnJobTaskStart.sh not detected, skip
Line 297: [08/23 09:41:56.888] 3129 info: Job 280, Task 10238.0: Process 3362: exit code 137
Line 325: [08/23 09:41:57.081] 1639 info: Job 280, Task 10238.2: filters/OnJobTaskStart.sh not detected, skip
Line 488: [08/23 09:41:57.739] 1635 info: Job 280, Task 10238.2: Change requeue count from 0 to 2, processKey 8589944830
Line 523: [08/23 09:41:57.962] 1620 info: Job 280, Task 10238.1: Change requeue count from 2 to 1, processKey 8589944830
Line 524: [08/23 09:41:57.962] 1620 warning: Job 280, Task 10238.1: The task has started already.
Line 1252: [08/23 09:45:48.745] 7845 info: Job 280, Task 10238.2: Process 8034: exit code 0
Line 1254: [08/23 09:45:48.836] 1635 info: Job 280, Task 10238.1: Callback to https://IP-AC130210:443/HpcLinux/api/IP-172-19-3-105/taskcompleted response code 200

Requeue #1 isn't what the headnode state is, it is expecting requeue count #2 to complete or fail and therefore ignores the message and leaves the task state as running. The compute node will never receive another message for this task from the headnode.

Line 116693: 08/23/202109:45:47.612vHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [42] MonitorThread successfully run task monitor.  
Line 116787: 08/23/202109:45:47.973vHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [42] MonitorThread successfully run task monitor.  
Line 117246: 08/23/202109:45:48.350vHpcScheduler25488740[RC] Job 280, Task 10238, Resource 0, JobMonitor.SendEvent FinishTask for job 280 currentMonitorTransaction is null ? True threadid 49. Ignore = False  
Line 117253: 08/23/202109:45:48.350vHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, Task 280.10238 processevent FinishTask  
Line 117254: 08/23/202109:45:48.350iHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [TaskMonitor.OnFinish] Task finish event with taskRequeueCount 1 for task 280.10238  
Line 117255: 08/23/202109:45:48.350iHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [TaskMonitor.OnFinish] Task finish event with taskRequeueCount 1 is not for the current running task 280.10238 and will be ignored.  
Line 117256: 08/23/202109:45:48.350vHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [42] MonitorThread successfully run task monitor.  
Line 117381: 08/23/202109:45:48.855vHpcScheduler25485144[RC] Job 280, Task 10238, Resource 0, [42] MonitorThread successfully run task monitor.  
dterziev commented 3 years ago

After reaching out to the HPC Product Group via the enterprise support channels, we got a private fix for this issue.