NearNodeFlash / NearNodeFlash.github.io


Dws operator crashes on 500 workflows #41

Closed roehrich-hpe closed 1 year ago

roehrich-hpe commented 1 year ago

The dws-operator-controller-manager pod went into CrashLoopBackOff when 500 workflows were created. It was killed by the OOM killer. This was reported by @behlendorf.

The deployment specifies a 30Mi memory resource limit.
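
For reference, the current limit can be read directly from the live deployment (a sketch; the namespace and deployment name match the kubectl patch command shown later in this issue):

```
# Print the manager container's memory limit; with the shipped manifest this shows 30Mi.
kubectl get deploy -n dws-operator-system dws-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```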

roehrich-hpe commented 1 year ago

On my cluster I had to lower the resource limit from 30Mi to 20Mi to reproduce the problem. This is the kernel's OOM log:


```
172.30.223.193: kern: warning: [2023-04-14T15:37:18.976992172Z]: manager invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=999
172.30.223.193: kern: warning: [2023-04-14T15:37:19.081299172Z]: CPU: 30 PID: 61398 Comm: manager Not tainted 5.15.92-talos #1
172.30.223.193: kern: warning: [2023-04-14T15:37:19.163644172Z]: Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0010.C0001.010620200716 01/06/2020
172.30.223.193: kern: warning: [2023-04-14T15:37:19.295923172Z]: Call Trace:
172.30.223.193: kern: warning: [2023-04-14T15:37:19.326289172Z]:  <TASK>
172.30.223.193: kern: warning: [2023-04-14T15:37:19.352493172Z]:  dump_stack_lvl+0x45/0x5b
172.30.223.193: kern: warning: [2023-04-14T15:37:19.397418172Z]:  dump_header+0x4a/0x1f5
172.30.223.193: kern: warning: [2023-04-14T15:37:19.440261172Z]:  oom_kill_process.cold+0xb/0x10
172.30.223.193: kern: warning: [2023-04-14T15:37:19.491424172Z]:  out_of_memory+0x27a/0x510
172.30.223.193: kern: warning: [2023-04-14T15:37:19.537389172Z]:  mem_cgroup_out_of_memory+0x138/0x150
172.30.223.193: kern: warning: [2023-04-14T15:37:19.594793172Z]:  try_charge_memcg+0x70c/0x7c0
172.30.223.193: kern: warning: [2023-04-14T15:37:19.643873172Z]:  charge_memcg+0x25/0xa0
172.30.223.193: kern: warning: [2023-04-14T15:37:19.686720172Z]:  __mem_cgroup_charge+0x28/0x80
172.30.223.193: kern: warning: [2023-04-14T15:37:19.745058172Z]:  __handle_mm_fault+0x583/0xc00
172.30.223.193: kern: warning: [2023-04-14T15:37:19.795182172Z]:  handle_mm_fault+0xcb/0x2b0
172.30.223.193: kern: warning: [2023-04-14T15:37:19.842184172Z]:  do_user_addr_fault+0x1b1/0x640
172.30.223.193: kern: warning: [2023-04-14T15:37:19.893349172Z]:  exc_page_fault+0x67/0x130
172.30.223.193: kern: warning: [2023-04-14T15:37:19.939311172Z]:  asm_exc_page_fault+0x22/0x30
172.30.223.193: kern: warning: [2023-04-14T15:37:19.988397172Z]: RIP: 0033:0x46a672
172.30.223.193: kern: warning: [2023-04-14T15:37:20.026038172Z]: Code: 00 01 00 00 48 81 c7 00 01 00 00 48 81 fb 00 01 00 00 0f 83 6e ff ff ff e9 e1 fe ff ff c5 f9 ef c0 48 81 fb 00 00 00 02 73 6f <c5> fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 47 40 c5 fe 7f 47 60 48 81 eb
172.30.223.193: kern: warning: [2023-04-14T15:37:20.252010172Z]: RSP: 002b:000000c000e4b8f0 EFLAGS: 00010287
172.30.223.193: kern: warning: [2023-04-14T15:37:20.315654172Z]: RAX: 0000000000000000 RBX: 0000000000000400 RCX: 0000000000000001
172.30.223.193: kern: warning: [2023-04-14T15:37:20.402172172Z]: RDX: 000000c00101f000 RSI: 000000c00050c800 RDI: 000000c00101f000
172.30.223.193: kern: warning: [2023-04-14T15:37:20.488695172Z]: RBP: 000000c000e4b960 R08: 00007f30c730b108 R09: 0000000000000001
172.30.223.193: kern: warning: [2023-04-14T15:37:20.575213172Z]: R10: 0000000000000000 R11: 0000000000000400 R12: 00007f309fbb70b8
172.30.223.193: kern: warning: [2023-04-14T15:37:20.661733172Z]: R13: 07ffffffffffffff R14: 000000c0001c91e0 R15: ffffffffffffffff
172.30.223.193: kern: warning: [2023-04-14T15:37:20.748360172Z]:  </TASK>
172.30.223.193: kern:    info: [2023-04-14T15:37:20.775622172Z]: memory: usage 20504kB, limit 20480kB, failcnt 314
172.30.223.193: kern:    info: [2023-04-14T15:37:20.845490172Z]: swap: usage 0kB, limit 0kB, failcnt 0
172.30.223.193: kern:    info: [2023-04-14T15:37:20.902903172Z]: Memory cgroup stats for /kubepods/burstable/pod4fa70694-3c92-42de-852d-396c14443237/2d9b6c3b686df9e1e0a274014fbdc33c7fe234a3fc5f51e04fe0b10eed0bc371:
172.30.223.193: kern:    info: [2023-04-14T15:37:39.972518172Z]: anon 20180992\x0afile 0\x0akernel_stack 65536\x0apagetables 188416\x0apercpu 1440\x0asock 0\x0ashmem 0\x0afile_mapped 0\x0afile_dirty 0\x0afile_writeback 0\x0aswapcached 0\x0ainactive_anon 20078592\x0aactive_anon 4096\x0ainactive_file 0\x0aactive_file 0\x0aunevictable 0\x0aslab_reclaimable 39064\x0aslab_unreclaimable 72080\x0aslab 111144\x0aworkingset_refault_anon 0\x0aworkingset_refault_file 0\x0aworkingset_activate_anon 0\x0aworkingset_activate_file 0\x0aworkingset_restore_anon 0\x0aworkingset_restore_file 0\x0aworkingset_nodereclaim 0\x0apgfault 7653\x0apgmajfault 0\x0apgrefill 0\x0apgscan 0\x0apgsteal 0\x0apgactivate 0\x0apgdeactivate 0\x0apglazyfree 1\x0apglazyfreed 0
172.30.223.193: kern:    info: [2023-04-14T15:37:41.241807172Z]: Tasks state (memory values in pages):
172.30.223.193: kern:    info: [2023-04-14T15:37:41.299207172Z]: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
172.30.223.193: kern:    info: [2023-04-14T15:37:41.403408172Z]: [  61398] 65532 61369   187150    10882   204800        0           999 manager
172.30.223.193: kern:    info: [2023-04-14T15:37:41.504485172Z]: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=2d9b6c3b686df9e1e0a274014fbdc33c7fe234a3fc5f51e04fe0b10eed0bc371,mems_allowed=0-1,oom_memcg=/kubepods/burstable/pod4fa70694-3c92-42de-852d-396c14443237/2d9b6c3b686df9e1e0a274014fbdc33c7fe234a3fc5f51e04fe0b10eed0bc371,task_memcg=/kubepods/burstable/pod4fa70694-3c92-42de-852d-396c14443237/2d9b6c3b686df9e1e0a274014fbdc33c7fe234a3fc5f51e04fe0b10eed0bc371,task=manager,pid=61369,uid=65532
172.30.223.193: kern:     err: [2023-04-14T15:37:41.986487172Z]: Memory cgroup out of memory: Killed process 61398 (manager) total-vm:748600kB, anon-rss:19524kB, file-rss:24004kB, shmem-rss:0kB, UID:65532 pgtables:200kB oom_score_adj:999
```

roehrich-hpe commented 1 year ago

Both my cluster and Brian's cluster have plenty of memory, and the operator is not competing with other processes for it.

He raised his limit to 40Mi to get past the problem:

`kubectl patch deploy -n dws-operator-system dws-operator-controller-manager --type=json -p '[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value": "40Mi"}]'`

He increased it to 50Mi to handle 1000 workflows.

On my cluster, I increased it to 100Mi and was able to handle 3000 workflows.
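
One way to tell whether a new limit is holding is to check the manager pod's last termination reason; "OOMKilled" means the limit is still being hit. A minimal sketch, assuming the usual kubebuilder control-plane=controller-manager label on the pod:

```
# Print the last termination reason of the manager container, if any.
# "OOMKilled" indicates the memory limit is still too low for the workflow count.
kubectl get pods -n dws-operator-system -l control-plane=controller-manager \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
```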

roehrich-hpe commented 1 year ago

I have been gradually increasing my limit and workflow count. I had to go to 100Mi to get to 2000 workflows. And I've been able to leave that limit at 100Mi as I go to 3000 workflows.

roehrich-hpe commented 1 year ago

We've removed the memory limit in https://github.com/HewlettPackard/dws/pull/125
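
For a cluster still running an older deployment, the same effect can be approximated in place with a JSON patch that removes the limit (a sketch; the PR itself changes the shipped manifest):

```
# Remove the memory limit from the manager container on a running deployment.
kubectl patch deploy -n dws-operator-system dws-operator-controller-manager \
  --type=json \
  -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/resources/limits/memory"}]'
```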