Docker container start error

I'm starting a bunch of containers to grade student work / generate feedback forms / etc, and I'm seeing this error periodically when I try to start a docker container:

500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child\'s pid from pipe caused \\"EOF\\"": unknown")

This gets reported as a DockerAPIError and correctly reported to the instructor. But:

It's not deterministic -- if I run the exact same command again a second time, it doesn't happen
Only seems to happen when I'm creating lots of docker containers (about 177 of them, 6 running at a time)
Adding a 0.25s sleep between container creation seemed (?) to reduce the overall # of these errors but not get rid of them
Reducing the memory per container to 1GB and # concurrent containers to 4 didn't seem to do much
after testing for a while, it seems that once container creation starts throwing the above error once, it starts to do it more often
restarting containerd / docker does not help. Rebooting the VM resolves the issue (not sure if permanently...)

Still experimenting to figure out what's going on here.

Current status of this: I tried adding a 10 second pause after each thrown error + 5 attempts at starting each container. It does seem to help a lot at first (when an attempt fails, it usually fails once or twice and then succeeds...). But after a while starting containers fails more and more and even a 10 second pause between starting containers doesn't help.

Here's /var/log/messages for what looks like one failed container startup...

Based on the runc page allocation failure, I assume this must be due to some memory allocation problem.

https://community.gravitational.com/t/pods-memory-allocation-failures/675

Sep 28 04:07:22 dsci-100-instructor containerd: time="2020-09-28T04:07:22.068981726Z" level=info msg="shim reaped" id=fe471bd97c4404634637fbe566ded616c9f8fd03be7652c8cd097d995e5b7de3
Sep 28 04:07:22 dsci-100-instructor dockerd: time="2020-09-28T04:07:22.079155571Z" level=error msg="stream copy error: reading from a closed fifo"
Sep 28 04:07:22 dsci-100-instructor dockerd: time="2020-09-28T04:07:22.079264570Z" level=error msg="stream copy error: reading from a closed fifo"
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth0c720c7) entered disabled state
Sep 28 04:07:22 dsci-100-instructor kernel: device veth0c720c7 left promiscuous mode
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth0c720c7) entered disabled state
Sep 28 04:07:22 dsci-100-instructor dockerd: time="2020-09-28T04:07:22.129654940Z" level=error msg="fe471bd97c4404634637fbe566ded616c9f8fd03be7652c8cd097d995e5b7de3 cleanup: failed to delete container from containerd: no such container"
Sep 28 04:07:22 dsci-100-instructor dockerd: time="2020-09-28T04:07:22.129688190Z" level=error msg="Handler for POST /v1.40/containers/fe471bd97c4404634637fbe566ded616c9f8fd03be7652c8cd097d995e5b7de3/start returned error: OCI runtime create failed: container_linux.go:349: starting container process caused \"process_linux.go:319: getting the final child's pid from pipe caused \\\"EOF\\\"\": unknown"
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth6225c12) entered blocking state
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth6225c12) entered disabled state
Sep 28 04:07:22 dsci-100-instructor kernel: device veth6225c12 entered promiscuous mode
Sep 28 04:07:22 dsci-100-instructor kernel: IPv6: ADDRCONF(NETDEV_UP): veth6225c12: link is not ready
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth6225c12) entered blocking state
Sep 28 04:07:22 dsci-100-instructor kernel: docker0: port 3(veth6225c12) entered forwarding state
Sep 28 04:07:22 dsci-100-instructor containerd: time="2020-09-28T04:07:22.197835728Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/061f223be39171518b02d58bf5285a44ec405c4eea40d2f8a9acc82c1500e251/shim.sock" debug=false pid=20495
Sep 28 04:07:22 dsci-100-instructor kernel: runc:[1:CHILD]: page allocation failure: order:4, mode:0xc0d0
Sep 28 04:07:22 dsci-100-instructor kernel: CPU: 7 PID: 20511 Comm: runc:[1:CHILD] Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.19.1.el7.x86_64 #1
Sep 28 04:07:22 dsci-100-instructor kernel: Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Sep 28 04:07:22 dsci-100-instructor kernel: Call Trace:
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9897ffa5>] dump_stack+0x19/0x1b
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983c4b70>] warn_alloc_failed+0x110/0x180
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9897b4c0>] __alloc_pages_slowpath+0x6bb/0x729
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983c91f6>] __alloc_pages_nodemask+0x436/0x450
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98418ea8>] alloc_pages_current+0x98/0x110
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983e57c8>] kmalloc_order+0x18/0x40
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98424466>] kmalloc_order_trace+0x26/0xa0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff984283f1>] __kmalloc+0x211/0x230
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff984412c1>] memcg_alloc_cache_params+0x81/0xb0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983e5474>] do_kmem_cache_create+0x74/0xf0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983e55f2>] kmem_cache_create+0x102/0x1b0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffffc0c1ede0>] nf_conntrack_init_net+0x100/0x270 [nf_conntrack]
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffffc0c1f6e4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98848074>] ops_init+0x44/0x150
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9884823b>] setup_net+0xbb/0x170
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff988489d5>] copy_net_ns+0xb5/0x180
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff982cb909>] create_new_namespaces+0xf9/0x180
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff982cbb4a>] unshare_nsproxy_namespaces+0x5a/0xc0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9829b24b>] SyS_unshare+0x1cb/0x340
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98992ed2>] system_call_fastpath+0x25/0x2a
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98992e15>] ? system_call_after_swapgs+0xa2/0x13a
Sep 28 04:07:22 dsci-100-instructor kernel: Mem-Info:
Sep 28 04:07:22 dsci-100-instructor kernel: active_anon:296118 inactive_anon:52795 isolated_anon:0#012 active_file:551659 inactive_file:599562 isolated_file:0#012 unevictable:0 dirty:39 writeback:4 unstable:0#012 slab_reclaimable:121496 slab_unreclaimable:279345#012 mapped:44130 shmem:160012 pagetables:4682 bounce:0#012 free:213684 free_pcp:86 free_cma:0
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 DMA free:15876kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 28 04:07:22 dsci-100-instructor kernel: lowmem_reserve[]: 0 2830 15865 15865
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 DMA32 free:178732kB min:12044kB low:15052kB high:18064kB active_anon:212540kB inactive_anon:38684kB active_file:314448kB inactive_file:331288kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129184kB managed:2898752kB mlocked:0kB dirty:8kB writeback:0kB mapped:28592kB shmem:117744kB slab_reclaimable:90764kB slab_unreclaimable:196556kB kernel_stack:1792kB pagetables:2116kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 28 04:07:22 dsci-100-instructor kernel: lowmem_reserve[]: 0 0 13035 13035
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 Normal free:659448kB min:55468kB low:69332kB high:83200kB active_anon:972192kB inactive_anon:172496kB active_file:1892188kB inactive_file:2066960kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13350860kB mlocked:0kB dirty:148kB writeback:16kB mapped:147928kB shmem:522304kB slab_reclaimable:395220kB slab_unreclaimable:920792kB kernel_stack:6144kB pagetables:16612kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 28 04:07:22 dsci-100-instructor kernel: lowmem_reserve[]: 0 0 0 0
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 DMA32: 8069*4kB (UEM) 7489*8kB (UEM) 5307*16kB (UEM) 47*32kB (UM) 2*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 178732kB
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 Normal: 32871*4kB (UEM) 33274*8kB (UEM) 15541*16kB (UEM) 359*32kB (UEM) 14*64kB (UEM) 4*128kB (UE) 0*256kB 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 659740kB
Sep 28 04:07:22 dsci-100-instructor kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 28 04:07:22 dsci-100-instructor kernel: 1311233 total pagecache pages
Sep 28 04:07:22 dsci-100-instructor kernel: 0 pages in swap cache
Sep 28 04:07:22 dsci-100-instructor kernel: Swap cache stats: add 0, delete 0, find 0/0
Sep 28 04:07:22 dsci-100-instructor kernel: Free swap  = 0kB
Sep 28 04:07:22 dsci-100-instructor kernel: Total swap = 0kB
Sep 28 04:07:22 dsci-100-instructor kernel: 4194166 pages RAM
Sep 28 04:07:22 dsci-100-instructor kernel: 0 pages HighMem/MovableOnly
Sep 28 04:07:22 dsci-100-instructor kernel: 127786 pages reserved
Sep 28 04:07:22 dsci-100-instructor kernel: kmem_cache_create(nf_conntrack_6025) failed with error -12
Sep 28 04:07:22 dsci-100-instructor kernel: CPU: 7 PID: 20511 Comm: runc:[1:CHILD] Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.19.1.el7.x86_64 #1
Sep 28 04:07:22 dsci-100-instructor kernel: Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Sep 28 04:07:22 dsci-100-instructor kernel: Call Trace:
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9897ffa5>] dump_stack+0x19/0x1b
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff983e5677>] kmem_cache_create+0x187/0x1b0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffffc0c1ede0>] nf_conntrack_init_net+0x100/0x270 [nf_conntrack]
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffffc0c1f6e4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98848074>] ops_init+0x44/0x150
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9884823b>] setup_net+0xbb/0x170
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff988489d5>] copy_net_ns+0xb5/0x180
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff982cb909>] create_new_namespaces+0xf9/0x180
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff982cbb4a>] unshare_nsproxy_namespaces+0x5a/0xc0
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff9829b24b>] SyS_unshare+0x1cb/0x340
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98992ed2>] system_call_fastpath+0x25/0x2a
Sep 28 04:07:22 dsci-100-instructor kernel: [<ffffffff98992e15>] ? system_call_after_swapgs+0xa2/0x13a
Sep 28 04:07:22 dsci-100-instructor kernel: Unable to create nf_conn slab cache

UBC-DSCI / rudaux

Docker container start error #73