Closed: jalbstmeijer closed this issue 2 years ago
LXD isn't in the I/O path of the instances it's running. LXD instructs LXC to instruct the kernel to set up the namespacing. LXD itself can be completely stopped without impacting the instances running on it.
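For what it's worth, this is easy to verify with the snap packaging mentioned later in this report (a hedged sketch; the service name assumes the snap):

# Restart the LXD daemon without touching running instances:
sudo systemctl reload snap.lxd.daemon
# Instances keep running and still show RUNNING afterwards:
lxc list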
The behavior you're describing matches what you'd expect from the kernel when running out of space in the VFS cache. Rather than being able to keep the data you've read in the VFS cache, the kernel has to evict it, forcing another trip over the network the next time the data is accessed, which causes the usage pattern you're seeing.
The fix for this is indeed to provide more memory to the instance so the kernel has more space available to cache the data and doesn't need to fetch it fresh every time.
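As a minimal sketch, using the instance name xxx from this report and an arbitrary 4GB value, this is the same limits.memory key visible in the expanded config below:

# Raise the container's memory limit; for containers this is applied live:
lxc config set xxx limits.memory 4GB
# Check current and peak usage afterwards:
lxc info xxx | grep -A 2 "Memory usage"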
Required information
The output of "lxc info":
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
Issue description
We have seen several times that containers become unresponsive. Looking with htop on the host, we see all processes (or at least many different ones, including init) in the container reading at maximum capacity from the container storage. In the AWS console we indeed see the EBS volume peaking at maximum throughput and staying there. The container remains in this state until restarted (we left it for hours; we don't think it ever stops).
The containers that showed this behavior are too different, and the processes reading at high throughput too random and unexpected (init, nscd, etc.), to point at a specific piece of software misbehaving.
It feels more like a race condition in LXC/LXD, and there seems to be a relation with memory. We have been able to make the behavior go away by simply assigning more memory to one of the containers that regularly experienced this issue. But that does not feel like a fix, more like a temporary workaround.
We assign dedicated EBS volumes to the containers, using dir or LVM backend storage, so the impact is limited to the storage of the container having the issue.
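For reference, a hedged sketch of how such per-container pools can be created (the device, pool, and mount point names here are illustrative, not taken from this setup):

# LVM-backed pool directly on a dedicated EBS block device:
lxc storage create ebs-pool1 lvm source=/dev/nvme1n1
# Or a dir-backed pool on an already-mounted EBS filesystem:
lxc storage create ebs-dir1 dir source=/mnt/ebs1
# Put a container's root disk on that pool at launch time:
lxc launch images:centos/7 xxx -s ebs-pool1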
We use the snap version of LXD.
Any suggestions as to what this issue could be and how to collect more useful debug information?
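One way to watch this while it is happening, sketched for a cgroup v1 host (the lxc.payload.xxx cgroup name below is taken from the dmesg output further down; on cgroup v2 hosts the files live under /sys/fs/cgroup/lxc.payload.xxx/ instead):

# A fast-climbing pgmajfault counter means pages are being re-read from
# storage instead of being served from the page cache:
watch -n 5 'grep -E "^(pgfault|pgmajfault)" /sys/fs/cgroup/memory/lxc.payload.xxx/memory.stat'
# Hits against the memory+swap ceiling (requires swap accounting, which the
# dmesg failcnt of 9064697 shows is enabled here):
cat /sys/fs/cgroup/memory/lxc.payload.xxx/memory.memsw.failcnt
# Correlate with per-device throughput on the dedicated EBS volume:
iostat -xm 5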
Steps to reproduce
Information to attach
dmesg
The last affected container also had OOM-killer activity during the issue:
[Wed May 11 22:33:38 2022] mysqld invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Wed May 11 22:33:38 2022] CPU: 3 PID: 5307 Comm: mysqld Not tainted 5.10.109-104.500.amzn2.x86_64 #1
[Wed May 11 22:33:38 2022] Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
[Wed May 11 22:33:38 2022] Call Trace:
[Wed May 11 22:33:38 2022]  dump_stack+0x57/0x70
[Wed May 11 22:33:38 2022]  dump_header+0x4a/0x1f0
[Wed May 11 22:33:38 2022]  oom_kill_process.cold+0xb/0x10
[Wed May 11 22:33:38 2022]  out_of_memory+0xe9/0x2e0
[Wed May 11 22:33:38 2022]  mem_cgroup_out_of_memory+0x134/0x150
[Wed May 11 22:33:38 2022]  try_charge+0x659/0x6e0
[Wed May 11 22:33:38 2022]  ? add_to_page_cache_locked+0x397/0x3f0
[Wed May 11 22:33:38 2022]  mem_cgroup_charge+0x7f/0x230
[Wed May 11 22:33:38 2022]  add_to_page_cache_locked+0x397/0x3f0
[Wed May 11 22:33:38 2022]  ? scan_shadow_nodes+0x30/0x30
[Wed May 11 22:33:38 2022]  add_to_page_cache_lru+0x48/0xc0
[Wed May 11 22:33:38 2022]  pagecache_get_page+0x14d/0x310
[Wed May 11 22:33:38 2022]  filemap_fault+0x4e0/0x6b0
[Wed May 11 22:33:38 2022]  ext4_filemap_fault+0x2d/0x40 [ext4]
[Wed May 11 22:33:38 2022]  __do_fault+0x38/0xe0
[Wed May 11 22:33:38 2022]  do_fault+0x98/0x240
[Wed May 11 22:33:38 2022]  __handle_mm_fault+0x333/0x640
[Wed May 11 22:33:38 2022]  handle_mm_fault+0xbe/0x290
[Wed May 11 22:33:38 2022]  do_user_addr_fault+0x1b3/0x3e0
[Wed May 11 22:33:38 2022]  exc_page_fault+0x68/0x120
[Wed May 11 22:33:38 2022]  ? asm_exc_page_fault+0x8/0x30
[Wed May 11 22:33:38 2022]  asm_exc_page_fault+0x1e/0x30
[Wed May 11 22:33:38 2022] RIP: 0033:0x555adb63efdb
[Wed May 11 22:33:38 2022] Code: Unable to access opcode bytes at RIP 0x555adb63efb1.
[Wed May 11 22:33:38 2022] RSP: 002b:00007f89a37fd750 EFLAGS: 00010246
[Wed May 11 22:33:38 2022] RAX: 00007f89a42f18d0 RBX: 00007f89a42f18d0 RCX: 00007f89a42f38c0
[Wed May 11 22:33:38 2022] RDX: 00007f89a42f38c0 RSI: 0000000000000000 RDI: 00007f89a42f18d0
[Wed May 11 22:33:38 2022] RBP: 00007f89a37fd980 R08: 00007f89a37fd790 R09: 00007f89a42f18d0
[Wed May 11 22:33:38 2022] R10: 00007f89a42f18d0 R11: 0000000000000206 R12: 0000000000000100
[Wed May 11 22:33:38 2022] R13: 00007f89a37fdbf0 R14: 0000000000000000 R15: 00007f89a37fdb80
[Wed May 11 22:33:38 2022] memory: usage 1827188kB, limit 1953124kB, failcnt 0
[Wed May 11 22:33:38 2022] memory+swap: usage 1953124kB, limit 1953124kB, failcnt 9064697
[Wed May 11 22:33:38 2022] kmem: usage 55136kB, limit 9007199254740988kB, failcnt 0
[Wed May 11 22:33:38 2022] Memory cgroup stats for /lxc.payload.xxx:
[Wed May 11 22:33:38 2022] anon 1184612352 file 616366080 kernel_stack 2850816 percpu 0 sock 0 shmem 612986880 file_mapped 36089856 file_dirty 135168 file_writeback 0 anon_thp 0 inactive_anon 1332809728 active_anon 477462528 inactive_file 2936832 active_file 0 unevictable 0 slab_reclaimable 5610624 slab_unreclaimable 15516088 slab 21126712 workingset_refault_anon 561 workingset_refault_file 12966822 workingset_activate_anon 429 workingset_activate_file 12552870 workingset_restore_anon 0 workingset_restore_file 12368829 workingset_nodereclaim 0 pgfault 7585541403 pgmajfault 622314 pgrefill 28735633 pgscan 1350598631 pgsteal 14224940 pgactivate 22778547 pgdeactivate 23395520 pglazyfree 0 pglazyfreed 0 thp_fault_alloc 0 thp_collapse_alloc 0
[Wed May 11 22:33:38 2022] Tasks state (memory values in pages):
[Wed May 11 22:33:38 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed May 11 22:33:38 2022] [ 9854] 1000000 9854 49041 557 147456 54 0 systemd
[Wed May 11 22:33:38 2022] [ 11341] 1000081 11341 14560 103 151552 74 0 dbus-daemon
[Wed May 11 22:33:38 2022] [ 11386] 1000028 11386 336191 198 327680 48 0 nscd
[Wed May 11 22:33:38 2022] [ 11385] 1000000 11385 117479 221 360448 277 0 NetworkManager
[Wed May 11 22:33:38 2022] [ 11301] 1000000 11301 22041 11681 217088 27 0 systemd-journal
[Wed May 11 22:33:38 2022] [ 11364] 1000000 11364 6198 110 90112 30 0 systemd-logind
[Wed May 11 22:33:38 2022] [ 11369] 1000000 11369 31600 38 106496 119 0 crond
[Wed May 11 22:33:38 2022] [ 11324] 1000000 11324 10386 5 114688 93 0 systemd-udevd
[Wed May 11 22:33:38 2022] [ 11533] 1000999 11533 156220 117 278528 769 0 polkitd
[Wed May 11 22:33:38 2022] [ 12795] 1000000 12795 144620 8840 565248 91 0 rsyslogd
[Wed May 11 22:33:38 2022] [ 12781] 1000000 12781 27878 35 253952 219 0 sshd
[Wed May 11 22:33:38 2022] [ 17249] 1000000 17249 8448 42 102400 0 0 sshd
[Wed May 11 22:33:38 2022] [ 30078] 1000000 30078 27879 39 241664 216 0 sshd
[Wed May 11 22:33:38 2022] [ 13034] 1000996 13034 95301 7545 212992 133 0 memcached
[Wed May 11 22:33:38 2022] [ 12956] 1000996 12956 94294 6320 212992 175 0 memcached
[Wed May 11 22:33:38 2022] [ 13193] 1001003 13193 117126 2642 602112 453 0 php-fpm
[Wed May 11 22:33:38 2022] [ 13194] 1001003 13194 117157 2659 602112 453 0 php-fpm
[Wed May 11 22:33:38 2022] [ 8961] 1001003 8961 117121 2641 602112 454 0 php-fpm
[Wed May 11 22:33:38 2022] [ 5102] 1001003 5102 130595 15743 708608 450 0 php-fpm
[Wed May 11 22:33:38 2022] [ 7853] 1001003 7853 130603 15439 708608 450 0 php-fpm
[Wed May 11 22:33:38 2022] [ 7864] 1001003 7864 131111 15899 712704 450 0 php-fpm
[Wed May 11 22:33:38 2022] [ 10342] 1001003 10342 131597 16549 716800 450 0 php-fpm
[Wed May 11 22:33:38 2022] [ 10802] 1001003 10802 130586 15520 708608 450 0 php-fpm
[Wed May 11 22:33:38 2022] [ 15949] 1001003 15949 129028 14119 696320 451 0 php-fpm
[Wed May 11 22:33:38 2022] [ 16020] 1001003 16020 68485 656 524288 567 0 php-fpm
[Wed May 11 22:33:38 2022] [ 16498] 1000000 16498 22132 28 192512 228 0 master
[Wed May 11 22:33:38 2022] [ 31701] 1000089 31701 22107 249 196608 0 0 showq
[Wed May 11 22:33:38 2022] [ 15070] 1000089 15070 22142 259 204800 0 0 cleanup
[Wed May 11 22:33:38 2022] [ 15072] 1000089 15072 22142 239 196608 0 0 cleanup
[Wed May 11 22:33:38 2022] [ 15074] 1000000 15074 22132 264 204800 0 0 local
[Wed May 11 22:33:38 2022] [ 15081] 1000000 15081 22132 238 200704 0 0 local
[Wed May 11 22:33:38 2022] [ 32501] 1000089 32501 22115 256 200704 0 0 bounce
[Wed May 11 22:33:38 2022] [ 32502] 1000089 32502 22115 237 196608 0 0 bounce
[Wed May 11 22:33:38 2022] [ 8920] 1000089 8920 22106 252 200704 0 0 trivial-rewrite
[Wed May 11 22:33:38 2022] [ 8933] 1000089 8933 22105 252 196608 0 0 pickup
[Wed May 11 22:33:38 2022] [ 25373] 1000000 25373 62324 755 532480 33 0 httpd
[Wed May 11 22:33:38 2022] [ 8879] 1000000 8879 1061 18 49152 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8880] 1000000 8880 1095 25 57344 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8881] 1000000 8881 1063 19 49152 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8882] 1000000 8882 1094 24 49152 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8883] 1000000 8883 1096 26 57344 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8884] 1000000 8884 1095 25 53248 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8885] 1000000 8885 1095 25 57344 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8886] 1000000 8886 1095 25 53248 0 0 cronolog
[Wed May 11 22:33:38 2022] [ 8887] 1000048 8887 62324 711 520192 33 0 httpd
[Wed May 11 22:33:38 2022] [ 24042] 1000048 24042 62448 939 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 17814] 1000048 17814 62429 941 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 9155] 1000048 9155 62429 945 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 28010] 1000048 28010 62423 941 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 28503] 1000048 28503 62424 943 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 20461] 1000048 20461 62421 936 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 20482] 1000048 20482 62423 932 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 20483] 1000048 20483 62428 937 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 20484] 1000048 20484 62421 944 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 20485] 1000048 20485 62423 932 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 8723] 1000048 8723 62429 939 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 28888] 1000048 28888 62423 926 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 4007] 1000048 4007 62421 925 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 26608] 1000048 26608 62425 933 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 27363] 1000048 27363 62429 938 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 854] 1000048 854 62420 933 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 7819] 1000048 7819 62397 924 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 8062] 1000048 8062 62430 931 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 18861] 1000048 18861 62429 929 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 9725] 1000048 9725 62423 897 528384 33 0 httpd
[Wed May 11 22:33:38 2022] [ 18164] 1000000 18164 62324 712 512000 33 0 httpd
[Wed May 11 22:33:39 2022] [ 19704] 1000000 19704 62324 712 512000 33 0 httpd
[Wed May 11 22:33:39 2022] [ 4927] 1000027 4927 450771 94890 1228800 0 0 mysqld
[Wed May 11 22:33:39 2022] [ 8968] 1000000 8968 44641 1150 204800 97 0 crond
[Wed May 11 22:33:39 2022] [ 8980] 1001003 8980 76441 11081 614400 0 0 php
[Wed May 11 22:33:39 2022] [ 9343] 1001003 9343 22096 237 204800 0 0 sendmail
[Wed May 11 22:33:39 2022] [ 9344] 1001003 9344 28324 37 77824 0 0 sh
[Wed May 11 22:33:39 2022] [ 9348] 1001003 9348 22089 236 204800 0 0 postdrop
[Wed May 11 22:33:39 2022] [ 9351] 1001003 9351 83462 18251 684032 0 0 php
[Wed May 11 22:33:39 2022] [ 16007] 1001003 16007 29401 61 86016 0 0 git
[Wed May 11 22:33:39 2022] [ 16009] 1001003 16009 44352 233 188416 0 0 ssh
[Wed May 11 22:33:39 2022] [ 8967] 1000000 8967 44641 1150 204800 97 0 crond
[Wed May 11 22:33:39 2022] [ 8977] 1001003 8977 76447 11100 622592 0 0 php
[Wed May 11 22:33:39 2022] [ 9318] 1001003 9318 22096 238 200704 0 0 sendmail
[Wed May 11 22:33:39 2022] [ 9320] 1001003 9320 28324 37 77824 0 0 sh
[Wed May 11 22:33:39 2022] [ 9336] 1001003 9336 85038 19365 696320 0 0 php
[Wed May 11 22:33:39 2022] [ 9337] 1001003 9337 22089 236 200704 0 0 postdrop
[Wed May 11 22:33:39 2022] [ 16062] 1001003 16062 29188 92 77824 0 0 git
[Wed May 11 22:33:39 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.xxx,mems_allowed=0,oom_memcg=/lxc.payload.xxx,task_memcg=/lxc.payload.xxx,task=mysqld,pid=4927,uid=1000027
[Wed May 11 22:33:39 2022] Memory cgroup out of memory: Killed process 4927 (mysqld) total-vm:1803084kB, anon-rss:379560kB, file-rss:0kB, shmem-rss:0kB, UID:1000027 pgtables:1200kB oom_score_adj:0
[Wed May 11 22:33:39 2022] oom_reaper: reaped process 4927 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
lxc info NAME --show-log
Name: xxx
Status: RUNNING
Type: container
Architecture: x86_64
PID: 6295
Created: 2021/11/15 13:07 CET
Last Used: 2022/05/12 06:31 CEST
Resources:
  Processes: 146
  CPU usage:
    CPU usage (in seconds): 5622
  Memory usage:
    Memory (current): 1.08GiB
    Memory (peak): 1.82GiB
  Network usage:
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 12.91GB
      Bytes sent: 12.91GB
      Packets received: 18902200
      Packets sent: 18902200
      IP addresses:
        inet:  127.0.0.1/8 (local)
        inet6: ::1/128 (local)
    eth0:
      Type: broadcast
      State: UP
      Host interface: vethd6f28586
      MAC address: xxx
      MTU: 9001
      Bytes received: 5.62GB
      Bytes sent: 968.11MB
      Packets received: 1778932
      Packets sent: 1478822
      IP addresses:
        inet:  xxx/24 (global)
        inet6: xxx/64 (link)
Log:
lxc xxx 20220512043147.329 WARN conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc xxx 20220512043147.329 WARN conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc xxx 20220512043147.330 WARN conf - conf.c:lxc_map_ids:3592 - newuidmap binary is missing
lxc xxx 20220512043147.330 WARN conf - conf.c:lxc_map_ids:3598 - newgidmap binary is missing
lxc xxx 20220512043154.168 WARN attach - attach.c:get_attach_context:477 - No security context received
lxc config show NAME --expanded
architecture: x86_64
config:
  boot.autostart: "true"
  image.architecture: amd64
  image.description: Centos 7 amd64 (20211115_07:08)
  image.os: Centos
  image.release: "7"
  image.serial: "20211115_07:08"
  image.type: squashfs
  image.variant: default
  limits.cpu: "2"
  limits.memory: 2000MB
  volatile.base_image: ec211b8b9f08f9813a8be0192025525e6f1bd08930f6c0bba35d8900c85f7151
  volatile.eth0.host_name: vethd6f28586
  volatile.eth0.hwaddr: xxx
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 9ec90aa2-1e62-44e4-98d6-119719b2c425
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br1
    type: nic
  root:
    path: /
    pool: xxx
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
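As a sanity check tying this config to the dmesg output above: the 2000MB limits.memory value, rounded down by the kernel to whole 4 KiB pages, is exactly the 1953124kB memory+swap ceiling in the OOM report:

# 2000 MB -> whole 4096-byte pages -> kB, matching dmesg's "limit 1953124kB":
echo $(( 2000 * 1000 * 1000 / 4096 * 4096 / 1024 ))   # prints 1953124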
The following errors appear in the LXD log when restarting the container to get it working again:
time="2022-05-12T06:24:55+02:00" level=error msg="Failed to retrieve PID of executing child process" instance=xxx instanceType=container project=default time="2022-05-12T06:24:55+02:00" level=error msg="Failed to retrieve PID of executing child process" instance=xxx instanceType=container project=default