coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/
147 stars 30 forks source link

rapid OOMKills cause root fs to be remounted readonly #2586

Closed redbaron closed 5 years ago

redbaron commented 5 years ago

Bug

It seems that multiple OOMkillers invoked in close succession cause ext4 filesystem to be remounted readonly due to journal corruption.

It was producing coredumps in parallel, so it might be also contributing to the problem.

May 29 01:38:43 erepnk10 kernel: celery invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=993
May 29 01:38:43 erepnk10 kernel: celery cpuset=4e75f39b973322cdd9f64cdb0f3634525a7059253e4065f0ff2be0077b00eb4f mems_allowed=0-1
May 29 01:38:43 erepnk10 kernel: CPU: 1 PID: 27714 Comm: celery Not tainted 4.14.84-coreos #1
May 29 01:38:43 erepnk10 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
May 29 01:38:43 erepnk10 kernel: Call Trace:
May 29 01:38:43 erepnk10 kernel:  dump_stack+0x5c/0x85
May 29 01:38:43 erepnk10 kernel:  dump_header+0x94/0x229
May 29 01:38:43 erepnk10 kernel:  oom_kill_process+0x213/0x410
May 29 01:38:43 erepnk10 kernel:  out_of_memory+0x2ab/0x4c0
May 29 01:38:43 erepnk10 kernel:  mem_cgroup_out_of_memory+0x49/0x80
May 29 01:38:43 erepnk10 kernel:  mem_cgroup_oom_synchronize+0x2ed/0x330
May 29 01:38:43 erepnk10 kernel:  ? mem_cgroup_css_online+0x30/0x30
May 29 01:38:43 erepnk10 kernel:  pagefault_out_of_memory+0x32/0x77
May 29 01:38:43 erepnk10 kernel:  __do_page_fault+0x4b3/0x4c0
May 29 01:38:43 erepnk10 kernel:  ? page_fault+0x2f/0x50
May 29 01:38:43 erepnk10 kernel:  page_fault+0x45/0x50
May 29 01:38:43 erepnk10 kernel: RIP: 0000:0x4d1030
May 29 01:38:43 erepnk10 kernel: RSP: f7646280:000055ccf7a872b0 EFLAGS: 7f6bf0622b00
May 29 01:38:43 erepnk10 kernel: Task in /kubepods/burstable/pod83a2d7bf-815a-11e9-bacb-005056bced2e/4e75f39b973322cdd9f64cdb0f3634525a7059253e4065f0ff2be0077b00eb4f killed as a result of limit of /kubepods/burstable/pod83a2d7bf-815a-11e9-bacb-005056bced2e/4e75f39b973322cdd9f64cdb0f3634525a7059253e4065f0ff2be0077b00eb4f
May 29 01:38:43 erepnk10 kernel: memory: usage 524288kB, limit 524288kB, failcnt 1
May 29 01:38:43 erepnk10 kernel: memory+swap: usage 524288kB, limit 524288kB, failcnt 187653
May 29 01:38:43 erepnk10 kernel: kmem: usage 15596kB, limit 9007199254740988kB, failcnt 0
May 29 01:38:43 erepnk10 kernel: Memory cgroup stats for /kubepods/burstable/pod83a2d7bf-815a-11e9-bacb-005056bced2e/4e75f39b973322cdd9f64cdb0f3634525a7059253e4065f0ff2be0077b00eb4f: cache:20KB rss:508672KB rss_huge:28672KB shmem:0KB mapped_file:4KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:508672KB inactive_file:16KB active_file:4KB unevictable:0KB
May 29 01:38:43 erepnk10 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
May 29 01:38:43 erepnk10 kernel: [23316]     0 23316     1048       76       8       3        0           993 tini
May 29 01:38:43 erepnk10 kernel: [23362]     0 23362    90649    19727     179       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: [26295]     0 26295   135372    62602     267       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: [27714]     0 27714   135002    61169     262       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: Memory cgroup out of memory: Kill process 26295 (celery) score 1472 or sacrifice child
May 29 01:38:43 erepnk10 kernel: Killed process 26295 (celery) total-vm:541488kB, anon-rss:244112kB, file-rss:6296kB, shmem-rss:0kB
May 29 01:38:43 erepnk10 kernel: oom_reaper: reaped process 26295 (celery), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 29 01:38:43 erepnk10 kernel: celery invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=993
May 29 01:38:43 erepnk10 kernel: celery cpuset=613687bcb8505dadcc2dabed12bdd4fe5d1ffc4da1804833d94ea43f6ac1d91c mems_allowed=0-1
May 29 01:38:43 erepnk10 kernel: CPU: 6 PID: 22285 Comm: celery Not tainted 4.14.84-coreos #1
May 29 01:38:43 erepnk10 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
May 29 01:38:43 erepnk10 kernel: Call Trace:
May 29 01:38:43 erepnk10 kernel:  dump_stack+0x5c/0x85
May 29 01:38:43 erepnk10 kernel:  dump_header+0x94/0x229
May 29 01:38:43 erepnk10 kernel:  oom_kill_process+0x213/0x410
May 29 01:38:43 erepnk10 kernel:  out_of_memory+0x2ab/0x4c0
May 29 01:38:43 erepnk10 kernel:  mem_cgroup_out_of_memory+0x49/0x80
May 29 01:38:43 erepnk10 kernel:  mem_cgroup_oom_synchronize+0x2ed/0x330
May 29 01:38:43 erepnk10 kernel:  ? mem_cgroup_css_online+0x30/0x30
May 29 01:38:43 erepnk10 kernel:  pagefault_out_of_memory+0x32/0x77
May 29 01:38:43 erepnk10 kernel:  __do_page_fault+0x4b3/0x4c0
May 29 01:38:43 erepnk10 kernel:  ? page_fault+0x2f/0x50
May 29 01:38:43 erepnk10 kernel:  page_fault+0x45/0x50
May 29 01:38:43 erepnk10 kernel: RIP: 714e9430:0x4282
May 29 01:38:43 erepnk10 kernel: RSP: 0004:0000000000000008 EFLAGS: 7ff32044dfd0
May 29 01:38:43 erepnk10 kernel: Task in /kubepods/burstable/podc2c885b7-815a-11e9-bacb-005056bced2e/613687bcb8505dadcc2dabed12bdd4fe5d1ffc4da1804833d94ea43f6ac1d91c killed as a result of limit of /kubepods/burstable/podc2c885b7-815a-11e9-bacb-005056bced2e/613687bcb8505dadcc2dabed12bdd4fe5d1ffc4da1804833d94ea43f6ac1d91c
May 29 01:38:43 erepnk10 kernel: memory: usage 524288kB, limit 524288kB, failcnt 0
May 29 01:38:43 erepnk10 kernel: memory+swap: usage 524288kB, limit 524288kB, failcnt 273334
May 29 01:38:43 erepnk10 kernel: kmem: usage 15152kB, limit 9007199254740988kB, failcnt 0
May 29 01:38:43 erepnk10 kernel: Memory cgroup stats for /kubepods/burstable/podc2c885b7-815a-11e9-bacb-005056bced2e/613687bcb8505dadcc2dabed12bdd4fe5d1ffc4da1804833d94ea43f6ac1d91c: cache:100KB rss:509036KB rss_huge:24576KB shmem:0KB mapped_file:24KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:509036KB inactive_file:88KB active_file:12KB unevictable:0KB
May 29 01:38:43 erepnk10 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
May 29 01:38:43 erepnk10 kernel: [26113]     0 26113     1048      113       8       3        0           993 tini
May 29 01:38:43 erepnk10 kernel: [26166]     0 26166    90652    19837     182       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: [23204]     0 23204   129831    57008     263       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: [11216]     0 11216   129929    57230     258       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: [22285]     0 22285    35551     9916      74       3        0           993 celery
May 29 01:38:43 erepnk10 kernel: Memory cgroup out of memory: Kill process 11216 (celery) score 1431 or sacrifice child
May 29 01:38:43 erepnk10 kernel: Killed process 11216 (celery) total-vm:519716kB, anon-rss:223004kB, file-rss:5916kB, shmem-rss:0kB
May 29 01:38:43 erepnk10 kernel: oom_reaper: reaped process 11216 (celery), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 29 01:38:44 erepnk10 kernel: celery invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null),  order=0, oom_score_adj=993
May 29 01:38:44 erepnk10 kernel: celery cpuset=8ce74fb4ecb3c9c8f632887947b677bbffe39eea443c7409e3d3254b402dddf1 mems_allowed=0-1
May 29 01:38:44 erepnk10 kernel: CPU: 1 PID: 22307 Comm: celery Not tainted 4.14.84-coreos #1
May 29 01:38:44 erepnk10 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
May 29 01:38:44 erepnk10 kernel: Call Trace:
May 29 01:38:44 erepnk10 kernel:  dump_stack+0x5c/0x85
May 29 01:38:44 erepnk10 kernel:  dump_header+0x94/0x229
May 29 01:38:44 erepnk10 kernel:  oom_kill_process+0x213/0x410
May 29 01:38:44 erepnk10 kernel:  out_of_memory+0x2ab/0x4c0
May 29 01:38:44 erepnk10 kernel:  mem_cgroup_out_of_memory+0x49/0x80
May 29 01:38:44 erepnk10 kernel:  mem_cgroup_oom_synchronize+0x2ed/0x330
May 29 01:38:44 erepnk10 kernel:  ? mem_cgroup_css_online+0x30/0x30
May 29 01:38:44 erepnk10 kernel:  pagefault_out_of_memory+0x32/0x77
May 29 01:38:44 erepnk10 kernel:  __do_page_fault+0x4b3/0x4c0
May 29 01:38:44 erepnk10 kernel:  ? page_fault+0x2f/0x50
May 29 01:38:44 erepnk10 kernel:  page_fault+0x45/0x50
May 29 01:38:44 erepnk10 kernel: RIP: 20244e0:0x55bfec80b1b0
May 29 01:38:44 erepnk10 kernel: RSP: 0163:000000000000002c EFLAGS: 00000058
May 29 01:38:44 erepnk10 kernel: Task in /kubepods/burstable/pod0613c655-7ca2-11e9-bacb-005056bced2e/8ce74fb4ecb3c9c8f632887947b677bbffe39eea443c7409e3d3254b402dddf1 killed as a result of limit of /kubepods/burstable/pod0613c655-7ca2-11e9-bacb-005056bced2e/8ce74fb4ecb3c9c8f632887947b677bbffe39eea443c7409e3d3254b402dddf1
May 29 01:38:44 erepnk10 kernel: memory: usage 524288kB, limit 524288kB, failcnt 0
May 29 01:38:44 erepnk10 kernel: memory+swap: usage 524288kB, limit 524288kB, failcnt 158890
May 29 01:38:44 erepnk10 kernel: kmem: usage 14456kB, limit 9007199254740988kB, failcnt 0
May 29 01:38:44 erepnk10 kernel: Memory cgroup stats for /kubepods/burstable/pod0613c655-7ca2-11e9-bacb-005056bced2e/8ce74fb4ecb3c9c8f632887947b677bbffe39eea443c7409e3d3254b402dddf1: cache:100KB rss:509732KB rss_huge:47104KB shmem:0KB mapped_file:40KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:509732KB inactive_file:56KB active_file:44KB unevictable:0KB
May 29 01:38:44 erepnk10 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
May 29 01:38:44 erepnk10 kernel: [27893]     0 27893     1048       21       8       3        0           993 tini
May 29 01:38:44 erepnk10 kernel: [27980]     0 27980    90654    19635     179       3        0           993 celery
May 29 01:38:44 erepnk10 kernel: [ 8083]     0  8083   112364    39107     213       3        0           993 celery
May 29 01:38:44 erepnk10 kernel: [21783]     0 21783   151728    78528     300       4        0           993 celery
May 29 01:38:44 erepnk10 kernel: [22307]     0 22307    34014     8485      70       4        0           993 celery
May 29 01:38:44 erepnk10 kernel: Memory cgroup out of memory: Kill process 21783 (celery) score 1593 or sacrifice child
May 29 01:38:44 erepnk10 kernel: Killed process 21783 (celery) total-vm:606912kB, anon-rss:309588kB, file-rss:4524kB, shmem-rss:0kB
May 29 01:38:45 erepnk10 kernel: oom_reaper: reaped process 21783 (celery), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 29 01:39:51 erepnk10 kernel: EXT4-fs error (device sda9): ext4_lookup:1587: inode #11496690: comm du: deleted inode referenced: 11470787
May 29 01:39:51 erepnk10 kernel: Aborting journal on device sda9-8.
May 29 01:39:51 erepnk10 kernel: EXT4-fs error (device sda9) in ext4_setattr:5498: Readonly filesystem
May 29 01:39:51 erepnk10 kernel: EXT4-fs (sda9): Remounting filesystem read-only

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1911.5.0
VERSION_ID=1911.5.0
BUILD_ID=2018-12-15-2317
PRETTY_NAME="Container Linux by CoreOS 1911.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

platform: VMware kernel: 4.14.84-coreos

bgilbert commented 5 years ago

1911.5.0 was old even when this bug was filed. Are you still seeing it on a current release of Container Linux?

redbaron commented 5 years ago

We fixed application and it stopped OOMKiller from ruining our nodes :) that happened before coreos upgrade, so I can't tell :(

bgilbert commented 5 years ago

Okay. 4.14 had known issues along these lines. We're shipping the 4.19 kernel now, so I'll close. Thanks for reporting.