cockpit-project / cockpit

Cockpit is a web-based graphical interface for servers.
http://www.cockpit-project.org/
GNU Lesser General Public License v2.1
11.25k stars 1.11k forks source link

remotectl OOM out of memory errors #12949

Closed SomePersonSomeWhereInTheWorld closed 3 years ago

SomePersonSomeWhereInTheWorld commented 5 years ago

I created a bug upstream for Fedora 30. I had to uninstall all the Cockpit services. Even after I ran systemctl disable cockpiit, it would restart, or try to restart after a reboot. A couple times an hour the emotectlservice would get killed via the out-of-memory killer process. I ran the Dell hardware diagnostics a couple of times, i.e., in the iDRAC, and no errors.

Oct  9 10:37:10 ourserver kernel: (emotectl) invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Oct  9 10:37:11 ourserver kernel: CPU: 2 PID: 27603 Comm: (emotectl) Not tainted 5.2.17-200.fc30.x86_64 #1
Oct  9 10:37:11 ourserver kernel: Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.3.10 08/15/2019
Oct  9 10:37:11 ourserver kernel: Call Trace:
Oct  9 10:37:11 ourserver kernel: dump_stack+0x5c/0x80
Oct  9 10:37:11 ourserver kernel: dump_header+0x51/0x2ff
Oct  9 10:37:11 ourserver kernel: ? sched_clock+0x5/0x10
Oct  9 10:37:11 ourserver kernel: ? oom_unkillable_task+0x9a/0xd0
Oct  9 10:37:11 ourserver kernel: oom_kill_process.cold+0xb/0x10
Oct  9 10:37:11 ourserver kernel: out_of_memory+0x1d4/0x480
Oct  9 10:37:11 ourserver kernel: __alloc_pages_slowpath+0xcc4/0xde0
Oct  9 10:37:11 ourserver kernel: __alloc_pages_nodemask+0x2ee/0x340
Oct  9 10:37:11 ourserver kernel: pagecache_get_page+0xb4/0x370
Oct  9 10:37:11 ourserver kernel: filemap_fault+0x683/0x9f0
Oct  9 10:37:11 ourserver kernel: ? __mod_lruvec_state+0x3f/0xe0
Oct  9 10:37:11 ourserver kernel: ? page_add_file_rmap+0xf4/0x230
Oct  9 10:37:11 ourserver kernel: ? alloc_set_pte+0x498/0x6b0
Oct  9 10:37:11 ourserver kernel: ? file_fdatawait_range+0x20/0x20
Oct  9 10:37:11 ourserver kernel: ? _cond_resched+0x15/0x30
Oct  9 10:37:11 ourserver kernel: __xfs_filemap_fault+0x6d/0x200 [xfs]
Oct  9 10:37:11 ourserver kernel: __do_fault+0x36/0x180
Oct  9 10:37:11 ourserver kernel: __handle_mm_fault+0x1091/0x1ac0
Oct  9 10:37:11 ourserver kernel: handle_mm_fault+0xc4/0x1e0
Oct  9 10:37:11 ourserver kernel: do_user_addr_fault+0x1f6/0x450
Oct  9 10:37:11 ourserver kernel: do_page_fault+0x33/0x120
Oct  9 10:37:11 ourserver kernel: ? page_fault+0x8/0x30
Oct  9 10:37:11 ourserver kernel: page_fault+0x1e/0x30
Oct  9 10:37:11 ourserver kernel: RIP: 0033:0x7f995c93ad50
Oct  9 10:37:11 ourserver kernel: Code: Bad RIP value.
Oct  9 10:37:11 ourserver kernel: RSP: 002b:00007ffc33ddc5f8 EFLAGS: 00010202
Oct  9 10:37:11 ourserver kernel: RAX: 0000000000000000 RBX: 00007f88076fd040 RCX: 000000003ffffff8
Oct  9 10:37:11 ourserver kernel: RDX: 00007f7f50000010 RSI: 0000000000000000 RDI: 00007f88076fd010
Oct  9 10:37:11 ourserver kernel: RBP: 000055d4b1f9e8c0 R08: 00007f7f50000010 R09: 0000000000000000
Oct  9 10:37:11 ourserver kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 000055d4b1ef2860
Oct  9 10:37:11 ourserver kernel: R13: 0000000000000004 R14: 00007ffc33ddc6b0 R15: 00007f8b4ffffe90
Oct  9 10:37:11 ourserver kernel: Mem-Info:
Oct  9 10:37:11 ourserver kernel: active_anon:15200764 inactive_anon:895460 isolated_anon:0#012 active_file:0 inactive_file:20 isolated_file:0#012 unevictable:0 dirty:0 writeback:57 unstable:0#012 slab_reclaimable:23356 slab_unreclaimable:50962#012 mapped:0 shmem:2 pagetables:44484 bounce:0#012 free:56603 free_pcp:0 free_cma:0
Oct  9 10:37:11 ourserver kernel: Node 0 active_anon:30068504kB inactive_anon:1771616kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Oct  9 10:37:11 ourserver kernel: Node 1 active_anon:30734552kB inactive_anon:1810224kB active_file:92kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:228kB shmem:8kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Oct  9 10:37:11 ourserver kernel: Node 0 DMA free:15880kB min:20kB low:32kB high:44kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct  9 10:37:11 ourserver kernel: lowmem_reserve[]: 0 1396 31576 31576 31576
Oct  9 10:37:11 ourserver kernel: Node 0 DMA32 free:122688kB min:1968kB low:3396kB high:4824kB active_anon:1322012kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1521664kB managed:1456124kB mlocked:0kB kernel_stack:16kB pagetables:2584kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct  9 10:37:11 ourserver kernel: lowmem_reserve[]: 0 0 30180 30180 30180
Oct  9 10:37:11 ourserver kernel: Node 0 Normal free:42556kB min:42596kB low:73500kB high:104404kB active_anon:28746492kB inactive_anon:1771616kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:31457280kB managed:30912644kB mlocked:0kB kernel_stack:3096kB pagetables:86508kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct  9 10:37:11 ourserver kernel: lowmem_reserve[]: 0 0 0 0 0
Oct  9 10:37:11 ourserver kernel: Node 1 Normal free:45288kB min:45520kB low:78548kB high:111576kB active_anon:30734552kB inactive_anon:1810224kB active_file:0kB inactive_file:520kB unevictable:0kB writepending:228kB present:33554432kB managed:33028052kB mlocked:0kB kernel_stack:2648kB pagetables:88844kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct  9 10:37:11 ourserver kernel: lowmem_reserve[]: 0 0 0 0 0
Oct  9 10:37:11 ourserver kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Oct  9 10:37:11 ourserver kernel: Node 0 DMA32: 102*4kB (UE) 87*8kB (UE) 79*16kB (UME) 46*32kB (UME) 37*64kB (UME) 36*128kB (UE) 31*256kB (UE) 27*512kB (UME) 22*1024kB (UME) 7*2048kB (UME) 13*4096kB (UE) = 122688kB
Oct  9 10:37:11 ourserver kernel: Node 0 Normal: 198*4kB (UME) 70*8kB (UME) 32*16kB (UME) 81*32kB (UME) 77*64kB (UME) 55*128kB (UME) 17*256kB (UM) 0*512kB 20*1024kB (M) 1*2048kB (M) 0*4096kB = 43304kB
Oct  9 10:37:11 ourserver kernel: Node 1 Normal: 229*4kB (UME) 212*8kB (UME) 56*16kB (UME) 38*32kB (UME) 143*64kB (UME) 75*128kB (UME) 1*256kB (E) 0*512kB 18*1024kB (M) 2*2048kB (M) 0*4096kB = 46260kB
Oct  9 10:37:11 ourserver kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct  9 10:37:11 ourserver kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct  9 10:37:11 ourserver kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct  9 10:37:11 ourserver kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct  9 10:37:11 ourserver kernel: 4872 total pagecache pages
Oct  9 10:37:11 ourserver kernel: 4792 pages in swap cache
Oct  9 10:37:11 ourserver kernel: Swap cache stats: add 4553406660, delete 4552431434, find 4529567/9839697
Oct  9 10:37:11 ourserver kernel: Free swap  = 0kB
Oct  9 10:37:11 ourserver kernel: Total swap = 23441404kB
Oct  9 10:37:11 ourserver kernel: 16637339 pages RAM
Oct  9 10:37:11 ourserver kernel: 0 pages HighMem/MovableOnly
Oct  9 10:37:11 ourserver kernel: 284160 pages reserved
Oct  9 10:37:11 ourserver kernel: 0 pages cma reserved
Oct  9 10:37:11 ourserver kernel: 0 pages hwpoisoned
Oct  9 10:37:11 ourserver kernel: Tasks state (memory values in pages):
Oct  9 10:37:11 ourserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Oct  9 10:37:11 ourserver kernel: [    852]     0   852    29694       16   200704      425             0 systemd-journal
Oct  9 10:37:11 ourserver kernel: [    884]     0   884     7957        0    81920      645         -1000 systemd-udevd
Oct  9 10:37:11 ourserver kernel: [   1280]    32  1280     3078        0    61440      585             0 rpcbind
Oct  9 10:37:11 ourserver kernel: [   1289]     0  1289     1560        0    49152      131             0 rpc.idmapd
Oct  9 10:37:11 ourserver kernel: [   1294]     0  1294    19842       33    53248       53             0 irqbalance
Oct  9 10:37:11 ourserver kernel: [   1296]     0  1296    78035        0   106496      264             0 rngd
Oct  9 10:37:11 ourserver kernel: [   1297]     0  1297    78329        0   106496      428             0 ModemManager
Oct  9 10:37:11 ourserver kernel: [   1299]    70  1299     2989        2    57344      208             0 avahi-daemon
Oct  9 10:37:11 ourserver kernel: [   1302]     0  1302      624        0    45056       30             0 mcelog
Oct  9 10:37:11 ourserver kernel: [   1306]     0  1306     2613        0    57344      328             0 smartd
Oct  9 10:37:11 ourserver kernel: [   1308]     0  1308   134606        0   147456      772             0 NetworkManager
Oct  9 10:37:11 ourserver kernel: [   1309]     0  1309   186043        0  1073152      450             0 rsyslogd
Oct  9 10:37:11 ourserver kernel: [   1312]   994  1312    21125        0    65536      228             0 chronyd
Oct  9 10:37:11 ourserver kernel: [   1314]     0  1314   114629        0   122880      486             0 abrtd
Oct  9 10:37:11 ourserver kernel: [   1321]    70  1321     2956        0    53248      172             0 avahi-daemon
Oct  9 10:37:11 ourserver kernel: [   1322]    81  1322    67186        0    61440      201          -900 dbus-broker-lau
Oct  9 10:37:11 ourserver kernel: [   1323]     0  1323    65916        0    90112      181             0 gssproxy
Oct  9 10:37:11 ourserver kernel: [   1330]  1001  1330     1440       10    53248      147             0 lmgrd
Oct  9 10:37:11 ourserver kernel: [   1331]  1001  1331    39782       22    69632      258             0 MLM
Oct  9 10:37:11 ourserver kernel: [   1392]    81  1392     1495        8    49152      285          -900 dbus-broker
Oct  9 10:37:11 ourserver kernel: [   1397]     0  1397   204763       11   774144      469             0 abrt-dump-journ
Oct  9 10:37:11 ourserver kernel: [   1398]     0  1398   202700        3   684032      452             0 abrt-dump-journ
Oct  9 10:37:11 ourserver kernel: [   1399]     0  1399   206779       15   733184      450             0 abrt-dump-journ
Oct  9 10:37:11 ourserver kernel: [   1410]     0  1410    58909      658    90112     3155             0 x2gocleansessio
Oct  9 10:37:11 ourserver kernel: [   1432]   997  1432   482919        0   225280     1302             0 polkitd
Oct  9 10:37:11 ourserver kernel: [   1724]     0  1724     9850        1   118784     6346             0 rpc.ypxfrd
Oct  9 10:37:11 ourserver kernel: [   1726]     0  1726    10851        0   126976     6809             0 ypserv
Oct  9 10:37:11 ourserver kernel: [   1728]     0  1728     2993        0    61440      238         -1000 sshd
Oct  9 10:37:11 ourserver kernel: [   1729]    29  1729    15483        0   159744    12482             0 rpc.statd
Oct  9 10:37:11 ourserver kernel: [   1731]     0  1731     9896        1   118784     6348             0 rpc.yppasswdd
Oct  9 10:37:11 ourserver kernel: [   1742]     0  1742    16616        0   167936    12785             0 rpc.mountd
Oct  9 10:37:11 ourserver kernel: [   1822]     0  1822    89938        0   192512    12732             0 ypbind
Oct  9 10:37:11 ourserver kernel: [   1885]     0  1885     3471       15    65536      513             0 sendmail
Oct  9 10:37:11 ourserver kernel: [   1892]    51  1892     3327        0    69632      407             0 sendmail
Oct  9 10:37:11 ourserver kernel: [   2121]     0  2121    11714        1   135168     6451             0 systemd-logind
Oct  9 10:37:11 ourserver kernel: [   2123]     0  2123   297864        0   303104    13000             0 automount
Oct  9 10:37:11 ourserver kernel: [   2124]     0  2124    59631        0    98304      340             0 cupsd
Oct  9 10:37:11 ourserver kernel: [   2127]     0  2127      910        0    49152       54             0 atd
Oct  9 10:37:11 ourserver kernel: [   2151]     0  2151    54298        8    61440      205             0 crond
Oct  9 10:37:11 ourserver kernel: [   2152]     0  2152     5014        0    81920      444             0 systemd
Oct  9 10:37:11 ourserver kernel: [   2155]     0  2155     6141        0    81920      842             0 (sd-pam)
Oct  9 10:37:11 ourserver kernel: [   2423]     0  2423    53864        0    45056       34             0 agetty
Oct  9 10:37:11 ourserver kernel: [  21827]     0 21827    68897        0   167936    12630             0 crond
Oct  9 10:37:11 ourserver kernel: [  21831]     0 21831    54095        1    57344       59             0 freshclam-sleep
Oct  9 10:37:11 ourserver kernel: [  21833]     0 21833    53742        0    57344       22             0 sleep
Oct  9 10:37:11 ourserver kernel: [  25206]     0 25206    69185        0   172032    12685             0 crond
Oct  9 10:37:11 ourserver kernel: [  25229]     0 25229    54095        0    57344       56             0 sh
Oct  9 10:37:11 ourserver kernel: [  25230]     0 25230    54095        1    61440       58             0 cronic
Oct  9 10:37:11 ourserver kernel: [  25232]     0 25232    56910        0    73728     1976             0 rsnapshot
Oct  9 10:37:11 ourserver kernel: [  25413]     0 25413    54095        1    61440       57             0 rsync-no-vanish
Oct  9 10:37:11 ourserver kernel: [  25414]     0 25414   108933      146   495616    40680             0 rsync
Oct  9 10:37:11 ourserver kernel: [  25415]     0 25415    54095        0    57344       59             0 rsync-no-vanish
Oct  9 10:37:11 ourserver kernel: [  25416]     0 25416    53935        0    53248       41             0 grep
Oct  9 10:37:11 ourserver kernel: [  25417]     0 25417    56089       31    73728      812             0 ssh
Oct  9 10:37:11 ourserver kernel: [  25418]     0 25418   108725      824   491520    32552             0 rsync
Oct  9 10:37:11 ourserver kernel: [  26670]     0 26670    16711        0   172032    12615             0 sshd
Oct  9 10:37:11 ourserver kernel: [  26724]  6105 26724    16711       21   172032    12590             0 sshd
Oct  9 10:37:11 ourserver kernel: [  26731]  6105 26731    62595        0   118784     6649             0 bash
Oct  9 10:37:11 ourserver kernel: [  26842]  6105 26842    69011        0   172032    12524             0 su
Oct  9 10:37:11 ourserver kernel: [  26862]     0 26862    54439       13    65536      410             0 bash
Oct  9 10:37:11 ourserver kernel: [  27603]     0 27603 27306114 16088978 174247936  5635073             0 (emotectl)
Oct  9 10:37:11 ourserver kernel: [  27754]     0 27754    53831        9    53248       59             0 less
Oct  9 10:37:11 ourserver kernel: [  27774]     0 27774      110        1    32768        0             0 x2golistsession
Oct  9 10:37:11 ourserver kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/init.scope,task=(emotectl),pid=27603,uid=0
Oct  9 10:37:11 ourserver kernel: Out of memory: Killed process 27603 ((emotectl)) total-vm:109224456kB, anon-rss:64355912kB, file-rss:0kB, shmem-rss:0kB
Oct  9 10:37:13 ourserver kernel: oom_reaper: reaped process 27603 ((emotectl)), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Oct  9 10:37:14 ourserver systemd[1]: cockpit.service: Control process exited, code=killed, status=9/KILL
Oct  9 10:37:14 ourserver systemd[1]: cockpit.service: Failed with result 'timeout'.
Oct  9 10:37:14 ourserver systemd[1]: Failed to start Cockpit Web Service.
Oct  9 10:37:14 ourserver systemd[1]: Starting Cockpit Web Service..

If it helps to know we still use NIS and this is a backup server so rsnapshot/rsync runs throughout the day.

And at the end of journalctl -u cockpit are just these entries:

Oct 09 10:34:17  systemd[1]: Starting Cockpit Web Service...
Oct 09 10:35:48  systemd[1]: cockpit.service: Start-pre operation timed out. Terminating.
Oct 09 10:37:14  systemd[1]: cockpit.service: Control process exited, code=killed, status=9/KILL
Oct 09 10:37:14  systemd[1]: cockpit.service: Failed with result 'timeout'.
Oct 09 10:37:14  systemd[1]: Failed to start Cockpit Web Service.

After uninstalling, it's been 5 hours without an OOM, when there were 2-3 an hour.

martinpitt commented 5 years ago

Thanks for your report! So remotectl eats away all your memory, but the log doesn't say why. As this is specific to your machine and not reproducible on others (at least not without further information), could you at least temporarily install cockpit-ws again? If you don't enable cockpit.socket, it will not actually do anything. Merely installing the package won't enable it, but just in case, please make sure that systemctl is-enabled cockpit.socket says "disabled".

After that, let's first try to run this as normal user in a temporary directory, which is safer:

mkdir -p /tmp/x/cockpit/ws-certs.d
G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure

What's the output of this? It should normally take only a few seconds. It fails because of a permission error at the end, but that's fine. If it goes haywire, please Control-C it.

The expected output is something like

** INFO: 12:59:09.778: Generating temporary certificate using: sscg --quiet --lifetime 3650 --key-strength 2048 --cert-key-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --cert-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --ca-file /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem --hostname donald --organization 607e9444bd2e4594ab570d4df4bd766a --subject-alt-name localhost --subject-alt-name IP:127.0.0.1/255.255.255.255
(remotectl:22394): GLib-DEBUG: 12:59:09.779: posix_spawn avoided (fd close requested) 
(remotectl:22394): GLib-GIO-DEBUG: 12:59:11.095: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:22394): DEBUG: 12:59:11.095: loaded 1 certificates from /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
remotectl: couldn't set certificate ownership: /tmp/x/cockpit/ws-certs.d/0-self-signed.cert: Operation not permitted

Afterwards there should be a 0-self-signed-ca.pem and 0-self-signed.cert in ls -l /tmp/x/cockpit/. Can you please copy&paste the output of that as well?

Thanks!

SomePersonSomeWhereInTheWorld commented 5 years ago

So remotectl eats away all your memory

The logs refer to it as emotectl, is that intentional?

systemctl is-enabled cockpit.socket says "disabled".

systemctl status cockpit.socket
● cockpit.socket - Cockpit Web Service Socket
   Loaded: loaded (/usr/lib/systemd/system/cockpit.socket; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:cockpit-ws(8)
   Listen: [::]:9090 (Stream)

Oct 04 13:28:28 systemd[1]: Starting Cockpit Web Service Socket.
Oct 04 13:28:28 systemd[1]: Listening on Cockpit Web Service Socket.
Oct 09 12:27:46 systemd[1]: Stopping Cockpit Web Service Socket.
Oct 09 12:27:46 systemd[1]: cockpit.socket: Succeeded.
Oct 09 12:27:46 systemd[1]: Closed Cockpit Web Service Socket.

G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure

$ G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure
** INFO: 09:40:42.871: Generating temporary certificate using: sscg --quiet --lifetime 3650 --key-strength 2048 --cert-key-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --cert-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --ca-file /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem --hostname ourdomain.edu --organization 07f397410f92404a9ae1e45b67b62b5f --subject-alt-name localhost --subject-alt-name IP:127.0.0.1/255.255.255.255
(remotectl:10491): GLib-DEBUG: 09:40:42.872: posix_spawn avoided (fd close requested) 
(remotectl:10491): GLib-GIO-DEBUG: 09:40:43.544: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:10491): DEBUG: 09:40:43.544: loaded 1 certificates from /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
remotectl: couldn't set certificate ownership: /tmp/x/cockpit/ws-certs.d/0-self-signed.cert: Operation not permitted
ls -l /tmp/x/cockpit/
total 0
drwxrwxr-x 2 localguy localguy 80 Oct 10 09:40 ws-certs.d
ls -l /tmp/x/cockpit/ws-certs.d/0-self-signed*
-rw-r--r-- 1 localguy localguy 2199 Oct 10 09:40 /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem
-rw------- 1 localguy localguy 3436 Oct 10 09:40 /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
martinpitt commented 5 years ago

The logs refer to it as emotectl, is that intentional?

Not really by design, but it's a quirk how Linux works. It puts parentheses around died processes in /proc/pid/comm, and since it can't change the string size, the first and last character gets chopped off.

So it seems that remotectl invocation as user went fine? Next step, can you please check if you have a system-wide cockpit certificate? You should have if you ever actually used cockpit. I. e sudo ls -l /etc /cockpit/ws-certs.d/. There should be the same two files you previously had in /tmp/x/.

After you captured this, please run

sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t

copy&paste the output, and observe how long this takes. With existing certs, it should be instantanous, and say "loaded 1 certificates". Otherwise it should generate new ones like in the experiment above.

SomePersonSomeWhereInTheWorld commented 5 years ago

sudo ls -l /etc /cockpit/ws-certs.d/

-rw-r--r--. 1 root root       2216 Oct  1  2018 0-self-signed-ca.pem
-rw-r-----. 1 root cockpit-ws 3452 Oct  1  2018 0-self-signed.cert

sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t

observe how long this takes

Instantly:

sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t
(remotectl:14592): GLib-GIO-DEBUG: 10:11:54.844: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:14592): DEBUG: 10:11:54.844: loaded 1 certificates from /etc/cockpit/ws-certs.d/0-self-signed.cert
(remotectl:14592): GLib-DEBUG: 10:11:54.845: posix_spawn avoided (fd close requested) 
martinpitt commented 5 years ago

Hmm.. So do you still get the rogue process and OOM if you do sudo systemctl start cockpit? That should also return instantly, if it hangs, then please check top/ps whether remotectl is acting up. If it does, sudo systemctl status cockpit would be useful.

SomePersonSomeWhereInTheWorld commented 5 years ago

So do you still get the rogue process and OOM if you do sudo systemctl start cockpit? That should also return instantly

It hangs. And yes another OOM.

sudo systemctl start cockpit
Job for cockpit.service failed because a timeout was exceeded.
See "systemctl status cockpit.service" and "journalctl -xe" for details.

 systemctl status cockpit.service
● cockpit.service - Cockpit Web Service
   Loaded: loaded (/usr/lib/systemd/system/cockpit.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Thu 2019-10-10 10:30:03 EDT; 18s ago
     Docs: man:cockpit-ws(8)
  Process: 16908 ExecStartPre=/usr/sbin/remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t (code=killed, signal=KILL)

Oct 10 10:27:18 systemd[1]: Starting Cockpit Web Service...
Oct 10 10:28:48 systemd[1]: cockpit.service: Start-pre operation timed out. Terminating.
Oct 10 10:30:03 systemd[1]: cockpit.service: Control process exited, code=killed, status=9/KILL
Oct 10 10:30:03 systemd[1]: cockpit.service: Failed with result 'timeout'.
Oct 10 10:30:03 systemd[1]: Failed to start Cockpit Web Service.

Perhaps you can tell my how to force a coredump and run gdb backtrace on it?

SomePersonSomeWhereInTheWorld commented 5 years ago

I had to disable Cockpit again. I was able to capture a strace of the process if that helps:

strace: Process 18368 attached
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getpid()                                = 18368
setsockopt(5, SOL_IP, IP_RECVERR, [1], 4) = 0
ioctl(5, FIONBIO, [1])                  = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
sendto(5, "]\241\325\265\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\0"..., 92, 0, {sa_family=AF_INET, sin_port=htons(111), sin_addr=inet_addr("150.108.64.52")}, 16) = 92
poll([{fd=5, events=POLLIN}], 1, 15000) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "]\241\325\265\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\023150."..., 8800, 0, NULL, NULL) = 48
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
close(5)                                = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getpid()                                = 18368
bind(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getpid()                                = 18368
setsockopt(5, SOL_IP, IP_RECVERR, [1], 4) = 0
ioctl(5, FIONBIO, [1])                  = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(4)                                = 0
close(3)                                = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
close(5)                                = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
openat(AT_FDCWD, "/etc/netconfig", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=767, ...}) = 0
read(3, "#\n# The network configuration fi"..., 4096) = 767
openat(AT_FDCWD, "/etc/services", O_RDONLY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_CUR)                   = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=692323, ...}) = 0
read(4, "# /etc/services:\n# $Id: services"..., 4096) = 4096
lseek(4, 0, SEEK_CUR)                   = 4096
read(4, "deWeb HTTP\nhttp            80/ud"..., 4096) = 4096
close(4)                                = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=101, ...}) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_CUR)                   = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=353, ...}) = 0
read(4, "127.0.0.1   localhost localhost."..., 4096) = 353
lseek(4, 0, SEEK_CUR)                   = 353
read(4, "", 4096)                       = 0
close(4)                                = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
getsockname(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(4, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getpid()                                = 18368
bind(4, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f25d9e46000
mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f19d9e46000
kurahaupo commented 4 years ago

The last thing the process is doing is binding a listening socket to port 736, and then what appears to be a single malloc request for 48 GiB, which is crazily huge.

That sounds suspiciously close to ¾ of RAM , which is the default setting for Slice.MemoryHigh in src/ws/system-cockpithttps.slice. However I cannot see anywhere that references that limit, so it's unclear whether changing it would help (MemoryHigh did not exist when this bug was reported; it was subsequently added by 81b665811501e6cbd5f77df4ebeda37405ea65b2.)

If you have 64 GiB of RAM, I would suggest:

  1. turn off (or highly restrict) over-commit in the kernel, so that requests for memory will be rejected (malloc() returns NULL) rather than later being OOM-killed.
  2. expand swap space to be more than the RAM as you really have, so that the occasional fork of a very large process won't fail. (Or you may decide that you want such processes to fail, and not do this.)

This might be useful reading: https://www.percona.com/blog/2019/08/02/out-of-memory-killer-or-savior/

KKoukiou commented 3 years ago

@allisonkarlitskaya @martinpitt there is a stacktrace of the problematic process, maybe you can figure this out easier now?

Just pinging because I feel this got lost in the bottom of the issue list.

martinpitt commented 3 years ago

@KKoukiou : I did look at it, but strace doesn't help -- this isn't related (apparently) to some syscall going crazy. My gut feeling is that somewhere inside gnutls a huge memory allocation happens, and computation and allocs don't appear in strace (as they are not syscalls). However, there is a really good chance that release 242 fixed that with dropping remotectl from the unit (PR #15608).

@RobbieTheK and @kurahaupo , do you have a chance to test this with 242? This is in Fedora 33 and 34 now.

SomePersonSomeWhereInTheWorld commented 3 years ago

No more OOM with cockpit-242-1.fc33.x86_64. It had been disabled but I did see these logs:

Apr 19 21:03:58 ourdomain.edu cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.

Perhaps that's a different issue.

martinpitt commented 3 years ago

@RobbieTheK Thanks for checking! Is that actually breaking something, or just some noise from the initial browser complaint about the self-signed certificate? It usually looks more like "gnutls_handshake failed: a fatal TLS error was received" (like in issue #14896), but that may be browser specific.

SomePersonSomeWhereInTheWorld commented 3 years ago

usually looks more like "gnutls_handshake failed: a fatal TLS error was received"

I think I see what caused this one. A Qualys pen test, here are some logs:

Apr 19 21:00:12 ourworkstation systemd[1]: Starting Cockpit Web Service...
Apr 19 21:00:12 ourworkstation systemd[1]: sysstat-collect.service: Succeeded.
Apr 19 21:00:12 ourworkstation systemd[1]: Finished system activity accounting tool.
Apr 19 21:00:12 ourworkstation systemd[1]: Started Cockpit Web Service.
Apr 19 21:00:13 ourworkstation systemd[1]: Started Cockpit Web Service http-redirect instance.
Apr 19 21:00:13 ourworkstation journal[3542228]: received invalid HTTP request line
Apr 19 21:00:43 ourworkstation journal[3542228]: received invalid HTTP request line
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: The TLS connection was non-properly terminated.
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.
Apr 19 21:01:34 ourworkstation kernel: net_ratelimit: 6 callbacks suppressed
Apr 19 21:01:34 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1509949440
Apr 19 21:00:51 ourworkstation systemd[1]: Started Cockpit Web Service https instance factory (PID 3542226/UID 970).
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.
Apr 19 21:00:51 ourworkstation journal[3542228]: Received unexpected TLS connection and no certificate was configured
Apr 19 21:00:51 ourworkstation systemd[1]: Started Cockpit Web Service https instance factory (PID 3542226/UID 970).
Apr 19 21:01:41 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:01:42 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 305397761
Apr 19 21:01:42 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1229866575
Apr 19 21:01:43 ourworkstation systemd[1]: cockpit-wsinstance-http-redirect.service: Succeeded.
Apr 19 21:01:50 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1090519040
Apr 19 21:01:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:01:52 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:02:06 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 469762048
Apr 19 21:02:09 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:11 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:02:14 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1247096314
Apr 19 21:02:14 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 369295616
Apr 19 21:02:17 ourworkstation journal[3542357]: received HTTP request without Host header
Apr 19 21:02:21 ourworkstation systemd[1]: cockpit-wsinstance-https@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.service: Succeeded.
Apr 19 21:02:21 ourworkstation systemd[1]: cockpit-wsinstance-https@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.socket: Succeeded.
Apr 19 21:02:21 ourworkstation systemd[1]: Closed Socket for Cockpit Web Service https instance e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.

And then from /var/log/secure:

Apr 19 21:00:09 ourworkstation sshd[3542210]: pam_sss(sshd:auth): Request to sssd failed. Connection refused
Apr 19 21:00:11 ourworkstation sshd[3542210]: Failed password for s-qualys from qu.al.ys.ip port 38587 ssh2
Apr 19 21:00:13 ourworkstation sshd[3542210]: Connection closed by authenticating user s-qualys qu.al.ys.ip port 38587 [preauth]
Apr 19 21:00:13 ourworkstation sshd[3542232]: Connection closed by authenticating user s-qualys qu.al.ys.ip port 38859 [preauth]
Apr 19 21:01:10 ourworkstation sshd[3542373]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:10 ourworkstation sshd[3542373]: Connection reset by qu.al.ys.ip port 42639
Apr 19 21:01:27 ourworkstation sshd[3542422]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:27 ourworkstation sshd[3542422]: Connection reset by qu.al.ys.ip port 33658
Apr 19 21:01:35 ourworkstation sshd[3542460]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:35 ourworkstation sshd[3542460]: Connection reset by qu.al.ys.ip port 48636
Apr 19 21:01:47 ourworkstation sshd[3542537]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:47 ourworkstation sshd[3542537]: Connection reset by qu.al.ys.ip port 56166
Apr 19 21:04:10 ourworkstation sshd[3550610]: Invalid user NoSuchUser from qu.al.ys.ip port 44074

Perhaps there may be a way to suppress or hide these logs from 'normal' and just in 'debug' mode? I guess that'd be a feature request.

martinpitt commented 3 years ago

Yes, I think I want to hide these. This is tracked in #14896 already.

So I am closing this one now, as it was fixed with PR #15608. Thanks for confirming!

kurahaupo commented 3 years ago

@martinpitt sorry I don't have cockpit installed. I will look into it when I do.

I would agree on your point about a library making a huge allocation; probably some sizing calculation gone wrong. In case I didn't make it clear before, that huge allocation was for exactly 48GiB (0xc00000000), which seems suspicious to me.