Closed SomePersonSomeWhereInTheWorld closed 3 years ago
Thanks for your report! So remotectl eats away all your memory, but the log doesn't say why. As this is specific to your machine and not reproducible on others (at least not without further information), could you at least temporarily install cockpit-ws again? If you don't enable cockpit.socket, it will not actually do anything. Merely installing the package won't enable it, but just in case, please make sure that systemctl is-enabled cockpit.socket says "disabled".
After that, let's first try to run this as a normal user in a temporary directory, which is safer:
mkdir -p /tmp/x/cockpit/ws-certs.d
G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure
What's the output of this? It should normally take only a few seconds. It fails because of a permission error at the end, but that's fine. If it goes haywire, please Control-C it.
The expected output is something like
** INFO: 12:59:09.778: Generating temporary certificate using: sscg --quiet --lifetime 3650 --key-strength 2048 --cert-key-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --cert-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --ca-file /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem --hostname donald --organization 607e9444bd2e4594ab570d4df4bd766a --subject-alt-name localhost --subject-alt-name IP:127.0.0.1/255.255.255.255
(remotectl:22394): GLib-DEBUG: 12:59:09.779: posix_spawn avoided (fd close requested)
(remotectl:22394): GLib-GIO-DEBUG: 12:59:11.095: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:22394): DEBUG: 12:59:11.095: loaded 1 certificates from /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
remotectl: couldn't set certificate ownership: /tmp/x/cockpit/ws-certs.d/0-self-signed.cert: Operation not permitted
Afterwards there should be a 0-self-signed-ca.pem and 0-self-signed.cert in ls -l /tmp/x/cockpit/. Can you please copy&paste the output of that as well?
Thanks!
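If you would rather not rely on Control-C, the test run can be wrapped so that it cannot run away. This is only a sketch: run_limited is a hypothetical helper name, and the 60 second and 4 GiB limits used in the usage note are arbitrary assumptions, not values from cockpit. It uses the shell's ulimit -v (virtual memory cap in KiB) and coreutils timeout.

```shell
# run_limited SECONDS MAX_KIB COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with a wall-clock limit and a virtual memory cap.
run_limited() {
    secs=$1
    max_kib=$2
    shift 2
    (
        # The cap applies to this subshell and its children only.
        ulimit -v "$max_kib" || exit 1
        # coreutils timeout terminates the command after $secs seconds.
        exec timeout "$secs" "$@"
    )
}

# Harmless demonstration with a command that exits immediately.
run_limited 5 1048576 true && echo "finished within limits"
```

For the actual experiment this would be something like run_limited 60 4194304 env G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure. With the cap in place, an oversized allocation should fail inside the process instead of waking the OOM killer.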
So remotectl eats away all your memory
The logs refer to it as emotectl, is that intentional?
systemctl is-enabled cockpit.socket says "disabled".
systemctl status cockpit.socket
● cockpit.socket - Cockpit Web Service Socket
Loaded: loaded (/usr/lib/systemd/system/cockpit.socket; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:cockpit-ws(8)
Listen: [::]:9090 (Stream)
Oct 04 13:28:28 systemd[1]: Starting Cockpit Web Service Socket.
Oct 04 13:28:28 systemd[1]: Listening on Cockpit Web Service Socket.
Oct 09 12:27:46 systemd[1]: Stopping Cockpit Web Service Socket.
Oct 09 12:27:46 systemd[1]: cockpit.socket: Succeeded.
Oct 09 12:27:46 systemd[1]: Closed Cockpit Web Service Socket.
G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure
$ G_MESSAGES_DEBUG=all XDG_CONFIG_DIRS=/tmp/x remotectl certificate --ensure
** INFO: 09:40:42.871: Generating temporary certificate using: sscg --quiet --lifetime 3650 --key-strength 2048 --cert-key-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --cert-file /tmp/x/cockpit/ws-certs.d/0-self-signed.cert --ca-file /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem --hostname ourdomain.edu --organization 07f397410f92404a9ae1e45b67b62b5f --subject-alt-name localhost --subject-alt-name IP:127.0.0.1/255.255.255.255
(remotectl:10491): GLib-DEBUG: 09:40:42.872: posix_spawn avoided (fd close requested)
(remotectl:10491): GLib-GIO-DEBUG: 09:40:43.544: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:10491): DEBUG: 09:40:43.544: loaded 1 certificates from /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
remotectl: couldn't set certificate ownership: /tmp/x/cockpit/ws-certs.d/0-self-signed.cert: Operation not permitted
ls -l /tmp/x/cockpit/
total 0
drwxrwxr-x 2 localguy localguy 80 Oct 10 09:40 ws-certs.d
ls -l /tmp/x/cockpit/ws-certs.d/0-self-signed*
-rw-r--r-- 1 localguy localguy 2199 Oct 10 09:40 /tmp/x/cockpit/ws-certs.d/0-self-signed-ca.pem
-rw------- 1 localguy localguy 3436 Oct 10 09:40 /tmp/x/cockpit/ws-certs.d/0-self-signed.cert
The logs refer to it as emotectl, is that intentional?
Not really by design, but it's a quirk of how Linux works. It puts parentheses around dead processes in /proc/pid/comm, and since it can't change the string size, the first and last characters get chopped off.
So it seems that the remotectl invocation as user went fine? Next step: can you please check if you have a system-wide cockpit certificate? You should have one if you ever actually used cockpit, i.e. sudo ls -l /etc/cockpit/ws-certs.d/. There should be the same two files you previously had in /tmp/x/.
After you captured this, please run
sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t
copy&paste the output, and observe how long this takes. With existing certs, it should be instantaneous, and say "loaded 1 certificates". Otherwise it should generate new ones like in the experiment above.
sudo ls -l /etc/cockpit/ws-certs.d/
-rw-r--r--. 1 root root 2216 Oct 1 2018 0-self-signed-ca.pem
-rw-r-----. 1 root cockpit-ws 3452 Oct 1 2018 0-self-signed.cert
sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t
observe how long this takes
Instantly:
sudo G_MESSAGES_DEBUG=all remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t
(remotectl:14592): GLib-GIO-DEBUG: 10:11:54.844: _g_io_module_get_default: Found default implementation gnutls (GTlsBackendGnutls) for ?gio-tls-backend?
** (remotectl:14592): DEBUG: 10:11:54.844: loaded 1 certificates from /etc/cockpit/ws-certs.d/0-self-signed.cert
(remotectl:14592): GLib-DEBUG: 10:11:54.845: posix_spawn avoided (fd close requested)
Hmm.. So do you still get the rogue process and OOM if you do sudo systemctl start cockpit? That should also return instantly; if it hangs, then please check top/ps whether remotectl is acting up. If it does, sudo systemctl status cockpit would be useful.
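As an alternative to eyeballing top, memory growth can be sampled straight from /proc. A minimal sketch, assuming a Linux /proc filesystem; rss_kib and watch_rss are hypothetical helper names, and the demonstration watches the current shell rather than remotectl.

```shell
# Print the resident set size (in KiB) of a PID; prints nothing if the process is gone.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# watch_rss PID SAMPLES: print one RSS reading per second.
watch_rss() {
    pid=$1
    samples=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        rss=$(rss_kib "$pid")
        if [ -z "$rss" ]; then
            echo "process $pid has exited"
            return 0
        fi
        echo "pid=$pid rss=${rss} KiB"
        i=$((i + 1))
        sleep 1
    done
}

# Demonstration on the current shell; substitute the PID of the suspect process.
watch_rss "$$" 2
```

While systemctl start cockpit hangs, something like watch_rss "$(pgrep -x remotectl)" 30 in a second terminal would show whether the resident size keeps climbing.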
So do you still get the rogue process and OOM if you do sudo systemctl start cockpit? That should also return instantly
It hangs. And yes another OOM.
sudo systemctl start cockpit
Job for cockpit.service failed because a timeout was exceeded.
See "systemctl status cockpit.service" and "journalctl -xe" for details.
systemctl status cockpit.service
● cockpit.service - Cockpit Web Service
Loaded: loaded (/usr/lib/systemd/system/cockpit.service; static; vendor preset: disabled)
Active: failed (Result: timeout) since Thu 2019-10-10 10:30:03 EDT; 18s ago
Docs: man:cockpit-ws(8)
Process: 16908 ExecStartPre=/usr/sbin/remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t (code=killed, signal=KILL)
Oct 10 10:27:18 systemd[1]: Starting Cockpit Web Service...
Oct 10 10:28:48 systemd[1]: cockpit.service: Start-pre operation timed out. Terminating.
Oct 10 10:30:03 systemd[1]: cockpit.service: Control process exited, code=killed, status=9/KILL
Oct 10 10:30:03 systemd[1]: cockpit.service: Failed with result 'timeout'.
Oct 10 10:30:03 systemd[1]: Failed to start Cockpit Web Service.
Perhaps you can tell me how to force a coredump and run gdb backtrace on it?
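One possible way to do that, sketched with gdb and the gcore tool that ships with it. The function names and the /tmp/remotectl-core prefix are hypothetical, and the backtrace is only readable if debug symbols for remotectl and its libraries are installed. Both functions are only defined here; nothing runs until they are called with a PID.

```shell
# backtrace_pid PID: attach to a live process and print a backtrace of every thread.
backtrace_pid() {
    sudo gdb -p "$1" -batch -ex 'thread apply all bt'
}

# core_backtrace PID BINARY: write a core file first, then inspect it offline.
# gcore appends the PID to the prefix given with -o.
core_backtrace() {
    sudo gcore -o /tmp/remotectl-core "$1" &&
        sudo gdb "$2" "/tmp/remotectl-core.$1" -batch -ex 'thread apply all bt'
}
```

With the hung ExecStartPre process from the status output above, that would be backtrace_pid 16908 or core_backtrace 16908 /usr/sbin/remotectl, run before systemd's timeout kills it. Note that a core file of a process holding tens of GiB can itself be very large.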
I had to disable Cockpit again. I was able to capture a strace of the process if that helps:
strace: Process 18368 attached
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getpid() = 18368
setsockopt(5, SOL_IP, IP_RECVERR, [1], 4) = 0
ioctl(5, FIONBIO, [1]) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
sendto(5, "]\241\325\265\0\0\0\0\0\0\0\2\0\1\206\240\0\0\0\4\0\0\0\3\0\0\0\0\0\0\0\0"..., 92, 0, {sa_family=AF_INET, sin_port=htons(111), sin_addr=inet_addr("150.108.64.52")}, 16) = 92
poll([{fd=5, events=POLLIN}], 1, 15000) = 1 ([{fd=5, revents=POLLIN}])
recvfrom(5, "]\241\325\265\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\023150."..., 8800, 0, NULL, NULL) = 48
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
close(5) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getpid() = 18368
bind(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(5, SOL_SOCKET, SO_TYPE, [2], [4]) = 0
getpid() = 18368
setsockopt(5, SOL_IP, IP_RECVERR, [1], 4) = 0
ioctl(5, FIONBIO, [1]) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(4) = 0
close(3) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
close(5) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
openat(AT_FDCWD, "/etc/netconfig", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=767, ...}) = 0
read(3, "#\n# The network configuration fi"..., 4096) = 767
openat(AT_FDCWD, "/etc/services", O_RDONLY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_CUR) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=692323, ...}) = 0
read(4, "# /etc/services:\n# $Id: services"..., 4096) = 4096
lseek(4, 0, SEEK_CUR) = 4096
read(4, "deWeb HTTP\nhttp 80/ud"..., 4096) = 4096
close(4) = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=101, ...}) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_CUR) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=353, ...}) = 0
read(4, "127.0.0.1 localhost localhost."..., 4096) = 353
lseek(4, 0, SEEK_CUR) = 353
read(4, "", 4096) = 0
close(4) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
getsockname(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getsockopt(4, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, [128->16]) = 0
getpid() = 18368
bind(4, {sa_family=AF_INET, sin_port=htons(736), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f25d9e46000
mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f19d9e46000
The last thing the process is doing is binding a listening socket to port 736, and then what appears to be a single malloc request for 48 GiB, which is crazily huge.
That sounds suspiciously close to ¾ of RAM, which is the default setting for Slice.MemoryHigh in src/ws/system-cockpithttps.slice. However, I cannot see anything that references that limit, so it's unclear whether changing it would help. (MemoryHigh did not exist when this bug was reported; it was subsequently added by 81b665811501e6cbd5f77df4ebeda37405ea65b2.)
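The sizes can be checked with plain shell arithmetic. The two byte counts are copied from the mmap() calls in the strace; the 64 GiB of RAM is an assumption about the machine, not something the trace shows.

```shell
# Byte counts taken verbatim from the two mmap() calls in the strace.
first_mmap=4294967296
second_mmap=51539607552
gib=$((1024 * 1024 * 1024))

echo "first mmap:  $((first_mmap / gib)) GiB"
echo "second mmap: $((second_mmap / gib)) GiB"
printf 'second mmap in hex: 0x%x\n' "$second_mmap"

# Assumption: the machine has 64 GiB of RAM.
ram_gib=64
echo "3/4 of ${ram_gib} GiB: $((ram_gib * 3 / 4)) GiB"
```

This gives 4 GiB for the first mapping and 48 GiB (0xc00000000) for the second, which matches ¾ of 64 GiB exactly.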
If you have 64 GiB of RAM, I would suggest:
... (malloc() returns NULL) rather than later being OOM-killed.
... fork of a very large process won't fail. (Or you may decide that you want such processes to fail, and not do this.)
This might be useful reading: https://www.percona.com/blog/2019/08/02/out-of-memory-killer-or-savior/
@allisonkarlitskaya @martinpitt there is a stacktrace of the problematic process, maybe you can figure this out easier now?
Just pinging because I feel this got lost in the bottom of the issue list.
@KKoukiou : I did look at it, but strace doesn't help -- this isn't related (apparently) to some syscall going crazy. My gut feeling is that somewhere inside gnutls a huge memory allocation happens, and computation and allocs don't appear in strace (as they are not syscalls). However, there is a really good chance that release 242 fixed that with dropping remotectl from the unit (PR #15608).
@RobbieTheK and @kurahaupo , do you have a chance to test this with 242? This is in Fedora 33 and 34 now.
No more OOM with cockpit-242-1.fc33.x86_64. It had been disabled but I did see these logs:
Apr 19 21:03:58 ourdomain.edu cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.
Perhaps that's a different issue.
@RobbieTheK Thanks for checking! Is that actually breaking something, or just some noise from the initial browser complaint about the self-signed certificate? It usually looks more like "gnutls_handshake failed: a fatal TLS error was received" (like in issue #14896), but that may be browser specific.
usually looks more like "gnutls_handshake failed: a fatal TLS error was received"
I think I see what caused this one. A Qualys pen test, here are some logs:
Apr 19 21:00:12 ourworkstation systemd[1]: Starting Cockpit Web Service...
Apr 19 21:00:12 ourworkstation systemd[1]: sysstat-collect.service: Succeeded.
Apr 19 21:00:12 ourworkstation systemd[1]: Finished system activity accounting tool.
Apr 19 21:00:12 ourworkstation systemd[1]: Started Cockpit Web Service.
Apr 19 21:00:13 ourworkstation systemd[1]: Started Cockpit Web Service http-redirect instance.
Apr 19 21:00:13 ourworkstation journal[3542228]: received invalid HTTP request line
Apr 19 21:00:43 ourworkstation journal[3542228]: received invalid HTTP request line
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: The TLS connection was non-properly terminated.
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.
Apr 19 21:01:34 ourworkstation kernel: net_ratelimit: 6 callbacks suppressed
Apr 19 21:01:34 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1509949440
Apr 19 21:00:51 ourworkstation systemd[1]: Started Cockpit Web Service https instance factory (PID 3542226/UID 970).
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: A packet with illegal or unsupported version was received.
Apr 19 21:00:51 ourworkstation journal[3542228]: Received unexpected TLS connection and no certificate was configured
Apr 19 21:00:51 ourworkstation systemd[1]: Started Cockpit Web Service https instance factory (PID 3542226/UID 970).
Apr 19 21:01:41 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:01:42 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 305397761
Apr 19 21:01:42 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1229866575
Apr 19 21:01:43 ourworkstation systemd[1]: cockpit-wsinstance-http-redirect.service: Succeeded.
Apr 19 21:01:50 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1090519040
Apr 19 21:01:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:01:52 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:01 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:02:06 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 469762048
Apr 19 21:02:09 ourworkstation journal[3542357]: received invalid HTTP request line
Apr 19 21:02:11 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:02:14 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 1247096314
Apr 19 21:02:14 ourworkstation kernel: svc: svc_tcp_read_marker nfsd RPC fragment too large: 369295616
Apr 19 21:02:17 ourworkstation journal[3542357]: received HTTP request without Host header
Apr 19 21:02:21 ourworkstation systemd[1]: cockpit-wsinstance-https@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.service: Succeeded.
Apr 19 21:02:21 ourworkstation systemd[1]: cockpit-wsinstance-https@e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.socket: Succeeded.
Apr 19 21:02:21 ourworkstation systemd[1]: Closed Socket for Cockpit Web Service https instance e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.
And then from /var/log/secure:
Apr 19 21:00:09 ourworkstation sshd[3542210]: pam_sss(sshd:auth): Request to sssd failed. Connection refused
Apr 19 21:00:11 ourworkstation sshd[3542210]: Failed password for s-qualys from qu.al.ys.ip port 38587 ssh2
Apr 19 21:00:13 ourworkstation sshd[3542210]: Connection closed by authenticating user s-qualys qu.al.ys.ip port 38587 [preauth]
Apr 19 21:00:13 ourworkstation sshd[3542232]: Connection closed by authenticating user s-qualys qu.al.ys.ip port 38859 [preauth]
Apr 19 21:01:10 ourworkstation sshd[3542373]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:10 ourworkstation sshd[3542373]: Connection reset by qu.al.ys.ip port 42639
Apr 19 21:01:27 ourworkstation sshd[3542422]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:27 ourworkstation sshd[3542422]: Connection reset by qu.al.ys.ip port 33658
Apr 19 21:01:35 ourworkstation sshd[3542460]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:35 ourworkstation sshd[3542460]: Connection reset by qu.al.ys.ip port 48636
Apr 19 21:01:47 ourworkstation sshd[3542537]: error: kex_exchange_identification: read: Connection reset by peer
Apr 19 21:01:47 ourworkstation sshd[3542537]: Connection reset by qu.al.ys.ip port 56166
Apr 19 21:04:10 ourworkstation sshd[3550610]: Invalid user NoSuchUser from qu.al.ys.ip port 44074
Perhaps there may be a way to suppress or hide these logs from 'normal' and just in 'debug' mode? I guess that'd be a feature request.
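Until then, a quick way to see how much of this noise each handshake error accounts for is to tally the messages by reason. A minimal sketch; summarize_handshake_failures is a hypothetical helper name, and the sample lines are copied from the log above.

```shell
# Reads journal text on stdin and prints a count per distinct handshake failure reason.
summarize_handshake_failures() {
    sed -n 's/.*gnutls_handshake failed: //p' | sort | uniq -c | sort -rn
}

# Demonstration with three lines from the log above.
summarize_handshake_failures <<'EOF'
Apr 19 21:00:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: The TLS connection was non-properly terminated.
Apr 19 21:01:41 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
Apr 19 21:01:51 ourworkstation cockpit-tls[3542226]: cockpit-tls: gnutls_handshake failed: Decryption has failed.
EOF
```

On a live system the input would come from the journal, for example journalctl -u cockpit | summarize_handshake_failures.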
Yes, I think I want to hide these. This is tracked in #14896 already.
So I am closing this one now, as it was fixed with PR #15608. Thanks for confirming!
@martinpitt sorry I don't have cockpit installed. I will look into it when I do.
I would agree on your point about a library making a huge allocation; probably some sizing calculation gone wrong. In case I didn't make it clear before, that huge allocation was for exactly 48GiB (0xc00000000), which seems suspicious to me.
I created a bug upstream for Fedora 30. I had to uninstall all the Cockpit services. Even after I ran systemctl disable cockpit, it would restart, or try to restart after a reboot. A couple of times an hour the emotectl service would get killed by the out-of-memory killer. I ran the Dell hardware diagnostics a couple of times, i.e., in the iDRAC, and found no errors.
If it helps to know, we still use NIS, and this is a backup server, so rsnapshot/rsync runs throughout the day.
And at the end of journalctl -u cockpit are just these entries:
After uninstalling, it's been 5 hours without an OOM, when there were 2-3 an hour.