Open pokotilenko opened 4 years ago
I caught redis-server in Uninterruptible Sleep (D) state (blocked waiting for I/O) with a script that ran for a few hours:
while true; do grep -vE '\(redis-server\) (S|R)' /proc/20188/stat; sleep 0.005; done
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
Given that pidof is known to skip such processes, this is definitely the cause, and it is a problem.
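For anyone who wants to read the state without grepping the whole stat line, here is a minimal sketch. The function names `state_from_stat` and `proc_state` are made up for illustration. It relies on the state letter being the first field after the parenthesised command name in /proc/PID/stat, and it strips up to the last ") " because the command name itself may contain spaces or parentheses.

```shell
# state_from_stat LINE: print the state letter from one /proc/PID/stat line.
# (Hypothetical helper name.) Everything up to the LAST ") " is removed so
# that a command name containing spaces or ")" cannot shift the fields.
state_from_stat() {
    sfs_rest="${1##*) }"
    printf '%s\n' "${sfs_rest%% *}"
}

# proc_state PID: read /proc/PID/stat and print the state letter.
# Fails if the file cannot be read (for example, the process is gone).
proc_state() {
    read -r ps_line < "/proc/$1/stat" || return 1
    state_from_stat "$ps_line"
}
```

With the PID from this report it would be called as `proc_state 20188`, for example inside the same polling loop.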
I applied this commit as a workaround until an official solution is found: https://github.com/ClusterLabs/resource-agents/pull/616/commits/89a6f431588b7d26d2f90ba7ea48e371df4e453e
What distro/system/environment is this? Because on my Fedora this seems to work just fine:
[root@holtby ~]# ps auxwf |grep uninter
root 769239 0.0 0.0 2200 512 pts/15 D+ 12:01 0:00 | _ ./uninterruptible
root 769240 0.0 0.0 2200 512 pts/15 S+ 12:01 0:00 | _ ./uninterruptible
[root@holtby ~]# pidof uninterruptible
769240 769239
procps-ng-3.3.15 seems to not filter out 'D' processes, at least here.
I also saw no mention of any '-z' in pidof upstream on https://gitlab.com/procps-ng/procps
This is Debian 10 (Buster)
/bin/pidof comes from the sysvinit-utils 2.93-8 package:
https://packages.debian.org/en/buster/sysvinit-utils
https://packages.debian.org/en/buster/amd64/sysvinit-utils/filelist
pidof man pages:
Buster (no -z, but D and Z ignored): https://manpages.debian.org/buster/sysvinit-utils/pidof.8.en.html
Testing (-z present, ignoring of D and Z documented): https://manpages.debian.org/testing/sysvinit-utils/pidof.8.en.html
It seems that in Fedora pidof comes from the procps-ng package, which may have a different implementation/behaviour.
On Debian, procps-ng is packaged as just procps. Its changelog states that:
https://metadata.ftp-master.debian.org/changelogs//main/p/procps/procps_3.3.15-2_changelog
procps (1:3.3.0-1) unstable; urgency=low
But pidof is not supplied with this package in either Buster or Testing: https://packages.debian.org/buster/amd64/procps/filelist
Debian builds procps with "--disable-pidof"
Here is the Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926896
Ah now that explains it, thanks for the additional info. Yeah ideally this is solved at distro-level. Am a bit surprised tbh because changing the semantics of pidof to 'pids of processes not stuck in D' is probably going to silently break a ton of stuff
I have this bug on CentOS Linux release 8.1.1911 (Core). Seven redis instances are running on my server:
741 ? Ssl 2:10 /usr/bin/redis-server 127.0.0.1:6385
2794 ? Ssl 3:33 /usr/bin/redis-server 127.0.0.1:6381
3502 ? Ssl 1:07 /usr/bin/redis-server 127.0.0.1:6384
7789 ? Ssl 0:23 /usr/bin/redis-server 127.0.0.1:6383
9892 ? Ssl 1:14 /usr/bin/redis-server 127.0.0.1:6379
16546 ? Ssl 0:45 /usr/bin/redis-server 127.0.0.1:6380
30887 ? Ssl 1:19 /usr/bin/redis-server 127.0.0.1:6382
I ran a script built around this command: https://github.com/ClusterLabs/resource-agents/blob/b2dcccf1275727b0873e990fb123b46be536d608/heartbeat/redis.in#L371
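REDIS_SERVER="/usr/bin/redis-server"
pid=7789
while true; do
    # List the pids of all processes named redis-server
    cmd_res=$(pidof $(basename "$REDIS_SERVER"))
    echo "$(date "+%Y-%m-%d %H:%M:%S.%N") $cmd_res"
    # Report when the expected pid is missing (whole-word match)
    echo "$cmd_res" | grep -q "\<$pid\>" || echo "absent $pid"
    sleep 0.02
done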
And sometimes I see the following output:
2020-06-23 06:56:11.124601828 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.162999065 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.202587077 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.238393671 3502 2794 741
absent 7789
2020-06-23 06:56:11.284904179 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.335150357 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.383954529 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.430647416 30887 16546 9892 7789 3502 2794 741
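As a side note for anyone reproducing this: the membership test on the pidof output can be done without grep or regex word boundaries. A small sketch, assuming the pid list is a single space-separated line as in the output above; `pid_in_list` is a name made up here.

```shell
# pid_in_list PID "LIST": succeed if PID appears as a whole word in the
# space-separated LIST. (Hypothetical helper name.) Padding both sides
# with spaces keeps 789 from matching inside 7789.
pid_in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

In the loop it could be used as `pid_in_list "$pid" "$cmd_res" || echo "absent $pid"`.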
I have a cluster where pacemaker sometimes restarts Redis.
Digging in showed that the RA reports OCF_NOT_RUNNING while redis-server is actually running fine. This causes pacemaker to recover (restart) the resource. In my case this happens once a week or so.
I've added some debugging to the RA script and determined the actual line responsible for this: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/redis.in#L371
I also monitored $REDIS_PIDFILE for changes: it was not changing, and it contained the correct pid of the redis server. The debug output that I added shows that $pid holds the correct pid and $REDIS_SERVER is correct, but pidof sometimes does not output the pid of the running redis-server.
Here is a link that I think explains why pidof may sometimes not include the pid of redis-server in its output: https://unix.stackexchange.com/questions/518411/why-is-pidof-not-working/518412#518412
As described in the link, pidof does not include pids of processes in Uninterruptible Sleep (D) State (Blocked Waiting for I/O).
Also, I was able to catch "pidof miss" with this command (after a while it printed "absent"):
# while true; do pidof -q /usr/bin/redis-server || echo absent; sleep 0.02; done
absent
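Since the misses look brief (one sample out of many at 0.02 s intervals), one conceivable stop-gap is to repeat the check a few times before declaring the process absent. This is only a sketch of that idea, not something the RA does; `retry_check` is a made-up name, and the number of tries and the delay would need tuning.

```shell
# retry_check TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY
# seconds between attempts, and succeed as soon as CMD succeeds.
# (Hypothetical helper; returns 1 if every attempt fails.)
retry_check() {
    rc_tries=$1
    rc_delay=$2
    shift 2
    rc_i=0
    while [ "$rc_i" -lt "$rc_tries" ]; do
        "$@" && return 0
        rc_i=$((rc_i + 1))
        # Do not sleep after the final failed attempt
        if [ "$rc_i" -lt "$rc_tries" ]; then
            sleep "$rc_delay"
        fi
    done
    return 1
}
```

For example: `retry_check 3 0.05 pidof -q /usr/bin/redis-server || echo absent`. This only narrows the window; a process that stays in D state longer than the total retry time would still be reported as absent.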
Recent versions of pidof have a -z option to not skip D and Z processes. But the option covers both states together: you can't keep skipping Z while not skipping D. Also, the man page for this option says that not skipping D and Z processes can sometimes cause pidof to hang.
There was a similar bug related to pidof and prelink: https://github.com/ClusterLabs/resource-agents/pull/616
There is a commit that replaces the functionality of pidof: https://github.com/ClusterLabs/resource-agents/pull/616/commits/89a6f431588b7d26d2f90ba7ea48e371df4e453e
Not sure if it's a good way to go.
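To make the discussion concrete, here is one possible shape for a pidof-free check. It is not the code from that commit, only a sketch: it takes the pid from the pidfile and compares /proc/PID/comm with the expected program name. The function name `pid_is_program` and the optional third argument (an alternative proc root, there only to make the function testable) are assumptions made up here. Note that the kernel truncates comm to 15 characters, which "redis-server" fits within, and that a process can change its own comm.

```shell
# pid_is_program PID NAME [PROC_ROOT]: succeed if PROC_ROOT/PID/comm is
# readable and its content equals NAME. PROC_ROOT defaults to /proc.
# (Hypothetical helper, not taken from the resource agent.)
pid_is_program() {
    pip_file="${3:-/proc}/$1/comm"
    [ -r "$pip_file" ] || return 1
    read -r pip_comm < "$pip_file" || return 1
    [ "$pip_comm" = "$2" ]
}
```

In the RA context the call might look like `pid_is_program "$pid" "$(basename "$REDIS_SERVER")"`, with `$pid` read from `$REDIS_PIDFILE`.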