ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

redis: RA sometimes reports false OCF_NOT_RUNNING #1491

Open pokotilenko opened 4 years ago

pokotilenko commented 4 years ago

I have a cluster where pacemaker sometimes restarts Redis.

Digging in showed that the RA reports OCF_NOT_RUNNING while redis-server is actually running fine. This causes pacemaker to recover (restart) the resource. In my case this happens about once a week.

I added some debugging to the RA script and determined the line responsible for this: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/redis.in#L371

I also monitored $REDIS_PIDFILE for changes: it did not change and contained the correct pid of the redis server. The debug output that I added shows that $pid holds the correct pid and $REDIS_SERVER is correct, but for some reason pidof sometimes does not output the pid of the running redis-server.

Here is a link that I think explains why pidof may sometimes not include the pid of redis-server in its output: https://unix.stackexchange.com/questions/518411/why-is-pidof-not-working/518412#518412

As described in the link, pidof does not include pids of processes in Uninterruptible Sleep (D) State (Blocked Waiting for I/O).

Also, I was able to catch a "pidof miss" with this command (after a while it printed "absent"):

# while true; do pidof -q /usr/bin/redis-server || echo absent; sleep 0.02; done
absent

Recent versions of pidof have a -z option to not skip D and Z processes. But there is no way to skip Z processes while still including D ones. Also, the manpage for this option says that not skipping D and Z processes can sometimes cause pidof to hang.

There was a similar bug related to pidof and prelink: https://github.com/ClusterLabs/resource-agents/pull/616

There is a commit there that replaces the functionality of pidof: https://github.com/ClusterLabs/resource-agents/pull/616/commits/89a6f431588b7d26d2f90ba7ea48e371df4e453e

Not sure if it's a good way to go.
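To make the idea of checking the process without pidof concrete, here is a minimal sketch. It is not taken from the linked commit or from the RA; `pid_matches_binary` is a hypothetical helper name. It assumes Linux with /proc mounted and that the caller is allowed to signal the pid (the RA normally runs as root). Since it reads /proc/<pid>/comm directly, the result does not depend on the scheduler state of the process.

```shell
# Hypothetical helper, not part of the redis RA.
# Returns 0 if pid $1 is alive and its command name matches basename of $2.
pid_matches_binary() {
    pmb_pid=$1
    pmb_binary=$2

    # An empty pid (e.g. missing pidfile) can never match.
    [ -n "$pmb_pid" ] || return 1

    # Signal 0 only checks that the process exists and can be signalled.
    kill -0 "$pmb_pid" 2>/dev/null || return 1

    # /proc/<pid>/comm holds the command name, truncated to 15 characters
    # by the kernel; "redis-server" fits within that limit.
    pmb_comm=$(cat "/proc/$pmb_pid/comm" 2>/dev/null) || return 1

    [ "$pmb_comm" = "$(basename "$pmb_binary")" ]
}
```

With the values from this report, a call would look like `pid_matches_binary "$(cat "$REDIS_PIDFILE")" /usr/bin/redis-server`.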

pokotilenko commented 4 years ago

I caught redis-server being in Uninterruptible Sleep (D) State (Blocked Waiting for I/O) with a script running for a few hours:

while true; do grep -vE '\(redis-server\) (S|R)' /proc/20188/stat; sleep 0.005; done
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...
20188 (redis-server) D 1 ...

Given that pidof is known to skip such processes, this is definitely the cause, and it is a problem.
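For reference, the state letter shown above is the third field of /proc/<pid>/stat, right after the command name in parentheses. A small sketch for extracting it follows; `state_from_stat_line` and `proc_state` are hypothetical names, and the code assumes the Linux stat format. It matches the last ") " in the line so that a command name containing parentheses does not confuse the parse.

```shell
# Hypothetical helpers for reading the process state letter (R, S, D, Z, ...).

# Extract the state letter from one /proc/<pid>/stat line passed as $1.
# The greedy ".*) " skips to the last ") " that is followed by a letter.
state_from_stat_line() {
    printf '%s\n' "$1" | sed -n 's/^.*) \([A-Za-z]\) .*$/\1/p'
}

# Print the current state letter of pid $1, or nothing if it is gone.
proc_state() {
    state_from_stat_line "$(cat "/proc/$1/stat" 2>/dev/null)"
}
```

For example, `proc_state 20188` would have printed D at the moments captured in the output above.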

pokotilenko commented 4 years ago

I applied this as a workaround until an official solution is found: https://github.com/ClusterLabs/resource-agents/pull/616/commits/89a6f431588b7d26d2f90ba7ea48e371df4e453e

mbaldessari commented 4 years ago

What distro/system/environment is this? Because on my Fedora this seems to work just fine:

[root@holtby ~]# ps auxwf |grep uninter
root 769239 0.0 0.0 2200 512 pts/15 D+ 12:01 0:00 | _ ./uninterruptible
root 769240 0.0 0.0 2200 512 pts/15 S+ 12:01 0:00 | _ ./uninterruptible

[root@holtby ~]# pidof uninterruptible
769240 769239

procps-ng-3.3.15 seems to not filter out 'D' processes, at least here.

I also saw no mention of any '-z' in pidof upstream on https://gitlab.com/procps-ng/procps

pokotilenko commented 4 years ago

This is Debian 10 (Buster)

/bin/pidof comes from the sysvinit-utils 2.93-8 package:
https://packages.debian.org/en/buster/sysvinit-utils
https://packages.debian.org/en/buster/amd64/sysvinit-utils/filelist

pidof man pages:
Buster (no -z, but D and Z are ignored): https://manpages.debian.org/buster/sysvinit-utils/pidof.8.en.html
Testing (-z present, skipping of D and Z documented): https://manpages.debian.org/testing/sysvinit-utils/pidof.8.en.html

It seems that in Fedora pidof comes from the procps-ng package, which may have a different implementation/behaviour.

On Debian procps-ng is packaged as just procps; its changelog states that: https://metadata.ftp-master.debian.org/changelogs//main/p/procps/procps_3.3.15-2_changelog procps (1:3.3.0-1) unstable; urgency=low

But pidof is not shipped with this package in either Buster or Testing: https://packages.debian.org/buster/amd64/procps/filelist

Debian builds procps with "--disable-pidof".

pokotilenko commented 4 years ago

Here is the Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926896

mbaldessari commented 4 years ago

Ah now that explains it, thanks for the additional info. Yeah ideally this is solved at distro-level. Am a bit surprised tbh because changing the semantics of pidof to 'pids of processes not stuck in D' is probably going to silently break a ton of stuff

VolkDS commented 4 years ago

I have this bug on CentOS Linux release 8.1.1911 (Core). Seven redis-server instances are running on my server:

741 ? Ssl 2:10 /usr/bin/redis-server 127.0.0.1:6385
2794 ? Ssl 3:33 /usr/bin/redis-server 127.0.0.1:6381
3502 ? Ssl 1:07 /usr/bin/redis-server 127.0.0.1:6384
7789 ? Ssl 0:23 /usr/bin/redis-server 127.0.0.1:6383
9892 ? Ssl 1:14 /usr/bin/redis-server 127.0.0.1:6379
16546 ? Ssl 0:45 /usr/bin/redis-server 127.0.0.1:6380
30887 ? Ssl 1:19 /usr/bin/redis-server 127.0.0.1:6382

I ran a script based on this command: https://github.com/ClusterLabs/resource-agents/blob/b2dcccf1275727b0873e990fb123b46be536d608/heartbeat/redis.in#L371

REDIS_SERVER="/usr/bin/redis-server"
pid=7789
while true; do
    cmd_res=$(pidof $(basename "$REDIS_SERVER"))
    echo "$(date "+%Y-%m-%d %H:%M:%S.%N") $cmd_res"
    echo "$cmd_res" | grep -q "\<$pid\>" || echo "absent $pid"
    sleep 0.02
done

And sometimes I see output like the following:

2020-06-23 06:56:11.124601828 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.162999065 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.202587077 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.238393671 3502 2794 741
absent 7789
2020-06-23 06:56:11.284904179 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.335150357 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.383954529 30887 16546 9892 7789 3502 2794 741
2020-06-23 06:56:11.430647416 30887 16546 9892 7789 3502 2794 741
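In the output above the miss lasts for a single sample. One possible way to tolerate such short misses, sketched here purely as an illustration and not proposed by anyone in this thread, is to repeat the check a few times before treating the process as gone. `retry_check` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions. Note that it only hides misses shorter than the total retry window.

```shell
# Hypothetical helper: run a check command up to N times.
# Usage: retry_check ATTEMPTS DELAY_SECONDS command [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_check() {
    rc_attempts=$1
    rc_delay=$2
    shift 2

    rc_i=0
    while [ "$rc_i" -lt "$rc_attempts" ]; do
        "$@" && return 0
        rc_i=$((rc_i + 1))
        # Sleep only between attempts, not after the last one.
        if [ "$rc_i" -lt "$rc_attempts" ]; then
            sleep "$rc_delay"
        fi
    done
    return 1
}
```

For example, `retry_check 3 0.1 pidof redis-server >/dev/null` reports failure only if three consecutive pidof calls, about 0.1 seconds apart, all find nothing.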