storage-mon: storage_mon updates attribute, waits for child to finish

inouekazu commented 1 year ago

This patch is intended to avoid the issue (https://github.com/ClusterLabs/resource-agents/issues/1809) of an increasing number of child processes that do not finish.

knet-ci-bot commented 1 year ago

Can one of the admins verify this patch?

wenningerk commented 1 year ago

I do understand how you moved updating the status-attribute from the RA to the storage_mon-tool to give an update to the cluster without having to return from the RA (and the storage_mon-tool as a precondition to that). But does that really prevent creation of a lot of defunct-processes as pacemaker is gonna shoot the RA after timeout. Of course selecting a large timeout you can still keep the rate of defunct-processes under control somehow. Don't remember how pacemaker will handle the case where timeout > monitor-rate though.

inouekazu commented 1 year ago

If an I/O unresponsive failure occurs with the following settings:

pcs property set node-health-strategy=only-green
pcs resource create storage-mon ocf:heartbeat:storage-mon drives="/dev/mapper/mpathc"
pcs resource clone storage-mon storage-mon-clone
pcs resource meta storage-mon-clone allow-unhealthy-nodes=true
pcs stonith create fence1-ipmilan fence_ipmilan (snip)
pcs stonith create fence2-ipmilan fence_ipmilan (snip)

storage-mon monitor is executed,

UID          PID    PPID  C STIME TTY          TIME CMD
root        5526    5164  0 12:18 ?        00:00:00 /usr/bin/bash /usr/lib/ocf/resource.d/heartbeat/storage-mon monitor
root        5529    5526  0 12:18 ?        00:00:00 /usr/libexec/heartbeat/storage_mon --device /dev/mapper/mpathc --score 1 --timeout 10
root        5530    5529  0 12:18 ?        00:00:00 /usr/libexec/heartbeat/storage_mon --device /dev/mapper/mpathc --score 1 --timeout 10

and timeout occurs.


Node List:
* Node dl380g8a: online (health is RED)
* Online: [ dl380g8b ]

Full List of Resources:

Clone Set: storage-mon-clone [storage-mon]:
- storage-mon (ocf::heartbeat:storage-mon): FAILED dl380g8a
- Started: [ dl380g8b ]
fence1-ipmilan (stonith:fence_ipmilan): Started dl380g8b
fence2-ipmilan (stonith:fence_ipmilan): Stopped

Migration Summary:

Node: dl380g8a:
- storage-mon: migration-threshold=1000000 fail-count=1 last-failure='Fri Oct 14 12:20:36 2022'

Failed Resource Actions:

storage-mon_monitor_30000 on dl380g8a 'error' (1): call=17, status='Timed Out', exitreason='Resource agent did not complete within 2m', last-rc-change='Fri Oct 14 12:20:36 2022', queued=0ms, exec=0ms
```
3. Next, stop is executed,
```
UID PID PPID C STIME TTY TIME CMD root 5530 1 0 12:18 ? 00:00:00 /usr/libexec/heartbeat/storage_mon --device /dev/mapper/mpathc --score 1 --timeout 10 root 5641 5164 0 12:20 ? 00:00:00 /usr/bin/bash /usr/lib/ocf/resource.d/heartbeat/storage-mon stop root 5644 5641 0 12:20 ? 00:00:00 /usr/libexec/heartbeat/storage_mon --device /dev/mapper/mpathc --score 1 --timeout 10 root 5645 5644 0 12:20 ? 00:00:00 /usr/libexec/heartbeat/storage_mon --device /dev/mapper/mpathc --score 1 --timeout 10
```
4. and timeout also occurs. As a result, this node is fenced.
```
Node List:
Node dl380g8a: UNCLEAN (online) (health is RED)
Online: [ dl380g8b ]

Full List of Resources:

Clone Set: storage-mon-clone [storage-mon]:
- storage-mon (ocf::heartbeat:storage-mon): FAILED dl380g8a
- Started: [ dl380g8b ]
fence1-ipmilan (stonith:fence_ipmilan): Started dl380g8b
fence2-ipmilan (stonith:fence_ipmilan): Stopped

Migration Summary:

Node: dl380g8a:
- storage-mon: migration-threshold=1000000 fail-count=1000000 last-failure='Fri Oct 14 12:22:37 2022'

Failed Resource Actions:

storage-mon_stop_0 on dl380g8a 'error' (1): call=22, status='Timed Out', exitreason='Resource agent did not complete within 2m', last-rc-change='Fri Oct 14 12:20:37 2022', queued=0ms, exec=120004ms

Pending Fencing Actions:

reboot of dl380g8a pending: client=pacemaker-controld.2805184, origin=dl380g8b


I think that it is not avoidable for the process to be stuck, but it is avoidable for the process to keep increasing.

wenningerk commented 1 year ago

True! I guess there is no way around these processes kept hanging. And I didn't say I knew one or you should find one ;-) Just wanted to state that if running into timeouts and being called repeatedly by pacemaker we will still see the hanging processes piling up still - although maybe with a little bit more control than before. If you are running the health-resource in a way that a timeouting monitor and subsequent stop will lead to node-fencing you are right that there won't be much repeating done by pacemaker. Unless maybe that happens already during start. All depending on how the resource is configured to react on failures of course. In case of these health-resources, question is, if we really want them to fail and trigger immediate fencing of a node. Idea of health, as I see it, is to still keep running what is running happily on a node - even if unhealthy - while not starting any new resources and migrating away other stuff at some point.

Side I was approaching this topic from was verification of devices used by sbd poison-pill (in fence_sbd - fence-agent). In this case you definitely don't want immediate fencing of the node as everything that isn't working for now is that the node isn't able to fence another node (or not even that as long as e.g. just one out of 3 disks is failing leaving behind a hanging process). If the node is foreseen to be a target for poison-pills as well sbd-daemon will in parallel check availability of the disk(s) and trigger suicide once the node isn't able to properly access a quorate number of disks anymore. Hope that this example explains cases where it would be interesting to check for a disk, maybe creating a hanging process, while there is still the need for the node to survive and of course we still would like for those processes not to pile up. So I thought we could gather ideas how to tackle these cases.

Things I was thinking of would be to return (with a negative monitor result or not) before a timeout but keeping the process ids of still running processes memorized somehow/where as to properly react with the next monitor call. Other approach would be to keep a daemon running that keeps this data and possibly re-trigger tests and have the actual RA to just communicate with that daemon. iirc the latter pattern has been discussed with certain kinds of resources earlier already (with the plugin-interface of the legacy fenced-implementation it was even possible to run such a daemon as part of fenced). In the above case of sbd poison-pill devices it of course comes handy that there usually is already a daemon running so that we could easily instead of directly checking a device, ask that already running daemon, that is anyway doing these checks periodically (at quite a high frequency that can hardly be achieved with pacemaker resource monitoring btw.).

Excuse my excursion into sbd but I think we have some common issues/patterns here and I thought it might add some substance to the discussion. Btw. since OCF 1.1 we have the possibility to have a resource running in a degraded state instead of having to fail if something isn't right as it should be. That might be interesting for health-resources - at least under certain circumstances - to show that something isn't right in cluster status (more fine-grained than just seeing the whole node being e.g. yellow).

inouekazu commented 1 year ago

Things I was thinking of would be to return (with a negative monitor result or not) before a timeout but keeping the process ids of still running processes memorized somehow/where as to properly react with the next monitor call.

Great, this will allow us to continue monitoring without increasing the number of processes.

Other approach would be to keep a daemon running that keeps this data and possibly re-trigger tests and have the actual RA to just communicate with that daemon.

Although this approach has a proven track record [1], it was a major change, so it was only considered.

I will also consider with the members involved in pm_diskd [1].

[1] https://github.com/linux-ha-japan/pm_diskd : Japanese users used this agent with pacemaker-1.x to monitor disks.

ClusterLabs / resource-agents

storage-mon: storage_mon updates attribute, waits for child to finish #1812