ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

Filesystem in RHEL9.3 takes considerably longer to complete its stop operation compared to RHEL9.2. #1907

Open SatomiOSAWA opened 6 months ago

SatomiOSAWA commented 6 months ago

Hi all,

I found that the Filesystem resource agent in RHEL 9.3 takes much longer to complete its stop operation than in RHEL 9.2. My configuration has 10 Filesystem resources. In RHEL 9.2, stopping all 10 takes only 2 to 3 seconds; in RHEL 9.3, it takes 23 seconds. I tried setting signal_delay=0, but stopping all of them still took 12 seconds. It seems that the get_pids function is taking some time. What should I do to shorten the stop operation in RHEL 9.3?

Best Regards, Satomi OSAWA

SatomiOSAWA commented 6 months ago

In addition: in both RHEL 9.3 and RHEL 9.2, there is no access to the devices mounted by the Filesystem resources.

oalbrigt commented 6 months ago

You can probably set term_signals="TERM KILL" to imitate the old way.
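
As a sketch, assuming the resources are managed with pcs: the helper below only prints one `pcs resource update` command per resource so they can be reviewed before being run. The function name and the resource names passed to it are placeholders, not taken from your cluster.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints one "pcs resource update"
# command per resource name instead of executing it.
print_term_signal_updates() {
    for rsc in "$@"; do
        printf 'pcs resource update %s term_signals="TERM KILL"\n' "$rsc"
    done
}

# Placeholder resource names; substitute the real Filesystem resource IDs.
print_term_signal_updates filesystem1 filesystem2
```

Running the printed commands (or piping them to `sh`) would apply the setting.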

What errors do you get regarding access to devices in 9.3/9.2?

oalbrigt commented 6 months ago

You can run ausearch -m AVC to get additional information if the device issue is caused by SELinux.

SatomiOSAWA commented 6 months ago

Hello, @oalbrigt !

Thank you for your response.

> You can probably set term_signals="TERM KILL" to imitate the old way.

Both in RHEL 9.2 and RHEL 9.3, there were no processes that needed to be killed when the Filesystem resources stopped. And SELinux has been disabled.

> What errors do you get regarding access to devices in 9.3/9.2?

I got no errors in either RHEL 9.2 or RHEL 9.3. Below is an excerpt of the logs in RHEL9.3.

(snip)
Dec 26 10:16:47 rhel93a pacemaker-controld[78748]: notice: Requesting local execution of stop operation for filesystem10 on rhel93a
Dec 26 10:16:47 rhel93a pacemaker-controld[78748]: notice: Requesting local execution of stop operation for fence2-ipmilan on rhel93a
Dec 26 10:16:47 rhel93a pacemaker-controld[78748]: notice: Result of stop operation for fence2-ipmilan on rhel93a: ok
Dec 26 10:16:47 rhel93a Filesystem(filesystem10)[84349]:INFO: Running stop for /dev/sdb10 on /mnt/disk9
Dec 26 10:16:47 rhel93a Filesystem(filesystem10)[84349]:INFO: Trying to unmount /mnt/disk9
Dec 26 10:16:48 rhel93a Filesystem(filesystem10)[84349]:INFO: No processes on /mnt/disk9 were signalled. force_unmount is set to 'safe'
Dec 26 10:16:49 rhel93a kernel:XFS (sdb10): Unmounting Filesystem
Dec 26 10:16:49 rhel93a systemd[1]:mnt-disk9.mount: Deactivated successfully.
Dec 26 10:16:49 rhel93a Filesystem(filesystem10)[84349]:INFO: unmounted /mnt/disk9 successfully
Dec 26 10:16:49 rhel93a pacemaker-controld[78748]: notice: Result of stop operation for filesystem10 on rhel93a: ok
(snip)

In RHEL 9.3, the Filesystem agent tries to identify the processes using the device before unmounting it. In RHEL 9.2, by contrast, it attempts the unmount first and only tries to identify those processes if the unmount fails. So I think RHEL 9.3 may be taking longer because the processes are identified up front.

kaneter commented 3 weeks ago

This issue comes from the time spent in steps 1 and 3 of the following stop sequence.

  1. Identify the IDs of the processes using the disk.
  2. Send SIGTERM or SIGKILL to the identified processes.
  3. Sleep for signal_delay seconds (default 1 second).
  4. Attempt unmounting.
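
As a rough check against the figures reported at the top of this issue (23 seconds with the default signal_delay, 12 seconds with signal_delay=0, 10 resources), the per-resource cost can be estimated as below. This assumes the resources stop one after another, which the thread does not confirm; `per_fs` is just a helper name for this sketch.

```shell
#!/bin/sh
# Figures reported in this issue for RHEL 9.3.
resources=10
total_default=23     # seconds, signal_delay at its default of 1
total_no_delay=12    # seconds, signal_delay=0

# Assumption: resources stop serially, so totals divide evenly per resource.
per_fs() {
    # $1 = total seconds, $2 = number of resources; prints seconds per resource
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n }'
}

echo "removed by signal_delay=0: $(per_fs $((total_default - total_no_delay)) "$resources") s per resource"
echo "still remaining:           $(per_fs "$total_no_delay" "$resources") s per resource"
```

Under that assumption, about 1.1 seconds per resource disappears with signal_delay=0, which is close to the 1 second sleep in step 3, and about 1.2 seconds per resource remains for the other steps.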

As I understand it, the operations are ordered this way as a precaution: when processes are using a large file system, an unmount attempt can take a long time whether it succeeds or fails.

On the other hand, users who are certain that only processes managed by the resource agents use the mounted file system may want to attempt the unmount first. For such users, it may be worth discussing an OCF parameter that makes the stop operation attempt the unmount as its first step (e.g., optimistic_unmount=true/false).
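
To make the idea concrete, here is a minimal sketch of the control flow I have in mind. It is not the agent's actual code: optimistic_unmount is only the proposed name, and the three helpers are stubs that print what the real logic would do. The `UMOUNT_RC` variable exists only so the sketch can simulate a failed unmount.

```shell
#!/bin/sh
# Stubs standing in for the agent's real logic (hypothetical names).
list_users_stub()  { echo "step: identify processes on $1"; }
signal_wait_stub() { echo "step: signal processes, sleep signal_delay"; }
try_umount_stub()  { echo "step: umount $1"; return "${UMOUNT_RC:-0}"; }

stop_fs() {
    mnt=$1
    optimistic=${2:-false}   # value of the proposed optimistic_unmount

    if [ "$optimistic" = "true" ]; then
        # Proposed order: attempt the unmount first and finish on success.
        try_umount_stub "$mnt" && return 0
    fi

    # Current order: identify users, signal them, wait, then unmount.
    list_users_stub "$mnt"
    signal_wait_stub
    try_umount_stub "$mnt"
}

stop_fs /mnt/disk9 true
```

With the parameter set to false, the stop sequence stays exactly as it is today, so existing configurations would not change behavior.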

I'd be more than willing to create a pull request if such an OCF parameter is wanted.