Closed SatomiOSAWA closed 3 months ago
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-1917/1/input
It should not try to unmount before killing the processes.
Ah, I see! Now I understand that you made the specification change because of that consideration. Then how about making it sleep only when it sends a signal? I know "signal_delay=0" works, but this way users can keep the same settings as before while reducing stop and failover (F/O) durations.
Like this:
[root@rhel93a heartbeat]# diff -u Filesystem.devel.orig Filesystem.devel.check_send_signal
--- Filesystem.devel.orig 2024-02-26 14:12:27.804280013 +0900
+++ Filesystem.devel.check_send_signal 2024-02-29 14:44:04.336890634 +0900
@@ -677,12 +677,13 @@
     pids=$(get_pids "$dir")
     if [ -z "$pids" ]; then
         ocf_log info "No processes on $dir were signalled. force_unmount is set to '$FORCE_UNMOUNT'"
-        return
+        return 1
     fi
     for pid in $pids; do
         ocf_log info "sending signal $sig to: $(ps -f $pid | tail -1)"
         kill -s $sig $pid
     done
+    return 0
 }
 try_umount() {
     local SUB="$1"
@@ -709,12 +710,14 @@
     return $ret
 }
 fs_stop_loop() {
-    local SUB="$1" signals="$2" sig
+    local SUB="$1" signals="$2" sig ret send_signal=false
     while true; do
         for sig in $signals; do
             signal_processes "$SUB" $sig
+            ret=$?
+            [ $ret -eq 0 ] && send_signal=true
         done
-        sleep $OCF_RESKEY_signal_delay
+        $send_signal && sleep $OCF_RESKEY_signal_delay
         try_umount "$SUB" && return $OCF_SUCCESS
     done
 }
Best Regards, Satomi OSAWA
Hi, @oalbrigt !
I'm so sorry. The code I posted in my modification proposal had a bug, so I have fixed it. Thank you in advance for your consideration.
Best Regards, Satomi OSAWA
[root@rhel93a heartbeat]# diff -u Filesystem.devel.orig Filesystem.devel.check_send_signal
--- Filesystem.devel.orig 2024-02-26 14:12:27.804280013 +0900
+++ Filesystem.devel.check_send_signal 2024-03-12 12:24:07.637681726 +0900
@@ -677,12 +677,13 @@
     pids=$(get_pids "$dir")
     if [ -z "$pids" ]; then
         ocf_log info "No processes on $dir were signalled. force_unmount is set to '$FORCE_UNMOUNT'"
-        return
+        return 1
     fi
     for pid in $pids; do
         ocf_log info "sending signal $sig to: $(ps -f $pid | tail -1)"
         kill -s $sig $pid
     done
+    return 0
 }
 try_umount() {
     local SUB="$1"
@@ -709,12 +710,15 @@
     return $ret
 }
 fs_stop_loop() {
-    local SUB="$1" signals="$2" sig
+    local SUB="$1" signals="$2" sig ret send_signal
     while true; do
+        send_signal=false
         for sig in $signals; do
             signal_processes "$SUB" $sig
+            ret=$?
+            [ $ret -eq 0 ] && send_signal=true
         done
-        sleep $OCF_RESKEY_signal_delay
+        $send_signal && sleep $OCF_RESKEY_signal_delay
         try_umount "$SUB" && return $OCF_SUCCESS
     done
 }
No worries. We're considering some options here, as we found some possible edge-cases that we should solve as well.
That's truly impressive!! I am eager to see those enhancements achieved. I would like to apologize for troubling you. Thank you so much.
Best Regards, Satomi OSAWA
Hi all,
I have noticed that Filesystem’s stop operation in RHEL 9.3 takes longer than before. (I already reported it in Issue #1907.) While the extension of the duration for each stop operation may not be significant, it seems that every little bit adds up. Therefore, I attempted to improve it. I believe it works well.
Now, I have 10 Filesystems in a group. I measured the time it takes to stop all 10 Filesystems under some conditions. I conducted the measurements 5 times and calculated the average. And the results are as follows:
During the measurements, no processes were accessing the mounted devices, and I had no errors. I simply did that
and checked /var/log/messages.
With this PR, if any processes are accessing the mounted device when the stop operation starts, the umount command fails immediately and the processes are killed. Then, after sleeping for signal_delay seconds, the umount command is executed again. This behavior is the same as in RHEL 9.2.
Any opinions or suggestions are welcome.
Best Regards, Satomi OSAWA
=== FYI ===
Node List:
Full List of Resources:
(snip)
Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Initiating stop operation filesystem10_stop_0 locally on rhel93a
Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Requesting local execution of stop operation for filesystem10 on rhel93a
Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Initiating stop operation fence2-ipmilan_stop_0 locally on rhel93a
Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Requesting local execution of stop operation for fence2-ipmilan on rhel93a
Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Result of stop operation for fence2-ipmilan on rhel93a: ok
Feb 26 19:11:38 rhel93a Filesystem(filesystem10)[289173]:INFO: Running stop for /dev/sdb10 on /mnt/disk9
Feb 26 19:11:38 rhel93a Filesystem(filesystem10)[289173]:INFO: Trying to unmount /mnt/disk9
Feb 26 19:11:39 rhel93a Filesystem(filesystem10)[289173]:INFO: No processes on /mnt/disk9 were signalled. force_unmount is set to 'safe'
Feb 26 19:11:40 rhel93a kernel:XFS (sdb10): Unmounting Filesystem
Feb 26 19:11:40 rhel93a systemd[1]:mnt-disk9.mount: Deactivated successfully.
Feb 26 19:11:40 rhel93a Filesystem(filesystem10)[289173]:INFO: unmounted /mnt/disk9 successfully
Feb 26 19:11:40 rhel93a pacemaker-controld[281626]: notice: Result of stop operation for filesystem10 on rhel93a: ok
Feb 26 19:11:40 rhel93a pacemaker-controld[281626]: notice: Initiating stop operation filesystem9_stop_0 locally on rhel93a
Feb 26 19:11:40 rhel93a pacemaker-controld[281626]: notice: Requesting local execution of stop operation for filesystem9 on rhel93a
(snip)
Feb 26 19:12:01 rhel93a pacemaker-controld[281626]: notice: Initiating stop operation filesystem1_stop_0 locally on rhel93a
Feb 26 19:12:01 rhel93a pacemaker-controld[281626]: notice: Requesting local execution of stop operation for filesystem1 on rhel93a
Feb 26 19:12:01 rhel93a Filesystem(filesystem1)[291881]:INFO: Running stop for /dev/sdb1 on /mnt/disk0
Feb 26 19:12:01 rhel93a Filesystem(filesystem1)[291881]:INFO: Trying to unmount /mnt/disk0
Feb 26 19:12:02 rhel93a Filesystem(filesystem1)[291881]:INFO: No processes on /mnt/disk0 were signalled. force_unmount is set to 'safe'
Feb 26 19:12:03 rhel93a systemd[1]:mnt-disk0.mount: Deactivated successfully.
Feb 26 19:12:03 rhel93a kernel:XFS (sdb1): Unmounting Filesystem
Feb 26 19:12:03 rhel93a Filesystem(filesystem1)[291881]:INFO: unmounted /mnt/disk0 successfully
Feb 26 19:12:04 rhel93a pacemaker-controld[281626]: notice: Result of stop operation for filesystem1 on rhel93a: ok
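For reference, per-resource stop times can be read off log lines like the ones above with a small helper. This is only a sketch: `stop_seconds` is a hypothetical name, it assumes the syslog layout shown here (time of day in the third field), and it assumes both lines fall on the same day.

```shell
#!/bin/sh
# stop_seconds RESOURCE
# Reads syslog-style lines on stdin and prints the number of seconds between
# "Initiating stop operation RESOURCE_stop_0" and
# "Result of stop operation for RESOURCE".
# Assumption: no midnight rollover between the two lines.
stop_seconds() {
    awk -v rsc="$1" '
        # convert HH:MM:SS to seconds since midnight
        function secs(t,  p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        index($0, "Initiating stop operation " rsc "_stop_0 ") { start = secs($3) }
        index($0, "Result of stop operation for " rsc " on ")  { print secs($3) - start }
    '
}

# Two lines taken from the log above; prints 2
printf '%s\n' \
    'Feb 26 19:11:38 rhel93a pacemaker-controld[281626]: notice: Initiating stop operation filesystem10_stop_0 locally on rhel93a' \
    'Feb 26 19:11:40 rhel93a pacemaker-controld[281626]: notice: Result of stop operation for filesystem10 on rhel93a: ok' \
    | stop_seconds filesystem10
```

The trailing space and " on " in the match strings keep `filesystem1` from also matching `filesystem10`.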