We hit a bug in which unrelated processes were killed during a filesystem resource relocation.
The system had two mountpoints: /foo and /foo/bar.
The relocated ocf:heartbeat:Filesystem resource managed /foo/bar.
All the processes using the /foo/bar mountpoint were correctly killed so that the filesystem could be unmounted.
What we did not expect was that all the processes using the /foo mountpoint were killed as well.
Actual behaviour when the ocf:heartbeat:Filesystem script relocates the /foo/bar resource:
1. it retrieves all the processes that are currently using the /foo/bar filesystem
2. it sends the terminate signal to those processes
3. it sleeps until the configured timeout expires
4. it tries to unmount the filesystem
5. it checks whether the filesystem has been unmounted
6. if the filesystem is not yet unmounted (maybe the processes are taking a little time to terminate), it retrieves the processes currently using that filesystem again
7. this time it sends the kill signal to those processes
8. it sleeps again until the configured timeout expires
9. it tries to unmount the filesystem
10. the relocation continues normally...
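The sequence above can be sketched roughly as follows. This is only an illustrative model in Python, not the resource agent's actual code (which is a shell script): the function name `stop_filesystem` and all of its callback parameters are hypothetical, and the fake callbacks exist only to make the example runnable.

```python
def stop_filesystem(mountpoint, list_pids, send_signal, try_umount, is_mounted, wait):
    """Model of the two-round stop sequence (hypothetical helper, not the real agent).

    Returns True once `mountpoint` is unmounted, False if both rounds fail.
    """
    for sig in ("TERM", "KILL"):
        # Look up the processes using the filesystem, then signal them.
        for pid in list_pids(mountpoint):
            send_signal(pid, sig)
        wait()                      # sleep for the configured timeout
        try_umount(mountpoint)      # attempt the unmount
        if not is_mounted(mountpoint):
            return True
        # Still mounted: the next round repeats the PID lookup. The window
        # between the check above and that lookup is where the race can occur.
    return False


# Usage with fake callbacks: the unmount only succeeds on the second attempt.
log = []
state = {"mounted": True, "attempts": 0}

def fake_umount(mp):
    state["attempts"] += 1
    if state["attempts"] == 2:
        state["mounted"] = False

ok = stop_filesystem(
    "/foo/bar",
    list_pids=lambda mp: [201],
    send_signal=lambda pid, sig: log.append((pid, sig)),
    try_umount=fake_umount,
    is_mounted=lambda mp: state["mounted"],
    wait=lambda: None,
)
print(ok, log)
```

In this model the second PID lookup happens after the mounted check, which mirrors the gap between steps 5 and 6.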
The problem is a possible race condition. For example, if the processes terminate and the filesystem is unmounted between steps 5 and 6, then /foo/bar is no longer a mountpoint but an ordinary directory on /foo. As a result, the function that returns the list of PIDs to be killed returns all the processes using another filesystem: /foo.
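A minimal simulation of this race, assuming the PID lookup resolves a path to whichever mounted filesystem currently contains it. The in-memory `mounts` table, the PIDs, and the helper names (`containing_mount`, `pids_using`, `pids_using_guarded`) are all made up for illustration. The guarded variant shows one possible mitigation (refusing to answer when the path is no longer a mountpoint) and is not necessarily how the agent is or should be fixed.

```python
def containing_mount(mounts, path):
    """Return the longest mounted prefix that contains `path`, or None."""
    best = None
    for mp in mounts:
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            if best is None or len(mp) > len(best):
                best = mp
    return best


def pids_using(mounts, path):
    """Unguarded lookup: PIDs using the filesystem that currently contains `path`."""
    mp = containing_mount(mounts, path)
    return set(mounts[mp]) if mp is not None else set()


def pids_using_guarded(mounts, path):
    """Guarded lookup: only answer if `path` itself is still a mountpoint."""
    if path not in mounts:
        return set()
    return set(mounts[path])


# Hypothetical state: mountpoint -> PIDs with files open on that filesystem.
mounts = {"/foo": {101, 102}, "/foo/bar": {201}}
before = pids_using(mounts, "/foo/bar")           # only the /foo/bar user

# The race: the processes exit and /foo/bar is unmounted just before the lookup.
del mounts["/foo/bar"]
after = pids_using(mounts, "/foo/bar")            # now resolves to /foo
guarded = pids_using_guarded(mounts, "/foo/bar")  # empty: nothing to kill

print(sorted(before), sorted(after), sorted(guarded))
```

With the unguarded lookup, the second round would signal PIDs 101 and 102, which only ever used /foo.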