ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

ocf:heartbeat:Filesystem kills unrelated processes #1944

Closed by gianlucapiccolo 1 month ago

gianlucapiccolo commented 1 month ago

We hit a bug in which unrelated processes were killed during a filesystem resource relocation. The system had two mountpoints: /foo and /foo/bar. The relocated ocf:heartbeat:Filesystem resource managed /foo/bar. All the processes using the /foo/bar mountpoint were correctly killed so that the filesystem could be unmounted. What we did not expect was that all the processes using the /foo mountpoint were killed as well.

Actual behaviour of the ocf:heartbeat:Filesystem script when it relocates the /foo/bar resource:

  1. it retrieves all the processes that are currently using the /foo/bar filesystem
  2. it sends the terminate signal to those processes
  3. it sleeps until the configured timeout expires
  4. it tries to unmount the filesystem
  5. it checks whether the filesystem has been unmounted
  6. if the filesystem is still mounted (maybe the processes are taking a little time to terminate), it retrieves the processes currently using that filesystem again
  7. this time it sends the kill signal to those processes
  8. it sleeps again until the configured timeout expires
  9. it tries to unmount the filesystem
  10. the relocation continues normally...
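
To make the sequence above easier to follow, here is a minimal Python model of it. This is only a sketch, not the agent's code (the real agent is a shell script): `stop_filesystem`, `FakeHost`, the PIDs, and the `"TERM"`/`"KILL"` strings are all hypothetical names invented for illustration, and the final mounted check after the second pass is an assumption about what "continues normally" involves.

```python
from typing import Callable, Iterable, List, Tuple


def stop_filesystem(
    mountpoint: str,
    pids_using: Callable[[str], Iterable[int]],
    send_signal: Callable[[int, str], None],
    try_umount: Callable[[str], None],
    is_mounted: Callable[[str], bool],
    wait: Callable[[], None],
) -> bool:
    """Model of steps 1-9: a TERM pass, then a KILL pass if still mounted."""
    for sig in ("TERM", "KILL"):
        for pid in pids_using(mountpoint):  # steps 1 and 6
            send_signal(pid, sig)           # steps 2 and 7
        wait()                              # steps 3 and 8
        try_umount(mountpoint)              # steps 4 and 9
        if not is_mounted(mountpoint):      # step 5 (repeated after the KILL pass)
            return True
    return False


class FakeHost:
    """In-memory stand-in for a host; nothing here touches the real system."""

    def __init__(self) -> None:
        self.mounted = {"/foo", "/foo/bar"}
        self.procs = {101: "/foo/bar", 102: "/foo/bar"}  # pid -> mount in use
        self.ignores_term = {102}                        # 102 only dies on KILL
        self.log: List[Tuple[int, str]] = []

    def pids_using(self, mountpoint: str) -> List[int]:
        return [pid for pid, used in self.procs.items() if used == mountpoint]

    def send_signal(self, pid: int, sig: str) -> None:
        self.log.append((pid, sig))
        if sig == "KILL" or pid not in self.ignores_term:
            self.procs.pop(pid, None)

    def try_umount(self, mountpoint: str) -> None:
        # Unmounting only succeeds once nothing uses the mount.
        if not self.pids_using(mountpoint):
            self.mounted.discard(mountpoint)

    def is_mounted(self, mountpoint: str) -> bool:
        return mountpoint in self.mounted


host = FakeHost()
stopped = stop_filesystem(
    "/foo/bar",
    host.pids_using,
    host.send_signal,
    host.try_umount,
    host.is_mounted,
    wait=lambda: None,  # the real script sleeps here
)
```

In this model the process that ignores the terminate signal survives the first pass, so the second pass is what finally lets the unmount succeed. Note that the fake `pids_using` matches the mountpoint exactly, so it cannot reproduce the reported bug by itself.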

The problem is a possible race condition. For example, if the processes terminate and the filesystem is unmounted between steps 5 and 6, the function that returns the list of PIDs to be killed returns the list of all the processes using another filesystem instead: /foo.
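
The race can be illustrated with a small Python model. It assumes that the PID lookup resolves a path to whichever mounted filesystem currently contains it; whether the agent does exactly this is an assumption, and every name here (`containing_mount`, `pids_on_filesystem_of`, `pids_guarded`, the PIDs and paths under /foo) is hypothetical. The guarded variant is one possible mitigation, not necessarily the fix adopted upstream.

```python
import posixpath
from typing import Dict, List, Set


def containing_mount(path: str, mounts: Set[str]) -> str:
    """Return the longest mounted path that is `path` or one of its ancestors."""
    current = posixpath.normpath(path)
    while current not in mounts and current != "/":
        current = posixpath.dirname(current)
    return current


def pids_on_filesystem_of(path: str, mounts: Set[str], procs: Dict[int, str]) -> List[int]:
    """Unguarded lookup: PIDs whose working directory is on the filesystem holding `path`."""
    target = containing_mount(path, mounts)
    return sorted(pid for pid, cwd in procs.items()
                  if containing_mount(cwd, mounts) == target)


def pids_guarded(path: str, mounts: Set[str], procs: Dict[int, str]) -> List[int]:
    """Guarded lookup: return nothing unless `path` is itself still a mountpoint."""
    if posixpath.normpath(path) not in mounts:
        return []
    return pids_on_filesystem_of(path, mounts, procs)


# Before the race: both filesystems mounted, one process on each.
mounts_before = {"/", "/foo", "/foo/bar"}
procs_before = {201: "/foo/data", 202: "/foo/bar/work"}
before = pids_on_filesystem_of("/foo/bar", mounts_before, procs_before)

# After the race: PID 202 has exited and /foo/bar is already unmounted,
# so the path /foo/bar is now just a directory on the /foo filesystem.
mounts_after = {"/", "/foo"}
procs_after = {201: "/foo/data"}
raced = pids_on_filesystem_of("/foo/bar", mounts_after, procs_after)
guarded = pids_guarded("/foo/bar", mounts_after, procs_after)
```

In this model `before` contains only the process under /foo/bar, `raced` contains the unrelated process under /foo, and `guarded` is empty, which matches the behaviour described in the report.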