IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Duplicate transfer request within the same job prevents cleanup of the job's logical volume #1003

Open dlherms-ibm opened 3 years ago

dlherms-ibm commented 3 years ago

When the same transfer definition is submitted twice within a job, with the same handle and contribid both times, one of the transfers succeeds and the other fails. That is expected. However, after both operations have run to completion, file locks remain on the compute node. These file locks prevent the unmount of the path to the job's logical volume and the deletion of that logical volume. The workaround is to end bbProxy for the compute node, manually unmount the path to the logical volume, and then restart bbProxy. Restarting bbProxy automatically deletes the orphaned logical volume.

Performing the same transfer twice with the same handle and contribid is an application error, but the system should not be left in a state where the job's logical volume cannot be properly cleaned up.

dlherms-ibm commented 3 years ago

The problem is that when the second start transfer runs, it opens the source file, builds a file handle, and inserts it into the file handle registry. That insert overlays the first file handle, which leaks the first fd. The failing second transfer then removes the file handle it just inserted into the registry. When the first transfer actually completes, it cannot find any file handle in the registry to close for the file.
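To make the sequence concrete, here is a minimal Python sketch of the overlay. It is not the bbProxy code: `FileHandleRegistry`, the key layout, and `demonstrate_overlay` are hypothetical names, and it assumes the failing second transfer closes its own descriptor when it removes its entry.

```python
import os
import tempfile


class FileHandleRegistry:
    """Hypothetical stand-in for the file handle registry (not the bbProxy class)."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, fd):
        # Unconditional insert: an existing entry for the same key is
        # silently overlaid, and the old fd is no longer tracked.
        self._entries[key] = fd

    def remove(self, key):
        # Returns the tracked fd, or None if no entry exists for the key.
        return self._entries.pop(key, None)


def demonstrate_overlay():
    # Create a scratch source file for the demonstration.
    scratch_fd, path = tempfile.mkstemp()
    os.close(scratch_fd)

    registry = FileHandleRegistry()
    # Assumed key layout: (handle, contribid, source path).
    key = ("handle-1", 0, path)

    # First start transfer: open the source file and register the fd.
    fd_first = os.open(path, os.O_RDONLY)
    registry.insert(key, fd_first)

    # Duplicate start transfer: same key, so fd_first is overlaid.
    fd_second = os.open(path, os.O_RDONLY)
    registry.insert(key, fd_second)

    # The duplicate fails and removes the entry it just inserted.
    os.close(registry.remove(key))

    # The first transfer completes and looks for its file handle to close.
    found = registry.remove(key)

    # Check whether the first fd is still open even though nothing tracks it.
    try:
        os.fstat(fd_first)
        first_fd_still_open = True
    except OSError:
        first_fd_still_open = False

    # Clean up so the demonstration itself does not leak.
    if first_fd_still_open:
        os.close(fd_first)
    os.remove(path)
    return found, first_fd_still_open
```

In this sketch the lookup at completion returns nothing while the first descriptor is still open, which mirrors the state described above.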

I added info-level logging for the fh and fh registry logic, and the following bbproxy log shows what is happening. The entry at timestamp 2021-04-29 13:10:25.802870 shows the overlay, and the entry at timestamp 2021-04-29 13:10:28.938363 shows where we can't find the file handle upon completion of the actual transfer.

Issue1003_Problem log.pdf

The solution being pursued is to check whether the same file handle entry already exists in the registry before inserting it. If it is a duplicate, error out the operation at that point so that we do not leak the file handle and file descriptor.
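A rough sketch of that check, again in Python and with hypothetical names (`CheckedFileHandleRegistry`, `DuplicateFileHandleError`, `start_transfer`), might look like the following. It assumes the duplicate is rejected inside the insert and that the caller closes the descriptor it just opened. The actual fix in bbProxy may be structured differently.

```python
import os
import tempfile


class DuplicateFileHandleError(Exception):
    """Hypothetical error raised when a key is already registered."""


class CheckedFileHandleRegistry:
    """Hypothetical registry that refuses to overlay an existing entry."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, fd):
        # Check for an existing entry before inserting.
        if key in self._entries:
            raise DuplicateFileHandleError(f"duplicate file handle for {key!r}")
        self._entries[key] = fd

    def remove(self, key):
        return self._entries.pop(key, None)


def start_transfer(registry, key, path):
    fd = os.open(path, os.O_RDONLY)
    try:
        registry.insert(key, fd)
    except DuplicateFileHandleError:
        # Error out here and close the new fd so that nothing leaks.
        os.close(fd)
        raise
    return fd


def run_duplicate_scenario():
    scratch_fd, path = tempfile.mkstemp()
    os.close(scratch_fd)
    registry = CheckedFileHandleRegistry()
    # Assumed key layout: (handle, contribid, source path).
    key = ("handle-1", 0, path)

    fd_first = start_transfer(registry, key, path)
    try:
        start_transfer(registry, key, path)
        duplicate_rejected = False
    except DuplicateFileHandleError:
        duplicate_rejected = True

    # The first transfer completes and still finds its own file handle.
    found = registry.remove(key)
    first_found = found == fd_first
    if found is not None:
        os.close(found)
    os.remove(path)
    return duplicate_rejected, first_found
```

With the check in place, the duplicate request fails before it can disturb the first entry, so the first transfer can still find and close its descriptor when it completes.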