IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Fix for issue #1003 #1004

Closed dlherms-ibm closed 3 years ago

dlherms-ibm commented 3 years ago

This fix addresses issue #1003. The solution is to prevent a duplicate file handle from being added to the registry.

The change to fh.cc modifies the addFilehandle() method. Previously, it could only return zero. With this change, it can return -1 to indicate a duplicate file handle entry. For bbProxy, that -1 is returned directly to the caller. Due to restart logic, bbServer can legitimately hit this case when it attempts to reuse a transfer definition that was already stored in the metadata. In that case, we attempt to release the existing entry and then insert the new file handle.
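To illustrate the two behaviors described above, here is a minimal sketch of a registry whose add operation rejects a duplicate for bbProxy and replaces the stale entry for bbServer. Only the name addFilehandle() and the 0 / -1 return values come from this PR; `FhRegistry`, `FhKey`, `FileHandle`, and `Role` are hypothetical stand-ins, and the actual code in fh.cc may be structured quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical key: (jobid, handle, contribid, index). The real key layout is an assumption.
using FhKey = std::tuple<uint64_t, uint64_t, uint32_t, uint32_t>;

// Hypothetical stand-in for the object stored in the file handle registry.
struct FileHandle {
    int fd;
};

// Hypothetical switch for which daemon the registry is running in.
enum class Role { Proxy, Server };

class FhRegistry {
public:
    explicit FhRegistry(Role role) : role_(role) {}

    // Returns 0 on success. Returns -1 if the key is already registered and
    // we are running as the proxy. As the server, an existing entry is
    // released (here: simply erased) and the new file handle is inserted.
    int addFilehandle(const FhKey& key, FileHandle fh) {
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            if (role_ == Role::Proxy) {
                return -1;  // duplicate: leave the existing entry untouched
            }
            entries_.erase(it);  // restart case: drop the stale entry first
        }
        entries_.emplace(key, fh);
        return 0;
    }

    int fdFor(const FhKey& key) const { return entries_.at(key).fd; }
    std::size_t size() const { return entries_.size(); }

private:
    Role role_;
    std::map<FhKey, FileHandle> entries_;
};

// Small usage example: the proxy keeps the first entry, the server keeps the last.
void exampleDuplicateAdd() {
    const FhKey key{1, 2, 3, 0};

    FhRegistry proxy(Role::Proxy);
    assert(proxy.addFilehandle(key, FileHandle{10}) == 0);
    assert(proxy.addFilehandle(key, FileHandle{11}) == -1);
    assert(proxy.fdFor(key) == 10);

    FhRegistry server(Role::Server);
    assert(server.addFilehandle(key, FileHandle{10}) == 0);
    assert(server.addFilehandle(key, FileHandle{11}) == 0);
    assert(server.fdFor(key) == 11);
}
```

In this sketch, "releasing" the existing entry is reduced to erasing it from the map; real code would presumably also close or otherwise clean up the underlying file handle.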

addFilehandle() is invoked from four locations in the code: three in LVUtils.cc and one in xfer.cc. In LVUtils.cc, one location adds the source index to the file handle registry; it was changed to check the new return value. The other two locations add the target index, and the code just prior to each add already checks for a duplicate, so those two invocations are unchanged. The xfer.cc invocation runs within bbServer, where the change in fh.cc should suffice.

Another minor change to fh.cc makes it log the correct fd value being closed. The prior code always logged -1 as the fd being closed.

Previously, when a start transfer failed, bbProxy closed all file handles associated with the job, handle, and contribid. Instead of unconditionally closing all of them, we now keep track of the file handles that this instance of start transfer has opened. Upon failure, the code removes/closes only those file handles.
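The rollback behavior can be sketched as follows. This is a simplified illustration, not the bbProxy code: `Registry`, `startTransfer()`, and the use of a bare index as the registry key are all hypothetical, and real file handles would also need to be closed rather than just erased.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical registry keyed by a single index, standing in for the
// job/handle/contribid/index key used by the real file handle registry.
class Registry {
public:
    // Returns 0 on success, -1 if the index is already registered.
    int add(uint32_t index) { return open_.insert(index).second ? 0 : -1; }
    void remove(uint32_t index) { open_.erase(index); }
    bool contains(uint32_t index) const { return open_.count(index) != 0; }
    std::size_t size() const { return open_.size(); }

private:
    std::set<uint32_t> open_;
};

// Hypothetical start transfer: registers each index in order. On the first
// failure it removes only the entries added by THIS call and returns -1,
// leaving entries owned by earlier, successful calls in place.
int startTransfer(Registry& reg, const std::vector<uint32_t>& indexes) {
    std::vector<uint32_t> openedHere;  // handles opened by this instance only
    for (uint32_t idx : indexes) {
        if (reg.add(idx) != 0) {
            for (uint32_t mine : openedHere) {
                reg.remove(mine);
            }
            return -1;
        }
        openedHere.push_back(idx);
    }
    return 0;
}

// Usage example: a failing second call must not disturb the first call's entries.
void exampleRollback() {
    Registry reg;
    assert(startTransfer(reg, {0, 1}) == 0);
    assert(startTransfer(reg, {2, 3, 1}) == -1);  // 1 is a duplicate
    assert(reg.contains(0) && reg.contains(1));   // first call's entries survive
    assert(!reg.contains(2) && !reg.contains(3)); // second call's entries rolled back
}
```

Tracking the opened handles locally, rather than cleaning up by job, handle, and contribid, is what keeps a failed retry from closing file handles that an earlier start transfer still owns.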

The attached annotated bbProxy log shows the originally failing scenario now succeeding with the code changes: Issue1003_WithSolutionLog.pdf

dlherms-ibm commented 3 years ago

Pull request created, but additional testing is required...

tgooding commented 3 years ago

Looks good, merging.