LLNL / UnifyFS

UnifyFS: A file system for burst buffers
Other
102 stars 31 forks source link

Probable Incorrect Error Handling in collective_create() #761

Open rgmiller opened 1 year ago

rgmiller commented 1 year ago

System information

Type Version/Name
Operating System All of them, I suspect
OS Version
Architecture
UnifyFS Version

Describe the problem you're observing

In collective_create() (around about line 303) we attempt to acquire a child request handle. If we can't get a handle for some reason, we do NOT clean up and exit (as we do with most other error conditions). Instead, we log an error and set the pointer to NULL, but otherwise continue processing. If no other errors occur, then the function will return a pointer to a coll_request struct, but that struct will contain an unexpected NULL pointer.

Describe how to reproduce the problem

The bug is in the error handling for this function, so you'd have to trigger the specific error. I'm not sure how you'd do that (and I rather suspect it's never happened on a production system).

Include any warning or errors or releveant debugging data

This was discovered during a code inspection as part of the work on #758. For reference, Michael Brim's comment about it is reproduced below:

Regarding the TODO, the reason we don't return immediately in that error case is because that would leak the allocated coll_request structure and its member array allocations. However, it looks like we're missing the logic to notice that we failed to get one of the child handles and need to cleanup the structure before returning NULL. Unfortunately, it appears it's not as easy as just calling collective_cleanup(), since that would prematurely free the original group rpc request handle, which we use in the callers to collective_create() to send back an error. This is complicated enough that we probably should open an issue and address in a separate PR.