LLNL / UnifyFS

UnifyFS: A file system for burst buffers
Other
99 stars 31 forks source link

Concurrency bug in test_remove_file() #808

Open wangvsa opened 5 months ago

wangvsa commented 5 months ago

Describe the problem you're observing

I was doing some testing using the write-static code and found a subtle bug in https://github.com/LLNL/UnifyFS/blob/58ece4441716678f5111a6dbff9baadd6188c2b6/examples/src/testutil.h#L1251

Describe how to reproduce the problem

When the test file already exists, setting the "remove and reuse test file" flag (-r) would easily trigger the bug. mpirun -n 2 write-static -n 1 -r

Issue

At the end of the test_remove_file(), we do a barrier() to sync processes. The problem is, there is a potential execution path where one or more processes may exit the function earlier, causing wrong barrier matching. https://github.com/LLNL/UnifyFS/blob/58ece4441716678f5111a6dbff9baadd6188c2b6/examples/src/testutil.h#L1262-L1268

This part of code assumes all ranks check the existence of file, get rc = 0, and continue. Problematic execution sequence: Rank 0 calls stat(), get rc = 0, then go ahead delete the file. Then Rank 1 came, its stat() call will return -1 since Rank 0 has already deleted the file. Then Rank 1 exits the function without calling the barrier() at the end.

Fix

Add a barrier immediately after stat() to make sure everyone sees the same result.

Overall, we should be careful when using "return" if we will do a barrier later. We need to make sure all processes will always follow the same path: either all return, or all call barrier.