LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Potential deadlock caused by concurrent sync calls #810

Open wangvsa opened 5 months ago

Describe the problem you're observing

I'm observing TIMEOUT errors when staging in many files simultaneously. It seems that concurrent unifyfs_sync() calls may cause a deadlock on the server side. After some investigation, I found that the servers block in the process_pending_sync call in the following case:

client A on server 0 --> write/sync file 1 --> owner is server 1

client B on server 1 --> write/sync file 2 --> owner is server 0

https://github.com/LLNL/UnifyFS/blob/58ece4441716678f5111a6dbff9baadd6188c2b6/server/src/unifyfs_service_manager.c#L1479-L1483
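To make the suspected circular wait concrete, here is a minimal sketch that models the scenario above as a wait-for graph and checks it for a cycle. It is not UnifyFS code. It assumes (unconfirmed) that a server processing a pending sync blocks until the file's owner server replies and cannot service incoming requests in the meantime. The names `find_wait_cycle`, `pending_syncs`, and `waits_for` are hypothetical and exist only for illustration.

```python
def find_wait_cycle(waits_for):
    """Return a list of servers forming a wait cycle, or None if there is none.

    waits_for maps a blocked server to the single server it is waiting on.
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of "is waiting on" edges until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to a visited server: report only the loop.
            return path[path.index(node):] + [node]
    return None


# Hypothetical model of the two concurrent syncs from the report:
#   client A on server 0 syncs file 1 (owner: server 1)
#   client B on server 1 syncs file 2 (owner: server 0)
pending_syncs = [
    {"local_server": 0, "file": 1, "owner": 1},
    {"local_server": 1, "file": 2, "owner": 0},
]

# Assumption: a server blocks on the owner whenever the owner is remote.
waits_for = {
    s["local_server"]: s["owner"]
    for s in pending_syncs
    if s["local_server"] != s["owner"]
}

cycle = find_wait_cycle(waits_for)
print(cycle)  # [0, 1, 0]
```

Under that assumption, server 0 waits on server 1 while server 1 waits on server 0, so neither sync can complete until a timeout fires. If only one of the two syncs were in flight, the graph would have no cycle.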

@MichaelBrim Is this the cause? Any idea how to fix this?