LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Add locking around server data structures referenced by client rpc handlers #198

Open adammoody opened 5 years ago

adammoody commented 5 years ago

Describe the problem you're observing

The server used to process commands from clients by polling on a pool of open sockets. Each socket event was processed one at a time. Thus, commands were serialized so that a client request was handled completely before starting to process a command from another client.

Now we are using Mercury RPC calls with Argobots, and it's not clear whether those will be serialized or whether they can run in parallel. My guess is that they run in parallel, and in that case we need to add read/write locking around changes to shared data structures, like adding a new client app_config to the list of active app_config structures during MOUNT.

If that's the case, we need to consider all possible such data races that might come up due to processing of multiple client rpcs in parallel.

Describe how to reproduce the problem

Include any warnings, errors, or relevant debugging data

jgmoore-or commented 5 years ago

If the RPC dispatch ULTs all run in the same Mercury pool, then locking/critical sections in the handlers' code may not be required. Argobots ULTs run until they voluntarily yield, which occurs when they make mercury/margo calls and/or upon completion of the handler function.

adammoody commented 5 years ago

Thanks @jgmoore-or . Good to hear from you, and thanks for still following along here. Hope things are going well.

I figured things might not be as bad with preemption and related locking given Argobots' user-level threads, but I was taking the conservative approach. Sounds like life might be easier than that.

jgmoore-or commented 5 years ago

I think the RPC handlers currently in the UnifyCR code do all of their data-structure accesses in non-preempted code segments.

The only place I had a concern was for server-to-server RPC. In UnifyCR, the metadata tier (MDHIM) and the file transfer tier work differently.

With Mercury/Margo, a potential sharing-contention problem can occur if a local lookup (running in the client RPC handler pool) accesses data structures at the same time as a remote RPC (running in the server-to-server handler pool). Phil Carns told me there's a way to have the remote request funneled into the client handlers' pool (in another ULT than the one servicing the server-to-server RPC handler). This would allow structure access to be single threaded for combined local/remote access.

I faced this issue in the prototype, but it's not handled in the UnifyCR implementation yet (AFAIK). If MDHIM/HXHIM switches to Margo, the problem will probably be encountered.


kathrynmohror commented 5 years ago

@jgmoore-or JOSEPH!!!!! great to hear from you and thanks for your help :-)

dsikich commented 5 years ago

@jgmoore-or Thanks Joseph. This information will be very helpful.