LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Scalability improvements and a few bug fixes #785

Closed MichaelBrim closed 1 year ago

MichaelBrim commented 1 year ago

Description

The overall theme of this PR is to improve scalability by reducing the number of client requests that result in an RPC to a file owner. Mostly, this is done by identifying when many clients on a node request the same information, and making sure the node's local server sends only a single remote request to get that information from the owner. Similarly, when clients update a file (e.g., by adding new extents), the local server now batches the updates for its node. In general, this reduces the number of requests that reach the owner from O(# clients) to O(# nodes).
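As a rough illustration of the idea (not the actual UnifyFS code, which is written in C and handles requests asynchronously), the sketch below models one local server per node that coalesces identical file size lookups and batches extent updates before contacting the owner. All names here (`Owner`, `LocalServer`, `request_filesize`, `queue_extent`, `progress`, `simulate`) are hypothetical and exist only for this example.

```python
from collections import defaultdict


class Owner:
    """Hypothetical stand-in for the file owner server; counts incoming RPCs."""

    def __init__(self, sizes):
        self.sizes = sizes                  # gfid -> file size
        self.rpcs = 0                       # number of requests that reached the owner
        self.extents = defaultdict(list)    # gfid -> list of (offset, length)

    def get_filesize(self, gfid):
        self.rpcs += 1
        return self.sizes[gfid]

    def add_extents(self, gfid, extents):
        self.rpcs += 1
        self.extents[gfid].extend(extents)


class LocalServer:
    """Hypothetical node-local server that coalesces and batches client requests."""

    def __init__(self, owner):
        self.owner = owner
        self.waiters = defaultdict(list)    # gfid -> clients waiting on the same lookup
        self.batched = defaultdict(list)    # gfid -> extents not yet sent to the owner

    def request_filesize(self, client, gfid):
        # Record the waiter; no remote request is sent yet.
        self.waiters[gfid].append(client)

    def queue_extent(self, gfid, extent):
        # Accumulate the update locally instead of sending it right away.
        self.batched[gfid].append(extent)

    def progress(self):
        """Send one lookup and one batched update per gfid, then answer all waiters."""
        replies = {}
        for gfid, clients in self.waiters.items():
            size = self.owner.get_filesize(gfid)     # single RPC for all waiters
            for client in clients:
                replies[client] = size
        self.waiters.clear()
        for gfid, extents in self.batched.items():
            self.owner.add_extents(gfid, extents)    # single RPC for the whole batch
        self.batched.clear()
        return replies


def simulate(num_nodes, ppn, gfid=42):
    """Return (RPCs seen by the owner, number of clients answered)."""
    owner = Owner({gfid: 4096})
    answered = 0
    for node in range(num_nodes):
        server = LocalServer(owner)
        for rank in range(ppn):
            client = (node, rank)
            offset = (node * ppn + rank) * 1024
            server.request_filesize(client, gfid)
            server.queue_extent(gfid, (offset, 1024))
        answered += len(server.progress())
    return owner.rpcs, answered
```

In this toy model, `simulate(4, 8)` answers all 32 clients while the owner sees only 8 RPCs (one lookup and one batched update per node), where forwarding every client request individually would produce 64.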

This PR also includes some code cleanup (removing last vestiges of MPI and MDHIM from the server) and a few minor bug fixes.

Motivation and Context

At higher numbers of clients (above 2k) on Frontier, we were seeing client request timeouts due to the serialized processing of these requests at the owner server.

How Has This Been Tested?

With these changes, UnifyFS examples with up to 8k clients (8 ppn @ 1k nodes, or 32 ppn @ 256 nodes) passed more often than before. There is still more work to do on multithreading the service manager that processes the file owner requests.
