KenBirman opened this issue 3 years ago
I didn't originally write the ExternalGroup code, but I'm confident that any member of the group can handle external clients equally well. If a non-leader node receives a join request that specifies it is an "external client request," it simply adds that client to its set of external P2P connections. The external client does need to initially contact a "well-known" node, typically the group leader, in order to request the current View and discover the other members, but once it has done that it can choose any node to send its RPC requests to. In fact, it's up to the user-level code that constructs and uses the ExternalGroup to choose which group member it uses as its proxy, since the p2p_send method must be called with a "destination node ID" argument.
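To make that flow concrete, here is a minimal, self-contained sketch of the client-side logic described above. Nothing in it is Derecho's real API: the "View" is just a list of node IDs, and the p2p_send step is only indicated by a print, so the snippet compiles on its own.

```cpp
// Illustrative sketch only: mirrors the flow described above (contact a
// well-known node, learn the View, then pick any member as the p2p_send
// target). These names are placeholders, not Derecho's actual headers.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

using node_id_t = uint32_t;

// Step 1 (not shown): contact the well-known node, e.g. the group leader,
// and obtain the current View, i.e. the list of member node IDs.
// Step 2: user-level code picks whichever member it wants as its proxy,
// because p2p_send takes an explicit destination node ID.
node_id_t pick_proxy(const std::vector<node_id_t>& members) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> pick(0, members.size() - 1);
    return members[pick(rng)];
}

int main() {
    std::vector<node_id_t> view = {0, 1, 2, 3};  // pretend View from the leader
    node_id_t proxy = pick_proxy(view);
    std::cout << "would issue p2p_send RPCs to node " << proxy << "\n";
}
```

The point is just that the proxy choice is a user-level policy: random selection is the simplest, but anything that returns a valid member ID would work.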
Edward, do we end up with one TCP connection per node the external client interacts with? Or one connection to a proxy, which then relays the message?
The external client maintains one connection per node it interacts with, and in fact they can be RDMA connections where the hardware supports it. ExternalGroup uses the same P2PConnectionManager object as RPCManager, which sets up a mini-SST with each remote node to deliver messages. This SST will use either TCP or RDMA, depending on how libfabric is configured on the external client. (The main difference between ExternalGroup and RPCManager is that ExternalGroup only sets up a P2PConnection with a node when it actually wants to send a message to it, whereas RPCManager sets up a P2PConnection with every member of the group as soon as it starts up.)
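To illustrate the lazy-versus-eager distinction Edward describes, here is a conceptual sketch. None of this is the actual P2PConnectionManager code; the class names are placeholders standing in for the per-node transport.

```cpp
// Conceptual sketch: RPCManager-style eager setup connects to every member
// at startup, while ExternalGroup-style lazy setup only opens a connection
// (TCP or RDMA, whatever the fabric provides) on the first send to a node.
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <vector>

using node_id_t = uint32_t;

// Stand-in for the per-node "mini-SST" transport endpoint.
struct P2PConnection {
    explicit P2PConnection(node_id_t remote) {
        std::cout << "opened connection to node " << remote << "\n";
    }
};

class LazyConnectionPool {
    std::map<node_id_t, std::unique_ptr<P2PConnection>> conns_;
public:
    // ExternalGroup behavior: connect only when we actually send to a node.
    P2PConnection& get_or_connect(node_id_t remote) {
        auto& slot = conns_[remote];
        if (!slot) slot = std::make_unique<P2PConnection>(remote);
        return *slot;
    }
};

int main() {
    std::vector<node_id_t> members = {0, 1, 2, 3};

    // RPCManager behavior (eager): one connection per member at startup.
    std::map<node_id_t, std::unique_ptr<P2PConnection>> eager;
    for (node_id_t m : members) eager[m] = std::make_unique<P2PConnection>(m);

    // ExternalGroup behavior (lazy): connections appear only as needed.
    LazyConnectionPool pool;
    pool.get_or_connect(2);  // first RPC to node 2 triggers the connection
    pool.get_or_connect(2);  // subsequent sends reuse it
}
```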
Based on Weijia's experiments, we know that something about the initial binding is causing external clients to run very slowly -- perhaps they all bind to member 0 of the top-level view, due to the way we advertise group "contacts" in the config file. But he found that if he simply connects, does some work, then disconnects and resumes his experiments, performance is far higher.
This has me thinking we could make that behavior a standard feature, because in some ways an initial reconnect makes sense no matter what.
So, with that in mind, I want to propose a really simple API change that Weijia could implement in half an hour.
For example, in Alicia's code, an external client that is a long-lived IoT host, like the server on Julio's farm, would know the affinity keys or object keys associated with its IoT devices, and it anticipates doing puts on those keys. So it could map each affinity key to a specific shard, then pick a random member of that shard as its proxy. Her put and trigger operations would run a lot faster because they wouldn't need to be relayed.
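A hedged sketch of that proxy-selection policy follows; the key-to-shard hash and the ShardLayout type are placeholders for whatever mapping the service actually uses.

```cpp
// Sketch only: hash the affinity key to a shard index, then pick a random
// member of that shard as the p2p_send target, so puts land directly on a
// shard member instead of being relayed through an unrelated contact node.
#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <vector>

using node_id_t = uint32_t;

// One entry per shard, listing the node IDs of its current members
// (as learned from the View / subgroup layout).
using ShardLayout = std::vector<std::vector<node_id_t>>;

node_id_t pick_shard_proxy(const std::string& affinity_key,
                           const ShardLayout& shards) {
    // Map the affinity key to a shard. A real deployment would use the
    // service's own key-to-shard mapping; std::hash is just a placeholder.
    size_t shard = std::hash<std::string>{}(affinity_key) % shards.size();

    // Pick a random member of that shard to act as the proxy.
    static std::mt19937 rng{std::random_device{}()};
    const auto& members = shards[shard];
    std::uniform_int_distribution<size_t> pick(0, members.size() - 1);
    return members[pick(rng)];
}
```

With a proxy chosen this way, each put or trigger goes straight to a node that holds the shard, so nothing needs to be relayed on the server side.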