KenBirman opened this issue 3 years ago
I didn't originally write the ExternalGroup code, but I'm confident that any member of the group can handle external clients equally well. If a non-leader node receives a join request that specifies it is an "external client request," it simply adds that client to its set of external P2P connections. The external client does need to initially contact a "well-known" node, typically the group leader, in order to request the current View and discover the other members, but once it has done that it can choose any node to send its RPC requests to. In fact, it's up to the user-level code that constructs and uses the ExternalGroup to choose which group member it uses as its proxy, since the p2p_send method must be called with a "destination node ID" argument.
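To make that flow concrete, here is a minimal, self-contained sketch of the client-side logic described above. Nothing in it is Derecho's real API: the "View" is just a list of node IDs, and the p2p_send step is only indicated by a print, so the snippet compiles on its own.

```cpp
// Illustrative sketch only: mirrors the flow described above (contact a
// well-known node, learn the View, then pick any member as the p2p_send
// target). These names are placeholders, not Derecho's actual headers.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

using node_id_t = uint32_t;

// Step 1 (not shown): contact the well-known node, e.g. the group leader,
// and obtain the current View, i.e. the list of member node IDs.
// Step 2: user-level code picks whichever member it wants as its proxy,
// because p2p_send takes an explicit destination node ID.
node_id_t pick_proxy(const std::vector<node_id_t>& members) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> pick(0, members.size() - 1);
    return members[pick(rng)];
}

int main() {
    std::vector<node_id_t> view = {0, 1, 2, 3};  // pretend View from the leader
    node_id_t proxy = pick_proxy(view);
    std::cout << "would issue p2p_send RPCs to node " << proxy << "\n";
}
```

The point is just that the proxy choice is a user-level policy: random selection is the simplest, but anything that returns a valid member ID would work.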
Edward, do we end up with one TCP connection per node the external client interacts with? Or one connection to a proxy, which then relays the message?
The external client maintains one connection per node it interacts with, and in fact they can be RDMA connections where the hardware supports it. ExternalGroup uses the same P2PConnectionManager object as RPCManager, which sets up a mini-SST with each remote node to deliver messages. This SST will use either TCP or RDMA, depending on how libfabric is configured on the external client. (The main difference between ExternalGroup and RPCManager is that ExternalGroup only sets up a P2PConnection with a node when it actually wants to send a message to it, whereas RPCManager sets up a P2PConnection with every member of the group as soon as it starts up.)
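To illustrate the lazy-versus-eager distinction Edward describes, here is a conceptual sketch. None of this is the actual P2PConnectionManager code; the class names are placeholders standing in for the per-node transport.

```cpp
// Conceptual sketch: RPCManager-style eager setup connects to every member
// at startup, while ExternalGroup-style lazy setup only opens a connection
// (TCP or RDMA, whatever the fabric provides) on the first send to a node.
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <vector>

using node_id_t = uint32_t;

// Stand-in for the per-node "mini-SST" transport endpoint.
struct P2PConnection {
    explicit P2PConnection(node_id_t remote) {
        std::cout << "opened connection to node " << remote << "\n";
    }
};

class LazyConnectionPool {
    std::map<node_id_t, std::unique_ptr<P2PConnection>> conns_;
public:
    // ExternalGroup behavior: connect only when we actually send to a node.
    P2PConnection& get_or_connect(node_id_t remote) {
        auto& slot = conns_[remote];
        if (!slot) slot = std::make_unique<P2PConnection>(remote);
        return *slot;
    }
};

int main() {
    std::vector<node_id_t> members = {0, 1, 2, 3};

    // RPCManager behavior (eager): one connection per member at startup.
    std::map<node_id_t, std::unique_ptr<P2PConnection>> eager;
    for (node_id_t m : members) eager[m] = std::make_unique<P2PConnection>(m);

    // ExternalGroup behavior (lazy): connections appear only as needed.
    LazyConnectionPool pool;
    pool.get_or_connect(2);  // first RPC to node 2 triggers the connection
    pool.get_or_connect(2);  // subsequent sends reuse it
}
```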
Based on Weijia's experiments, we know that something about the initial binding is causing external clients to run very slowly -- perhaps they all bind to member 0 of the top-level view, due to the way we advertise group "contacts" in the config file. But he found that if he simply connects, does some work, then disconnects and resumes his experiments, performance is far higher.
This has me thinking we could make that behavior a standard feature, because in some ways an initial reconnect makes sense no matter what.
So, with that in mind, I want to propose a really simple API change that Weijia could implement in half an hour.
For example, in Alicia's code, an external client that is a long-lived IoT host, like the server on Julio's farm, would know the affinity keys or object keys associated with its IoT devices, and it anticipates doing puts on those keys. So it could map each affinity key to a specific shard, then pick a random member of that shard as its proxy. Her put and trigger operations would run a lot faster because they wouldn't need to be relayed.
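A hedged sketch of that proxy-selection policy follows; the key-to-shard hash and the ShardLayout type are placeholders for whatever mapping the service actually uses.

```cpp
// Sketch only: hash the affinity key to a shard index, then pick a random
// member of that shard as the p2p_send target, so puts land directly on a
// shard member instead of being relayed through an unrelated contact node.
#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <vector>

using node_id_t = uint32_t;

// One entry per shard, listing the node IDs of its current members
// (as learned from the View / subgroup layout).
using ShardLayout = std::vector<std::vector<node_id_t>>;

node_id_t pick_shard_proxy(const std::string& affinity_key,
                           const ShardLayout& shards) {
    // Map the affinity key to a shard. A real deployment would use the
    // service's own key-to-shard mapping; std::hash is just a placeholder.
    size_t shard = std::hash<std::string>{}(affinity_key) % shards.size();

    // Pick a random member of that shard to act as the proxy.
    static std::mt19937 rng{std::random_device{}()};
    const auto& members = shards[shard];
    std::uniform_int_distribution<size_t> pick(0, members.size() - 1);
    return members[pick(rng)];
}
```

With a proxy chosen this way, each put or trigger goes straight to a node that holds the shard, so nothing needs to be relayed on the server side.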