[An idea that originated with a comment by Joe Isrealevitz]
Consider P who sends one SMC message to a shard with 3 members {P,Q,R}. How many RDMA operations occur?
It turns out to be 10:
1) verbs to write the data: 2 (one one-sided transfer per remote member)
2) verbs to write the full-slots counter: 2 (again, one per remote member)
3) each of the three receivers, P included, updates and pushes its received-messages count to the two remote members: 3*2 = 6
... so the TOR switch sees 10 RDMA messages
A Mellanox ConnectX-4 100Gbps switch is rated for 75M messages per second. So 75M/10 gives us 7.5M SMC multicasts per second, which is pretty much the peak we saw in the TOCS paper when we graphed small-message performance.
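To make the accounting concrete, here is a small sketch of the same arithmetic, generalized to a shard of n members (n = 3 reproduces the 10 operations and the 7.5M/second figure above). It only restates the numbers in this note; it is not Derecho code.

```cpp
#include <cstdint>
#include <cstdio>

// Per-multicast RDMA operation count for one SMC send in a shard of n members,
// following the accounting above (n = 3 gives 2 + 2 + 6 = 10).
constexpr uint64_t rdma_ops_per_multicast(uint64_t n) {
    uint64_t data_writes    = n - 1;        // one one-sided write of the data per remote member
    uint64_t counter_writes = n - 1;        // one write of the full-slots counter per remote member
    uint64_t recv_pushes    = n * (n - 1);  // each receiver pushes its received count to the others
    return data_writes + counter_writes + recv_pushes;
}

int main() {
    constexpr uint64_t msgs_per_sec = 75'000'000;        // rated message rate from the text
    constexpr uint64_t ops = rdma_ops_per_multicast(3);  // 10
    std::printf("RDMA ops per SMC multicast: %llu\n", (unsigned long long)ops);
    std::printf("Peak SMC multicasts/sec:    %llu\n",
                (unsigned long long)(msgs_per_sec / ops));  // 7.5M
}
```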
Solution? A sender smart enough to batch can do far better (in fact, it can end up with large messages that run over RDMC, which would blaze). But not every sender will be able to do this, so I propose that we add a simple Nagle-like feature to help with batching.
In our configuration file, we add a new option: SMC_batch_size=k, e.g. k=25 (if the window size is 1000, k=250 could make even more sense). Default = 1; minimum value = 1; maximum = the SMC window size.
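A minimal sketch of clamping the new option to that range; SMC_batch_size and the helper below are illustrations of the proposal, not existing Derecho configuration keys or functions.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical: clamp the configured SMC_batch_size to [1, SMC window size].
uint32_t clamp_smc_batch_size(uint64_t configured_k,    // value of SMC_batch_size, default 1
                              uint32_t smc_window_size) {
    if (configured_k < 1) configured_k = 1;                             // minimum = 1
    return (uint32_t)std::min<uint64_t>(configured_k, smc_window_size); // maximum = window size
}
```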
In our code, suppose we are sending the SMC message with sequence number "seqn". Today we issue RDMA operations aggressively, on every message. With this change, we would only push the slot and counter when seqn % k == 0 (sketched below, together with the receive-side rule).
Same for receives: the received-messages count can already jump by more than one at a time (receive-side batching), but the rule would be to push only when it has crossed a k-message boundary.
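Here is a minimal sketch of both rules. The names are illustrative, not actual Derecho identifiers; the caller is assumed to perform the actual SST push when these return true.

```cpp
#include <cstdint>

// Hypothetical batching rules for the proposal above.
struct SmcBatcher {
    uint64_t k;                     // SMC_batch_size from the config
    uint64_t last_pushed_recv = 0;  // received count at the time of the last push

    // Send side: the slot is still filled locally on every send, but the one-sided
    // push of the slot and counter goes out only once every k messages.
    bool should_push_on_send(uint64_t seqn) const {
        return seqn % k == 0;
    }

    // Receive side: the received count can jump by more than one, so push whenever
    // it has crossed a k-message boundary since the last push.
    bool should_push_on_receive(uint64_t new_recv_count) {
        if (new_recv_count / k != last_pushed_recv / k) {
            last_pushed_recv = new_recv_count;
            return true;
        }
        return false;
    }
};
```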
Concern: an app that sets k=25 but then sends only one message for some reason would wait until something accidentally pushes the dirty row. Right now that something would be the keep-alive SST uses to sense a hung group member: in SST, we have a trigger that periodically updates a "health counter" and pushes the whole row. But this runs only rarely.
Proposal: if there is a pending inhibited push, either from sending an SMC or from an inhibited push on a receive, set a count-down to some sensible value that the predicate-evaluation thread can check. Have it do a forced push after a suitable delay (maybe 150us?) if we haven't cleared the counter by doing a push before then.
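A sketch of that countdown, again with hypothetical names; push_row() stands in for whatever one-sided SST push the real code performs, and the sketch assumes calls are serialized or protected by the caller's existing locking.

```cpp
#include <chrono>

// Hypothetical forced-push countdown, polled by the predicate-evaluation thread.
struct DelayedPush {
    using Clock = std::chrono::steady_clock;

    bool pending = false;           // true if a push was inhibited and not yet flushed
    Clock::time_point deadline;     // when the forced push should happen

    void on_inhibited_push() {      // a send- or receive-side push was suppressed
        if (!pending) {
            pending = true;
            deadline = Clock::now() + std::chrono::microseconds(150);  // suggested delay
        }
    }

    void on_actual_push() {         // any real push clears the countdown
        pending = false;
    }

    void poll(void (*push_row)()) { // called from the predicate-evaluation loop
        if (pending && Clock::now() >= deadline) {
            push_row();             // forced push of the dirty row
            pending = false;
        }
    }
};
```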
Impact: the number of RDMA messages per SMC send will drop for two reasons. First, the overhead messages (the counter push and the received-messages-count pushes) are reduced by a factor of k; they drop into the noise by the time k=25 or k=250. Second, one RDMA push will now carry multiple SMC messages, giving a further benefit.
For a case where the 75M/second limit was a big factor, this will yield a huge speedup (maybe even more than k-fold) for apps that send SMC messages at very high rates! Of course, the actual RDMA bandwidth is a limiting factor too. With k=250 we won't see 1.9B SMC messages/second, because if messages were (as an example) 100B in size, that would be 190GB/second, and we know our bidirectional peak on 100Gbps RDMA is limited to 200Gbps = 25GB/s. So in fact we can't do better than 250M SMC messages per second for data of that size. Still, that would be 4x better than we get right now.
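The same ceiling as a worked calculation, using only the numbers quoted above (the 100B message size is just the example from this paragraph):

```cpp
#include <cstdio>

int main() {
    constexpr double baseline_mcast_per_sec = 7.5e6;  // today's ~7.5M SMC multicasts/sec
    constexpr double k = 250;                          // batch factor
    constexpr double msg_bytes = 100;                  // example message size
    constexpr double bidir_peak_bytes = 25e9;          // 200Gbps bidirectional = 25GB/s

    double message_limited  = baseline_mcast_per_sec * k;     // ~1.9B msgs/sec if only msg-rate limited
    double bandwidth_needed = message_limited * msg_bytes;    // ~190GB/sec, well past the peak
    double bandwidth_ceiling = bidir_peak_bytes / msg_bytes;  // ~250M msgs/sec actual ceiling

    std::printf("message-rate limit:  %.2g msgs/s\n", message_limited);
    std::printf("bandwidth needed:    %.3g GB/s\n", bandwidth_needed / 1e9);
    std::printf("bandwidth ceiling:   %.3g msgs/s\n", bandwidth_ceiling);
}
```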
We could also offer a user-accessible SMC::flush() to avoid even that small forced-push delay when the application has generated a burst of messages and now realizes the burst has reached its end. SMC::flush() would be an actual SMC multicast that is (1) sent urgently and (2) on receipt, causes the receiver to flush its received-messages counters urgently as well.
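A sketch of what that user-facing call might look like; the class, method names, and stub helpers below are hypothetical placeholders for the proposal, not an existing Derecho API.

```cpp
#include <cstdio>

// Hypothetical sketch of the proposed SMC::flush().
class SMC {
public:
    void flush() {
        force_push_pending_sends();  // push any dirty slot/counter state immediately
        send_flush_marker();         // an actual SMC multicast, sent urgently
    }
    // Receiver side: delivering the flush marker also flushes the received counters.
    void on_flush_marker_received() {
        force_push_received_counts();
    }
private:
    // Stubs standing in for whatever the real implementation would do.
    void force_push_pending_sends()   { std::puts("push pending send-side row"); }
    void send_flush_marker()          { std::puts("send urgent flush multicast"); }
    void force_push_received_counts() { std::puts("push received-messages counters"); }
};
```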