Realm: all-to-all communication is slow

For the past few months I've been working on a program that needs all-to-all exchanges, and Realm doesn't seem to perform distributed all-to-all communication efficiently. To understand what an efficient implementation would look like, I rewrote the all-to-all exchange code using NCCL and compared it with the Realm-based implementation. On a DGX-1V cluster (where each node has 4 IB NICs), Realm's all-to-all exchange turned out to be at least an order of magnitude slower (14X-16X) than NCCL's.

I wouldn't expect Realm in its current shape to match NCCL on collective operations, since it only maintains a window of DMA requests and has no global view of the client's communication pattern. (Plus, GASNet prevents it from using GPUDirect, which is key to multi-node performance with GPUs.) Nevertheless, the bandwidth achieved with Realm is underwhelming: the aggregate bandwidth was approximately 2.3 GB/s (37.24 GB/s with NCCL, divided by the 16.51X slowdown), which is low even for a single NIC, whose peak bandwidth is 12.5 GB/s. (I didn't see any noticeable difference from enabling multi-rail support in GASNet.) This suggests that Realm's inter-node communication needs serious improvement.
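For anyone who wants to check the arithmetic behind these figures, here is a minimal sketch that derives the Realm bandwidth from the NCCL measurement and the slowdown factor, then compares it against NIC peak bandwidth. The variable names are only illustrative. The 4-NIC comparison assumes the aggregate figure is measured per node, which is an assumption and may not match how the numbers were collected.

```python
# Figures quoted in the report above.
NCCL_AGGREGATE_GBPS = 37.24  # aggregate bandwidth with NCCL (GB/s)
SLOWDOWN = 16.51             # NCCL bandwidth divided by Realm bandwidth
NIC_PEAK_GBPS = 12.5         # peak bandwidth of a single IB NIC (GB/s)
NICS_PER_NODE = 4            # DGX-1V nodes in this cluster have 4 IB NICs

# Realm's aggregate bandwidth implied by the two measurements.
realm_aggregate_gbps = NCCL_AGGREGATE_GBPS / SLOWDOWN

# Fraction of a single NIC's peak that Realm reaches.
single_nic_fraction = realm_aggregate_gbps / NIC_PEAK_GBPS

# Assumption: the aggregate figure is per node, so the relevant ceiling
# is the combined peak of all four NICs.
node_peak_gbps = NIC_PEAK_GBPS * NICS_PER_NODE
realm_node_fraction = realm_aggregate_gbps / node_peak_gbps
nccl_node_fraction = NCCL_AGGREGATE_GBPS / node_peak_gbps

print(f"Realm aggregate bandwidth: {realm_aggregate_gbps:.2f} GB/s")
print(f"Realm vs. one NIC peak:    {single_nic_fraction:.1%}")
print(f"Realm vs. four NICs peak:  {realm_node_fraction:.1%}")
print(f"NCCL vs. four NICs peak:   {nccl_node_fraction:.1%}")
```

The point of the comparison is that Realm stays well below what even one NIC can deliver, so the gap is not explained by multi-rail usage alone.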

StanfordLegion/legion, issue #967