erpc-io / eRPC

Efficient RPCs for datacenter networks

Headroom for RDMA ROCE packets? #46

Closed theojepsen closed 3 years ago

theojepsen commented 4 years ago

Why do RoCE eRPC packets have 42 bytes of headroom? I understand how this is used for other transport types (e.g. "raw"), but for RoCE it just seems like wasted space. For example, when I send a 48-byte RPC, there's a 14-byte eRPC header plus 42 bytes of headroom, for 104 bytes in total:

[screenshot: packet capture showing the 104-byte packet]

I see that headroom is added when you compile with ROCE enabled: https://github.com/erpc-io/eRPC/blob/master/CMakeLists.txt#L163

However, it looks like this headroom is never used in the infiniband transport. The only reference to headroom is an assertion here: https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/infiniband/ib_transport.cc#L31

I commented out that assertion and recompiled with set(CONFIG_HEADROOM 0). It doesn't seem to have affected functionality. In fact, it seems to have dramatically reduced latency, especially for small payloads. This is the latency with 40 bytes of headroom:

```
$ ./scripts/do.sh 1 0
Installing modded drivers
do.sh: Launching process 1 on NUMA node 0
77:299277 WARNG: Modded driver unavailable. Performance will be low.
Process 1: Creating session to 10.0.1.98:31850.
Process 1: Session connected. Starting work.
write_size median_us 5th_us 99th_us 999th_us
32   5.1 5.0 5.9  8.1
64   5.1 5.0 5.9  8.8
128  5.2 5.1 6.0  8.7
256  5.2 5.1 6.1  8.8
512  5.4 5.2 6.2  8.8
1024 7.5 7.3 8.4 11.0
```

And after removing headroom:

```
$ ./scripts/do.sh 1 0
Installing modded drivers
do.sh: Launching process 1 on NUMA node 0
87:907285 WARNG: Modded driver unavailable. Performance will be low.
Process 1: Creating session to 10.0.1.98:31850.
Process 1: Session connected. Starting work.
write_size median_us 5th_us 99th_us 999th_us
32   4.1 4.0 5.1  7.7
64   4.6 4.5 5.6  8.8
128  4.6 4.5 5.6  8.6
256  4.7 4.6 5.7  8.2
512  4.8 4.7 5.8  8.8
1024 7.0 6.8 8.0 10.0
```

For 32-byte payloads, it reduced median latency by 1 us!

Is this a bug? Should RoCE packets have these extra headroom bytes?

anujkaliaiitd commented 4 years ago

The goal of the headroom is to allocate space for DMA-receiving the InfiniBand Global Routing Header (GRH); see Section 2.7 of https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf.

In my experience in small networks, NICs don't actually DMA-write the GRH on receiving a RoCE UD packet, but IIRC they do complain if the RECV buffer doesn't have enough space for the GRH and payload.

I think my RoCE implementation might be inefficient because it starts the TX buffer from the first pkthdr byte, whereas (I suspect) it should start from pkthdr + 40 bytes.