Xilinx-CNS / onload

OpenOnload high performance user-level network stack

eflatency: Optionally echo the packet in the pong reply and support VLAN tags #238

Open osresearch opened 1 month ago

osresearch commented 1 month ago

This patch adds an option for the pong node to copy the contents of the ping message into its reply. This makes the eflatency test a little more realistic, since the receiver must actually read the contents of the message rather than merely being notified that a message has arrived.
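
A minimal sketch of the echo idea, not the patch's actual code: the buffer names and the payload_off parameter are illustrative stand-ins for the ef_vi packet buffers eflatency really uses.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Pong side: copy the received payload into the reply before transmitting,
 * so the benchmark measures a receiver that reads the data, not one that
 * merely observes a completion event. */
static void echo_ping_into_pong(const uint8_t* rx_buf, size_t rx_len,
                                uint8_t* tx_buf, size_t payload_off)
{
  memcpy(tx_buf + payload_off, rx_buf + payload_off, rx_len - payload_off);
  /* ...then post tx_buf for transmission exactly as the plain pong does. */
}
```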

It also adds optional 802.1Q VLAN tagging for eflatency tests that traverse switches, making it possible to benchmark those switches as well.
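
For reference, a hedged sketch of the 802.1Q framing this option implies; the struct and macro names are illustrative, not the identifiers used in eflatency.c. The 4-byte tag (TPID 0x8100 plus the TCI) sits between the source MAC and the original EtherType.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htons() */

#define TPID_8021Q 0x8100

struct vlan_eth_hdr {
  uint8_t  dhost[6];     /* destination MAC */
  uint8_t  shost[6];     /* source MAC */
  uint16_t tpid;         /* 0x8100, network byte order */
  uint16_t tci;          /* PCP (3 bits) | DEI (1 bit) | VID (12 bits) */
  uint16_t ether_type;   /* encapsulated protocol, e.g. 0x0800 for IPv4 */
} __attribute__((packed));

static void set_vlan_tag(struct vlan_eth_hdr* eth, unsigned vid, unsigned pcp)
{
  eth->tpid = htons(TPID_8021Q);
  eth->tci  = htons((uint16_t) ((pcp << 13) | (vid & 0xfff)));
}
```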

Finally, it cleans up some of the logic by replacing magic sizes with sizeof() on the various Ethernet headers.
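
Illustrative only, as the patch's actual identifiers may differ: a hard-coded frame-header size (14, or 18 with a VLAN tag) is replaced by sizeof() on the header struct in use. The vlan_eth_hdr struct is the one sketched above; struct ethhdr comes from the system headers.

```c
#include <stddef.h>
#include <linux/if_ether.h>   /* struct ethhdr: the untagged 14-byte header */

/* Offset of the L3 payload within the frame, depending on whether the
 * frame carries an 802.1Q tag. */
static size_t l3_offset(int use_vlan)
{
  return use_vlan ? sizeof(struct vlan_eth_hdr)   /* 18 bytes */
                  : sizeof(struct ethhdr);        /* 14 bytes */
}
```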

osresearch commented 1 month ago

Thanks for the feedback on the patch. I'll make the style corrections and push an updated version.

jfeather-amd commented 1 month ago

Hi @osresearch, sorry for the delay in getting back to you on this! I have just finished performance testing this patch and found a significant enough regression that I am hesitant to merge it in its current state. I would like to think for a while longer about how to progress this PR, as I do think this would be a nice change to have! Some options to consider are:

I haven't yet thought about it for long enough to decide which of these would be most appropriate.

osresearch commented 1 month ago

Thanks for doing the performance testing on the patch, @jfeather-amd. Can you describe where the slowdowns seem to be? In the non-VLAN, non-echo, non-validating case (the default), my latency deltas were in the noise on the X2 and X3 cards, so I'm very curious about your methodology so that I can replicate the results in my own future testing.

osresearch commented 1 month ago

I've re-run the tests on the X3 cards with better isolation and with the eflatency task pinned to a single CPU. The results show no change in the min, 50%, 95% and 99% numbers, but there is an unexpected increase of about 50ns in the mean. This is caused by the unconditional memset() and checksum_udp_pkt() on the send side; these occur outside the ci_frc64_get() timing loop, which is why I had assumed they would not affect the timing. Adding if(cfg_validating)... around the packet rewriting removes the effect, as sketched below.
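
A sketch of that guard: cfg_validating, memset() and checksum_udp_pkt() are the names from the discussion above, while pkt, payload, fill and payload_len are illustrative stand-ins for the send-side buffer pieces.

```c
/* Only rewrite the payload and recompute the UDP checksum when the
 * receiver is actually going to validate the contents; otherwise leave
 * the pre-built frame untouched, as before the patch. */
if( cfg_validating ) {
  memset(payload, fill, payload_len);
  checksum_udp_pkt(pkt);
}
```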

However, this apparent regression turns out to be an artifact of the way the mean is computed: it is derived from the total wall-clock time for all packets (the delta between the two gettimeofday() calls), not from the mean of the individually measured times (rdtsc ticks). I think the mean should instead be computed as the average of the actual per-ping timings; it is surprising that the first column of results is not derived from the same data as the other columns. I've submitted #240 to compute the mean from the timings array instead of the wall-clock time.
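
A sketch of the alternative mean, under the assumption that the per-ping round-trip times are already collected in a timings array of rdtsc tick counts (as used for the percentile columns); the function and parameter names, and the ticks_per_ns conversion factor, are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Average the per-ping tick deltas and convert to nanoseconds, instead of
 * dividing the gettimeofday() wall-clock delta by the iteration count. */
static double mean_latency_ns(const uint64_t* timings, size_t n,
                              double ticks_per_ns)
{
  double sum = 0.0;
  for( size_t i = 0; i < n; ++i )
    sum += (double) timings[i];   /* same data as the percentile columns */
  return (sum / (double) n) / ticks_per_ns;
}
```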