Before tuning for low latency at the fast-cast level, make sure the underlying OS is set up correctly:
Use a well-established Linux distro, nothing cutting edge.
Make sure the firewall, reverse path filters and other network interceptors are OFF.
Use a smallish MTU (1500 is fine; 9k or 16k is not).
Use a kernel bypass driver (e.g. OpenOnload).
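A quick way to sanity-check the MTU point from the JVM is to list the interfaces the process can see. This is only a rough sketch: the class name `MtuCheck` and the 1500 threshold constant are made up for illustration (the threshold is taken from the advice above), and it uses nothing but the standard `java.net.NetworkInterface` API.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical helper, not part of fast-cast.
final class MtuCheck {
    // Assumption: 1500 is the upper bound suggested in the post above.
    static final int MAX_RECOMMENDED_MTU = 1500;

    static boolean isRecommended(int mtu) {
        return mtu > 0 && mtu <= MAX_RECOMMENDED_MTU;
    }

    public static void main(String[] args) throws SocketException {
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) {
            System.out.println("no network interfaces found");
            return;
        }
        for (NetworkInterface ni : Collections.list(all)) {
            // Only interfaces that could carry multicast traffic are interesting here.
            if (!ni.isUp() || ni.isLoopback() || !ni.supportsMulticast()) {
                continue;
            }
            int mtu = ni.getMTU();
            String note = isRecommended(mtu) ? "" : "  <-- larger than " + MAX_RECOMMENDED_MTU;
            System.out.println(ni.getName() + " mtu=" + mtu + note);
        }
    }
}
```

The firewall and reverse path filter settings are not visible through this API and still need to be checked at the OS level.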
On the fast-cast side:
Use a small dgramSize, between 800 and the MTU size.
Turn spin-polling on (locks create too much latency). This burns a CPU.
Set pps as high as possible without running into mass retransmissions (OpenOnload with decent network hardware can sustain a pps of 50k to 100k loss-free without problems). Apply application-level throttling if needed.
If you can't set pps above 50k, at least try increasing ppsWindow to 100.
On the receive side, process packets in-thread (slow processing will cause retransmissions). Handing message processing off to a different thread (e.g. through a queue) can easily add more latency than the network stack itself. If you need to process in a separate thread, use the Disruptor, not a standard queue.
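For the "application-level throttling" part, something along these lines might do. It is a minimal sketch, not fast-cast code: `WindowedThrottle` and its constructor parameters are hypothetical, and splitting each second into a number of windows is only loosely modelled on the pps / ppsWindow settings (how fast-cast enforces them internally is an assumption on my part). It spins instead of blocking, in line with the spin-polling advice, is meant for a single sender thread, and needs Java 9+ for `Thread.onSpinWait()`.

```java
// Hypothetical single-threaded sender throttle, not part of fast-cast.
final class WindowedThrottle {
    private final long windowNanos;       // length of one window
    private final int permitsPerWindow;   // sends allowed per window
    private long windowStart;
    private int usedInWindow;

    WindowedThrottle(int ratePerSecond, int windowsPerSecond, long nowNanos) {
        if (ratePerSecond <= 0 || windowsPerSecond <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        this.windowNanos = 1_000_000_000L / windowsPerSecond;
        this.permitsPerWindow = Math.max(1, ratePerSecond / windowsPerSecond);
        this.windowStart = nowNanos;
    }

    /** Non-blocking check: returns true if one more send is allowed at nowNanos. */
    boolean tryAcquire(long nowNanos) {
        if (nowNanos - windowStart >= windowNanos) {
            // A new window starts with the first call after the old one has elapsed.
            windowStart = nowNanos;
            usedInWindow = 0;
        }
        if (usedInWindow < permitsPerWindow) {
            usedInWindow++;
            return true;
        }
        return false;
    }

    /** Busy-spins until a send is allowed; burns a CPU, avoids lock/park latency. */
    void acquire() {
        while (!tryAcquire(System.nanoTime())) {
            Thread.onSpinWait();
        }
    }

    public static void main(String[] args) {
        // 50k sends per second, checked in 100 windows of 10 ms each.
        WindowedThrottle throttle = new WindowedThrottle(50_000, 100, System.nanoTime());
        long start = System.nanoTime();
        int sent = 0;
        while (System.nanoTime() - start < 100_000_000L) { // run for about 100 ms
            throttle.acquire();
            sent++; // a real sender would hand the message to the publisher here
        }
        System.out.println("sent " + sent + " messages in ~100 ms");
    }
}
```

More windows per second give smaller bursts at the cost of more frequent checks; fewer windows allow larger bursts, which is the kind of burst that tends to trigger retransmissions.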
On the app/Java side:
Don't use serialization as the encoding: it creates garbage, which triggers GC, which in turn gives bad 95.X percentile latency. Use an allocation-free encoding/decoding, or use a pauseless VM such as Azul.
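To illustrate what "allocation-free encoding/decoding" can look like, here is a small sketch using a preallocated `ByteBuffer` and a reused mutable holder. The message layout (`instrumentId`, `price`, `quantity`) and the class `QuoteCodec` are invented for the example and have nothing to do with fast-cast's own API; the point is only that the steady-state encode/decode path creates no objects.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-layout codec; the fields are made up for illustration.
final class QuoteCodec {
    static final int ENCODED_SIZE = Long.BYTES + Double.BYTES + Integer.BYTES; // 20 bytes

    /** Mutable holder that is reused for every decoded message. */
    static final class Quote {
        long instrumentId;
        double price;
        int quantity;
    }

    /** Writes the fields into the preallocated buffer; allocates nothing. */
    static void encode(ByteBuffer out, long instrumentId, double price, int quantity) {
        out.clear();
        out.putLong(instrumentId).putDouble(price).putInt(quantity);
        out.flip(); // ready for reading/sending
    }

    /** Reads the fields into an existing Quote; allocates nothing. */
    static void decode(ByteBuffer in, Quote target) {
        target.instrumentId = in.getLong();
        target.price = in.getDouble();
        target.quantity = in.getInt();
    }

    public static void main(String[] args) {
        // Both objects are created once, up front, and then reused.
        ByteBuffer buffer = ByteBuffer.allocateDirect(ENCODED_SIZE);
        Quote quote = new Quote();
        for (int i = 0; i < 3; i++) {
            encode(buffer, 42L + i, 101.25 + i, 100 * (i + 1));
            decode(buffer, quote);
            // Printing allocates strings; it is here only to show the round trip.
            System.out.println(quote.instrumentId + " " + quote.price + " " + quote.quantity);
        }
    }
}
```

With a layout like this the receiver must know the field order and sizes in advance, which is the usual trade-off against generic serialization.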
Misc:
Pin threads (the improvement can be minor, depending on where you are starting from).
Could you give a hint on how to achieve the lowest possible latency? Thank you.