Luminarys / synapse

Synapse BitTorrent Daemon
https://synapse-bt.org
ISC License

use pwritev and preadv for better performance on linux #239

Open milahu opened 1 year ago

milahu commented 1 year ago

See also:
https://github.com/mandreyel/cratetorrent/blob/master/cratetorrent/src/disk/io/file.rs
https://github.com/mandreyel/cratetorrent/blob/master/DESIGN.md#vectored-io

Luminarys commented 1 year ago

Interesting project, I’ll be curious to see if I can learn anything from what they’re doing.

That being said, I see adding vectored IO as complexity that doesn't provide much gain. Synapse has never (in my testing) been an IO-bound application, and unless syscall cost has gone up massively, I believe the seek/write context-switching overhead is still far less than the cost of the actual writes to disk. Of course, less overhead is always better, but the real question is whether that complexity/effort might produce a better result somewhere else in synapse. I'd guess so.

milahu commented 1 year ago

Let's keep this open as a low-priority micro-optimization.

syscalls ("Writing each block") are CPU-bound (current strategy in synapse?)
memcpy ("concatenating these buffers") is memory-bound (no?)

from https://github.com/mandreyel/cratetorrent/blob/master/DESIGN.md#vectored-io

> A peer downloads pieces in 16 KiB blocks and to save overhead of concatenating these buffers these blocks are stored in the disk write buffer as is, i.e. the write buffer is a vector of byte vectors. Writing each block to disk as a separate system call would incur tremendous overhead (relative to the other operations in most of cratetorrent), especially that context switches into kernel space have become more expensive lately to mitigate risks of speculative execution

Luminarys commented 1 year ago

> syscalls ("Writing each block") are CPU-bound (current strategy in synapse?)

You're indeed right; I was not so precise in my prior comment. When I said that synapse was not IO-bound, I meant IO-bound by disk. The typical bottleneck is network IO, not CPU/memory/disk.

> Writing each block to disk as a separate system call would incur tremendous overhead

I'm not so convinced of how tremendous this really is. I do agree that, doing some napkin math, it looks like a lot (on a fully saturated gigabit connection we'd be incurring ~6000 syscalls per second), but in practice I've yet to observe this being a bottleneck on gigabit connections.
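The napkin math can be checked directly: a 1 Gbit/s line is 125 MB/s, and dividing by the 16 KiB block size gives roughly 7600 writes per second at raw line rate, which lands near the quoted ~6000 once protocol overhead reduces effective goodput. A rough sketch of the arithmetic, not a measurement:

```rust
fn main() {
    // Napkin math: per-block syscalls per second = throughput / block size.
    let line_rate_bytes = 1_000_000_000.0_f64 / 8.0; // 1 Gbit/s = 125 MB/s
    let block_size = 16.0 * 1024.0; // 16 KiB BitTorrent block
    let per_block_syscalls = line_rate_bytes / block_size;
    // ~7629 at raw line rate; TCP/BitTorrent protocol overhead brings
    // effective goodput down toward the quoted ~6000/s figure.
    println!("{:.0} syscalls/s at line rate", per_block_syscalls);
}
```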

> Let's keep this open as a low-priority micro-optimization.

SGTM. IMO the best way to implement this in synapse would be to coalesce write buffers in the disk module (perhaps with a small heuristic scan of pending jobs in the queue), rather than trying to do some complex aggregation inside the peer piece handling code.
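One way the coalescing described here could look, as a hedged sketch: sort pending jobs by offset and merge runs of contiguous writes, so each run can then be submitted as a single `pwritev` instead of one write per job. The `WriteJob` type and run representation are hypothetical, not synapse's actual disk-module API:

```rust
// Hypothetical pending disk job: write `data` at `offset` in a file.
struct WriteJob {
    offset: u64,
    data: Vec<u8>,
}

// Coalesce queued jobs into runs of contiguous offsets; each run
// (start offset plus its list of buffers) could then become one
// pwritev call instead of one write syscall per job.
fn coalesce(mut jobs: Vec<WriteJob>) -> Vec<(u64, Vec<Vec<u8>>)> {
    jobs.sort_by_key(|j| j.offset);
    let mut runs: Vec<(u64, Vec<Vec<u8>>)> = Vec::new();
    for job in jobs {
        if let Some((start, bufs)) = runs.last_mut() {
            let end = *start + bufs.iter().map(|b| b.len() as u64).sum::<u64>();
            if end == job.offset {
                // Contiguous with the current run: extend it.
                bufs.push(job.data);
                continue;
            }
        }
        // Gap (or first job): start a new run.
        runs.push((job.offset, vec![job.data]));
    }
    runs
}

fn main() {
    let jobs = vec![
        WriteJob { offset: 16384, data: vec![1u8; 16384] },
        WriteJob { offset: 0, data: vec![0u8; 16384] },
        WriteJob { offset: 65536, data: vec![2u8; 16384] },
    ];
    let runs = coalesce(jobs);
    // Blocks at 0 and 16384 merge into one run; 65536 stays separate.
    println!("{} runs", runs.len());
}
```

A heuristic like this stays entirely inside the disk module, which matches the suggestion above: the peer code keeps handing over individual blocks, and batching is a queue-scan detail at write time.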