What we need:
some basic framing (where does a message begin and end, how many messages are there)
some notification / wake / wait mechanism
Transfer:
memfd (just a nice way to hand over shmem as a file descriptor - but is it also fast compared to the POSIX shm functions?)
shmem (create region, attach region, remove region - very simple; has been around for a long time via POSIX)
some Rust channel implementation (handles message boundaries - you directly get the Rust struct out on the other end, but we only send a u8 array / Vec anyway)
ringbuffer (nice case if the entries can be pointers; if the entries have to be fixed-size, a protocol is needed for framing and marking the begin and end of messages)
the standard Rust way using a shared state struct gated by Arc<Mutex<>>, or an array of mutexes like Arc<Vec<Mutex<Vec<u8>>>> where the sender holds all entries locked and the receiver waits on the first lock until the sender frees it when a message is ready? Lots of locking / unlocking.
Notification:
mutex - aren't all mutex implementations backed by futexes on Linux these days?
futex - fast userspace mutex. Rust uses futexes for its Mutex implementation on Linux anyway.
semaphore - "one of the fastest paths in the kernel; everything except sleep is userspace; leaner than anything related to IO, which is complex". Nice, but a semaphore is also just a mutex + counter + special rules about which thread it wakes up (which we don't need).
Solutions/modules:
check out rtrb -> chunks module for sending/receiving multiple items at once efficiently.
npnc has multiple versions - bounded and unbounded, and also an MPMC version - all in one library.
but both cannot wait/block -> need crossbeam for that, which can also block/select on multiple channels.
currently we need to spin-wait and yield etc. - not a good solution
Ringbuffers:
https://crates.io/crates/shmem-ipc looks damn fast. It uses shared memory (much, much faster than any socket or pipe), the recent Linux memfd (shared-memory handover in the Linux kernel), and Linux eventfd (for non-spinning wait without requiring async-std or tokio, though it can still be used with these if needed). These primitives have C bindings, i.e. they are normal Linux kernel primitives and not Rust-specific - so it is possible to hand over the component thread to a C-ABI library and allow direct sending and receiving from the flowd main thread to the C-ABI component, which can understand memfd and eventfd; no component API is required this way. Looks like a sweet spot (shmem and eventfd).
rtrb has been the fastest so far in benchmarks
TODO is there a standard shmem message/frame transfer protocol that has bindings in many languages?
TODO use d-bus?
performance vs. shmem-ipc: D-Bus is about 15-17 times slower. We can still have a D-Bus adapter component, but using it as the backbone seems like a huge waste.
Channel-like:
flume - ergonomic, easy to implement monitoring and tracing. But... we can have a "monitor" component to monitor at specific points anyway. Rust-specific. In tests not faster than crossbeam-channel.
crossbeam-channel: known for performance, but in tests not faster than flume.
Kanal: faster than flume, crossbeam-channel and mpsc. As fast as rtrb + Condvar!
io_uring:
is only between the kernel and a userspace thread, not for communication between threads.
semaphore:
this is mutex + counter + condvar with some additional rules about which thread gets notified. Useful for managing limited resources, with multiple threads each taking out a resource.
We don't need this.
condvar + Mutex:
this is mutex + bool + condvar for "has this resource changed?", "are we done yet?", "got mail?" types of signals. Without the condvar, we would not know whether the mutex got unlocked with or without the signal being intended/fired.
Yes, that is what we need. (Or we ask the VecDeque whether it has new data - but that is surely more expensive.)
-> Fast for the Rust-Rust case and for the C-based case (shmem-ipc); for the others there will be various component APIs (stateful, stateless etc.) which require a translator component that translates from shmem-ipc to JavaScript buffers or whatever.
Winner of performance tests:
for message transfer: rtrb, even without chunked read/write; a tiny bit faster than npnc bounded SPSC (the MPMC version is much more expensive); very close behind (about 3%): the standard collection VecDeque (which, on the other hand, allows MPSC).
for notification: Condvar - just a mutex + bool; even faster than eventfd.
Everything else was far behind (npnc MPMC channels, mpsc, shmem-ipc, flume, Arc over an array of u8 arrays, Arc over a Vec of Vecs, even eventfd).
Round 2:
for message transfer: kanal
for notification: (included in kanal)
Round 3:
Kanal is still slower than thread::park() and unpark()
for message transfer: rtrb (which also offers chunk-wise dequeueing)
for notification: thread::park() and Thread::unpark()