Popping from the submission queue checks the deque length, and pushing into the io_uring buffer checks whether it is full.
io_uring allows pushing multiple entries at once (push_multiple) to amortize the is_full check. VecDeque is trickier since it can be non-contiguous and does not provide a range-slice interface.
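To make the non-contiguity concrete, here is a small std-only sketch of VecDeque::as_slices: the ring buffer is exposed as two slices, and chaining them reproduces logical FIFO order regardless of where the head sits. The helper name logical_order is made up for illustration.

```rust
use std::collections::VecDeque;

// Concatenate both halves of a VecDeque's ring buffer in logical order.
// When the deque is contiguous, the second slice is empty and the first
// slice alone covers all elements.
fn logical_order(deque: &VecDeque<u32>) -> Vec<u32> {
    let (front, back) = deque.as_slices();
    front.iter().chain(back).copied().collect()
}

fn main() {
    let mut deque: VecDeque<u32> = VecDeque::new();
    deque.extend([1, 2, 3]);
    deque.pop_front(); // moves the head; later pushes may wrap
    deque.extend([4, 5]);
    let (front, back) = deque.as_slices();
    // Whether `back` is empty depends on capacity growth, but the
    // concatenation always reproduces logical FIFO order.
    assert_eq!(logical_order(&deque), vec![2, 3, 4, 5]);
    println!("front={front:?} back={back:?}");
}
```

The point is that a batched submit has at most two slices to hand over, never more.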
We could optimize the case where squeue.len() <= io_queue.len().
If the VecDeque is contiguous, i.e. deque.as_slices() returns a non-empty first slice and an empty second slice, we can issue a single io_uring::push_multiple with the non-empty slice.
When the VecDeque is not contiguous, we can do two push_multiple operations: first with the second slice (because we queue operations from the back), then with the first slice.
After pushing we can clear the deque, since we only consider the squeue.len() <= io_queue.len() case.
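The whole batching path can be sketched with a stub queue; the real io-uring crate's SubmissionQueue::push_multiple is unsafe and takes squeue::Entry values, so StubSqueue here (with u64 standing in for entries, and a hypothetical remaining() capacity check) only models the slice ordering and the final clear:

```rust
use std::collections::VecDeque;

/// Stub standing in for io_uring's submission queue. The capacity
/// check happens once per batch instead of once per entry.
struct StubSqueue {
    entries: Vec<u64>, // stand-in for squeue::Entry
    capacity: usize,
}

impl StubSqueue {
    fn remaining(&self) -> usize {
        self.capacity - self.entries.len()
    }
    fn push_multiple(&mut self, batch: &[u64]) -> Result<(), ()> {
        if batch.len() > self.remaining() {
            return Err(());
        }
        self.entries.extend_from_slice(batch);
        Ok(())
    }
}

/// Drain the io_work deque into the submission queue in at most two
/// push_multiple calls, assuming the whole deque fits.
fn submit_batched(deque: &mut VecDeque<u64>, sq: &mut StubSqueue) -> Result<(), ()> {
    if deque.len() > sq.remaining() {
        return Err(());
    }
    let (first, second) = deque.as_slices();
    // Operations are queued from the back, so the second slice goes first;
    // when the deque is contiguous, `second` is empty and this is a no-op.
    sq.push_multiple(second)?;
    sq.push_multiple(first)?;
    deque.clear();
    Ok(())
}

fn main() {
    let mut deque: VecDeque<u64> = VecDeque::from(vec![10, 20, 30]);
    let mut sq = StubSqueue { entries: Vec::new(), capacity: 8 };
    submit_batched(&mut deque, &mut sq).unwrap();
    assert!(deque.is_empty());
    assert_eq!(sq.entries.len(), 3);
}
```

This keeps the per-entry hot path free of both the deque length check and the is_full check.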
I think the latency win from this optimization is much smaller than the duration of the syscall itself. But batching optimizes for throughput: the goal is to reduce driver CPU overhead so that more useful application work fits on a fully utilized CPU core.
We could document this optimization and recommend that users design their applications so that the scheduled io_work is bounded by the configured number of submission queue entries.
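What that recommendation could look like at the driver boundary, as a hypothetical guard (the names IoScheduler and sq_entries are made up, not part of any existing API):

```rust
use std::collections::VecDeque;

/// Hypothetical driver-side guard: refuse to schedule more io_work
/// than the configured number of submission queue entries, so a full
/// batch always fits into a single submit.
struct IoScheduler {
    pending: VecDeque<u64>,
    sq_entries: usize, // configured submission queue size
}

impl IoScheduler {
    fn schedule(&mut self, op: u64) -> Result<(), u64> {
        if self.pending.len() >= self.sq_entries {
            // Caller must back off or wait for in-flight completions.
            return Err(op);
        }
        self.pending.push_back(op);
        Ok(())
    }
}

fn main() {
    let mut sched = IoScheduler { pending: VecDeque::new(), sq_entries: 2 };
    assert!(sched.schedule(1).is_ok());
    assert!(sched.schedule(2).is_ok());
    assert!(sched.schedule(3).is_err()); // bound reached, op handed back
}
```

With the bound in place, the squeue.len() <= io_queue.len() precondition above holds by construction.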