mraleph opened 2 years ago
Regarding an epoll-based solution with non-blocking FDs: this is a Linux-specific discussion. Though if we would like to increase performance specifically on Linux (e.g. because it's widely used for server apps or Android), there's also another thing to think about: the Linux kernel gained a new io_uring interface for doing high-performance async I/O - we could switch to that instead of epoll + non-blocking FDs.
@mkustermann – would we be okay limiting the minimum Linux kernel to 5.1? Linux-only would still give us Android and ~all of the cloud workloads, so it seems worth the specialization.
@kevmoo On Android there are probably devices with kernels older than 5.1 / 2019, so we couldn't require 5.1 in general. But one could keep two implementations and decide at runtime which one to pick. It's a matter of whether the gains are justified by the work + maintenance burden.
A side note regarding our APIs: in some sense we currently leak these epoll-based implementation details to the user via our APIs. At the lowest level of I/O APIs we have e.g. RawSocket, which provides a Stream<RawSocketEvent>: one listens for events like readable and then has to call read() to actually read the bytes - instead of a Future<Uint8List> read(int n) that performs a read in the background.
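For illustration, a minimal sketch of how the current readiness-based RawSocket API is consumed (host, port and payload are arbitrary; the Future-based read mentioned above is hypothetical and not part of dart:io today):

```dart
import 'dart:convert';
import 'dart:io';
import 'dart:typed_data';

Future<void> main() async {
  // Today's lowest-level API mirrors the underlying epoll readiness model:
  // we subscribe to events and then pull the bytes ourselves.
  final RawSocket socket = await RawSocket.connect('example.com', 80);
  socket.write(utf8.encode('GET / HTTP/1.0\r\nHost: example.com\r\n\r\n'));

  socket.listen((RawSocketEvent event) {
    if (event == RawSocketEvent.read) {
      // We are only told the socket is readable; read() must be called
      // explicitly to obtain the bytes (and may return null).
      final Uint8List? data = socket.read();
      if (data != null) {
        print('received ${data.length} bytes');
      }
    } else if (event == RawSocketEvent.readClosed ||
        event == RawSocketEvent.closed) {
      socket.close();
    }
  });
}
```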
Is it fixed yet? We already invest heavily in Flutter, and I would like to use Dart on the server side as a thin layer over a PostgreSQL DB. The ability to handle more TCP connections and more direct or less expensive reads/writes would help immensely.
@corporatepiyush if it were fixed, the issue would be closed. Note that socket read/write throughput should already be reasonably good for most usages - so the fact that this issue is open should not preclude you from using Dart on the server.
@mraleph would you be able to share the test code or test procedure you used to measure the performance differences you observed for the issues mentioned in this issue's description?
I was just running some simple send-receive benchmarks which we run internally - I don't remember exactly which ones. Unfortunately I don't think we have open sourced them.
libuv recently implemented io_uring support for I/O with an 8x throughput improvement. Node.js and Python's async event loops are based on this library.
Author of the libuv pull request here. Libuv uses io_uring for file system operations (open, close, stat, rename, etc.) but not network I/O, except for batching epoll_ctl system calls.
For network I/O, epoll is pretty competitive with io_uring, unless you adapt your code to io_uring's model, and even then it's not going to be the 8x speedup we got for file operations. Don't expect miracles.
Having said that, ISC is sponsoring me to work on network I/O and I'm fairly confident I can get a 40-50% boost in throughput, which is nothing to sneeze at. If Google wants to contract with me to work on Dart, please get in touch; I'd be honored.
Also, hi Slava, long time, no see! Munich 2011, I believe?
hi,
If the team decides on some major changes, then please have a look at this:
"Allows web apps to establish direct transmission control protocol (TCP) and user datagram protocol (UDP) communications with network devices and systems."
thanks
hi @bnoordhuis,
First of all, I'm much junior to you, so please ignore my mistakes.
"For network I/O, epoll is pretty competitive with io_uring"
epoll calls into user space on each readiness event, while io_uring calls into user space only on completion. Shouldn't this bring some major improvement? Or is "a 40-50% boost in throughput" already a well-tested estimate?
thanks
Apropos io_uring: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
To protect our users, we decided to limit the usage of io_uring in Google products:
ChromeOS: We disabled io_uring (while we explore new ways to sandbox it).
Android: Our seccomp-bpf filter ensures that io_uring is unreachable to apps. Future Android releases will use SELinux to limit io_uring access to a select few system processes.
GKE AutoPilot: We are investigating disabling io_uring by default.
It is disabled on production Google servers.
Yes, that was the very first bug report I received after releasing libuv's io_uring support: Android users whose apps got instakilled by the seccomp filter. =)
hi,
is this progressing, or paused?
thanks
@gintominto5329 not aware of progress on this issue; is your app a Flutter app or a plain command-line Dart app?
it's a collection of functions/classes, usually used in command-line apps, but also in an Android app
The reason I ask is to see whether switching to the native-stack package https://pub.dev/packages/cronet_http is an option for you.
HTTP is not used, only TCP sockets; the protocol and format are completely binary, to reduce space and time wastage.
The current implementation of sockets has a number of issues which lead to lower-than-possible read throughput. I have been looking at this a bit and have identified a number of possible improvements. This is by no means an exhaustive list and I have probably missed something, but I am seeing cumulative improvements in the range of 20%-50% in my experiments.
- The io_buffer.cc implementation was changed to use calloc instead of malloc, which means we are zeroing out the memory for IO buffers. This seems completely unnecessary for buffers which are going to be used for IO - in fact zeroing these buffers is redundant.
- We block SIGPROF around various IO-related syscalls (like read/write) even when the profiler is not running (or can't be running). This is visible in profiles because it itself requires syscalls. We should only block SIGPROF if the profiler is running.
- The fix for #40589 does not play well with EPOLLET epoll semantics. See the section below for more discussion of this.
- The Stream<Uint8List> based Socket interface might be causing unnecessary allocations/copying.
EPOLLET fix

The fix for #40589 introduced code into _NativeSocket.read that attempts to drain the socket of all available bytes. Unfortunately this is not how man epoll recommends doing it (the man page suggests reading until read returns fewer bytes than requested, or -EAGAIN on a non-blocking FD). This change also makes read performance worse because it issues ioctl(fd, FIONREAD) to estimate the amount of available bytes, which might be slower than simply issuing another read and checking for an -EAGAIN return code. Additionally, the BytesBuilder used in this code is copying, which means we often perform a redundant copy (via the relatively slow setRange) of the received bytes.

At the very least we should avoid the redundant copy by using BytesBuilder(copy: false), but this might not be the biggest win. In reality the whole concatenation is wasted effort: the result of the read operation is usually pushed into some Sink which is well capable of handling unconcatenated bytes. This means this logic is placed in the wrong place.
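As a rough illustration only (this is not the actual _NativeSocket.read code; nativeAvailable and nativeRead are hypothetical stand-ins for the native bindings), the drain-and-concatenate pattern described above looks roughly like this:

```dart
import 'dart:io';
import 'dart:typed_data';

// Hypothetical stand-ins for the native bindings: nativeAvailable() would
// wrap ioctl(fd, FIONREAD), nativeRead(n) would wrap read(2).
int nativeAvailable() => 0;
Uint8List? nativeRead(int count) => null;

/// Drain-style read: keep asking the kernel how many bytes are available
/// (an extra ioctl syscall per iteration) and concatenate the chunks with a
/// copying BytesBuilder - the two costs called out above.
Uint8List? drainRead() {
  final builder = BytesBuilder(); // copy: true by default => extra setRange copy
  var available = nativeAvailable();
  while (available > 0) {
    final chunk = nativeRead(available);
    if (chunk == null) break; // read would block; the socket is drained
    builder.add(chunk);
    available = nativeAvailable(); // another ioctl(FIONREAD) round trip
  }
  return builder.isEmpty ? null : builder.takeBytes();
}
```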
A better way would be to split _NativeSocket.read(...) into two functions: issueRead, called from contexts where the result is streamed (e.g. the Socket implementation), and read, which reads all available bytes. These functions could follow a more classical pattern:
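Sketching what that split might look like (a sketch only, assuming a hypothetical native binding _readInto; not the actual proposed code):

```dart
import 'dart:io';
import 'dart:typed_data';

const int _bufferSize = 64 * 1024; // illustrative read chunk size

// Hypothetical stand-in for the native binding that wraps read(2) and
// returns the number of bytes placed into [buffer] (<= 0 for EAGAIN/EOF).
int _readInto(Uint8List buffer) => 0;

/// Issues a single read of up to [_bufferSize] bytes and returns whatever the
/// OS had available, or null if the read would block. Streaming contexts
/// (the Socket implementation) can push the returned chunk straight into a
/// sink without any concatenation.
Uint8List? issueRead() {
  final buffer = Uint8List(_bufferSize);
  final n = _readInto(buffer);
  if (n <= 0) return null;
  return Uint8List.sublistView(buffer, 0, n);
}

/// Reads all currently available bytes by calling [issueRead] until the
/// socket reports it would block (or a short read signals it is drained).
Uint8List readAllAvailable() {
  final builder = BytesBuilder(copy: false); // avoid the redundant copy
  for (var chunk = issueRead(); chunk != null; chunk = issueRead()) {
    builder.add(chunk);
    if (chunk.length < _bufferSize) break; // short read: nothing left for now
  }
  return builder.takeBytes();
}
```

Streaming callers would forward each chunk from issueRead directly into their sink, while readAllAvailable concatenates lazily with BytesBuilder(copy: false) to avoid the redundant setRange copy mentioned above.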
This however might hit some issues with memory consumption, because we are allocating relatively large buffers and only a small portion of each buffer will be used for small socket messages. This will be especially bad given the current implementation of Socket_Read:

- IOBuffers are allocated using calloc, which zeroes the memory unnecessarily.
- If read returns fewer bytes than requested, Socket_Read reallocates the IOBuffer into a smaller one by allocating a new (smaller) buffer and copying the bytes from the initially allocated large buffer into it.

This brings me to another topic: lifetime management for IO buffers.
IO buffers lifetime
There are three different types of code that consumes data coming from the socket:
Some consumers are clearly in one of these three categories; some consumers are a combination of different categories, e.g. HttpParser will consume some bytes synchronously (e.g. headers), while the body will be forwarded further through HttpRequest/HttpResponse objects for the user of those objects to consume.

Different styles of consumption lead to different buffer lifetimes; e.g. it would be beneficial for the socket reading code to reuse the same buffer again and again if possible. Unfortunately the current Stream<Uint8List> based API does not allow us to optimise for this: once the socket has filled a buffer and pushed it into a sink, we have to assume that the sink's consumer might keep the buffer around indefinitely. This also makes it much harder to optimise.

It would be interesting to consider how these APIs could be redesigned to allow better control over the lifetime and reuse of the memory. This might be a prerequisite for rewriting _NativeSocket.read in a more efficient way.

I have experimented with explicitly releasing socket buffers, so that the socket can reuse them, by adding a public function which tells dart:io that the given buffer is no longer needed by the consumer:
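As a shape-only sketch of such an API (all names here, including releaseSocketBuffer, are hypothetical and not the actual experiment):

```dart
import 'dart:typed_data';

/// Hypothetical free list the socket implementation could draw from instead
/// of allocating a fresh buffer for every read.
final List<Uint8List> _freeBuffers = <Uint8List>[];

/// Tells the socket layer that the consumer is done with [buffer], so it can
/// be handed out again on the next read instead of waiting to be garbage
/// collected.
void releaseSocketBuffer(Uint8List buffer) {
  _freeBuffers.add(buffer);
}

/// Consumer side: parse the chunk synchronously, then give the buffer back.
void onData(Uint8List chunk) {
  parseHeaders(chunk);         // e.g. HttpParser-style synchronous consumption
  releaseSocketBuffer(chunk);  // promise not to touch the bytes afterwards
}

void parseHeaders(Uint8List chunk) {
  // Placeholder for synchronous consumption of the bytes.
}
```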
This had some positive impact on the throughput/latency of simple HTTP/Socket IO benchmarks. However, as-is this API looks rather user-unfriendly and probably could only be used internally in the dart:io implementation (e.g. HttpParser could release buffers it receives from the socket when we know that the HttpParser is attached to one of dart:io's native sockets).

/cc @a-siva @brianquinlan @kevmoo @mkustermann @lrhn