mraleph opened 2 years ago
Regarding an epoll-based solution with non-blocking FDs: this is a Linux-specific discussion. Though if we would like to increase performance specifically on Linux (e.g. because it's widely used for server apps or Android), there's also another thing to think about: the Linux kernel gained a new io_uring interface for doing high-performance async I/O - we could switch to that instead of epoll + non-blocking FDs.
@mkustermann – would we be okay limiting the minimum Linux kernel to 5.1? Linux-only would still give us Android and ~all of the cloud workloads, so it seems worth the specialization.
@kevmoo On Android there are probably devices with kernels older than 5.1 / 2019, so we couldn't require 5.1 in general. But one could keep two implementations and decide at runtime which one to pick. It's a matter of whether the gains are justified by the work + maintenance burden.
A side note regarding our APIs: in some sense we currently leak these epoll-based implementation details to the user via our APIs. At the lowest level of I/O APIs we have e.g. RawSocket, which provides a Stream<RawSocketEvent>: one listens for events like readable and then has to call read() to actually read the bytes - instead of a Future<Uint8List> read(int n) that performs a read in the background.
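For illustration, a minimal sketch of how the current readiness-based RawSocket API is consumed (host, port and payload are arbitrary; the Future-based read mentioned above is hypothetical and not part of dart:io today):

```dart
import 'dart:convert';
import 'dart:io';
import 'dart:typed_data';

Future<void> main() async {
  // Today's lowest-level API mirrors the underlying epoll readiness model:
  // we subscribe to events and then pull the bytes ourselves.
  final RawSocket socket = await RawSocket.connect('example.com', 80);
  socket.write(utf8.encode('GET / HTTP/1.0\r\nHost: example.com\r\n\r\n'));

  socket.listen((RawSocketEvent event) {
    if (event == RawSocketEvent.read) {
      // We are only told the socket is readable; read() must be called
      // explicitly to obtain the bytes (and may return null).
      final Uint8List? data = socket.read();
      if (data != null) {
        print('received ${data.length} bytes');
      }
    } else if (event == RawSocketEvent.readClosed ||
        event == RawSocketEvent.closed) {
      socket.close();
    }
  });
}
```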
Is it fixed yet? We already invest heavily in Flutter, and I would like to use Dart on the server side as a thin layer over a PostgreSQL DB. The ability to handle more TCP connections and more direct or less expensive reads/writes would help immensely.
@corporatepiyush if it were fixed, the issue would be closed. Note that socket read/write throughput should already be reasonably good for most usages - so the fact that this issue is open should not preclude you from using Dart on the server.
@mraleph would you be able to share the test code or test procedure you used to measure the performance differences you observed for the issues mentioned in this issue's description?
I was just running some simple send-receive benchmarks which we run internally - I don't remember exactly which ones. Unfortunately I don't think we have open sourced them.
libuv recently implemented io_uring support for I/O with an 8x throughput improvement. Node.js and Python's async event loops are based on this library.
Author of the libuv pull request here. Libuv uses io_uring for file system operations (open, close, stat, rename, etc.) but not network I/O, except for batching epoll_ctl system calls.
For network I/O, epoll is pretty competitive with io_uring, unless you adapt your code to io_uring's model, and even then it's not going to be the 8x speedup we got for file operations. Don't expect miracles.
Having said that, ISC is sponsoring me to work on network I/O and I'm fairly confident I can get a 40-50% boost in throughput, which is nothing to sneeze at. If Google wants to contract with me to work on Dart, please get in touch; I'd be honored.
Also, hi Slava, long time, no see! Munich 2011, I believe?
hi,
If the team decides on some major changes, then please have a look at this:
"Allows web apps to establish direct transmission control protocol (TCP) and user datagram protocol (UDP) communications with network devices and systems."
thanks
hi @bnoordhuis,
First of all, I'm much junior to you, so please ignore my mistakes.
"For network I/O, epoll is pretty competitive with io_uring"
epoll calls into user space on each readiness event, while io_uring calls into user space only on completion. Shouldn't this bring some major improvement? Or is "a 40-50% boost in throughput" already a well-tested estimate?
thanks
Apropos io_uring: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
To protect our users, we decided to limit the usage of io_uring in Google products:
ChromeOS: We disabled io_uring (while we explore new ways to sandbox it).
Android: Our seccomp-bpf filter ensures that io_uring is unreachable to apps. Future Android releases will use SELinux to limit io_uring access to a select few system processes.
GKE AutoPilot: We are investigating disabling io_uring by default.
It is disabled on production Google servers.
Yes, that was the very first bug report I received after releasing libuv's io_uring support: Android users whose apps got instakilled by the seccomp filter. =)
hi,
is this progressing, or paused?
thanks
@gintominto5329 not aware of progress on this issue; is your app a Flutter app or a plain command-line Dart app?
it's a collection of functions/classes, usually used in command-line apps, but also in an Android app
The reason I ask is to see whether switching to the native-stack package https://pub.dev/packages/cronet_http is an option for you.
HTTP is not used, only TCP sockets; the protocol and format are completely binary, to reduce space and time wastage.
The current implementation of sockets has a number of issues which lead to lower-than-possible read throughput. I have been looking at this a bit and have identified a number of possible improvements. This is by no means an exhaustive list and I have probably missed something, but I am seeing cumulative improvements in the range of 20%-50% in my experiments.
- The io_buffer.cc implementation was changed to use calloc instead of malloc, which means we are zeroing out the memory for IO buffers. This seems completely unnecessary for buffers which are going to be used for IO - in fact zeroing these buffers is redundant.
- We block SIGPROF around various IO-related syscalls (like read/write) even when the profiler is not running (or can't be running). This is visible in profiles because it itself requires syscalls. We should only block SIGPROF if the profiler is running.
- The fix for #40589 does not play well with EPOLLET epoll semantics. See the section below for more discussion of this.
- The Stream<Uint8List> based Socket interface might be causing unnecessary allocations/copying.
EPOLLET fix

The fix for #40589 introduced code into _NativeSocket.read that attempts to drain the socket of all available bytes. Unfortunately this is not how man epoll recommends doing it (the man page suggests reading until read returns fewer bytes than requested, or -EAGAIN on a non-blocking FD). This change also makes read performance worse because it issues ioctl(fd, FIONREAD) to estimate the amount of available bytes, which might be slower than simply issuing another read and checking for an -EAGAIN return code. Additionally, the BytesBuilder used in this code is copying, which means we often perform a redundant copy (via the relatively slow setRange) of the received bytes.

At the very least we should avoid the redundant copy by using BytesBuilder(copy: false), but this might not be the biggest win. In reality the whole concatenation is wasted effort: the result of the read operation is usually pushed into some Sink which is well capable of handling unconcatenated bytes. This means this logic is placed in the wrong place.
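As a rough illustration only (this is not the actual _NativeSocket.read code; nativeAvailable and nativeRead are hypothetical stand-ins for the native bindings), the drain-and-concatenate pattern described above looks roughly like this:

```dart
import 'dart:io';
import 'dart:typed_data';

// Hypothetical stand-ins for the native bindings: nativeAvailable() would
// wrap ioctl(fd, FIONREAD), nativeRead(n) would wrap read(2).
int nativeAvailable() => 0;
Uint8List? nativeRead(int count) => null;

/// Drain-style read: keep asking the kernel how many bytes are available
/// (an extra ioctl syscall per iteration) and concatenate the chunks with a
/// copying BytesBuilder - the two costs called out above.
Uint8List? drainRead() {
  final builder = BytesBuilder(); // copy: true by default => extra setRange copy
  var available = nativeAvailable();
  while (available > 0) {
    final chunk = nativeRead(available);
    if (chunk == null) break; // read would block; the socket is drained
    builder.add(chunk);
    available = nativeAvailable(); // another ioctl(FIONREAD) round trip
  }
  return builder.isEmpty ? null : builder.takeBytes();
}
```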
A better way would be to split _NativeSocket.read(...) into two functions: issueRead, called from contexts where the result is streamed (e.g. the Socket implementation), and read, which reads all available bytes. These functions could follow a more classical pattern:
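Sketching what that split might look like (a sketch only, assuming a hypothetical native binding _readInto; not the actual proposed code):

```dart
import 'dart:io';
import 'dart:typed_data';

const int _bufferSize = 64 * 1024; // illustrative read chunk size

// Hypothetical stand-in for the native binding that wraps read(2) and
// returns the number of bytes placed into [buffer] (<= 0 for EAGAIN/EOF).
int _readInto(Uint8List buffer) => 0;

/// Issues a single read of up to [_bufferSize] bytes and returns whatever the
/// OS had available, or null if the read would block. Streaming contexts
/// (the Socket implementation) can push the returned chunk straight into a
/// sink without any concatenation.
Uint8List? issueRead() {
  final buffer = Uint8List(_bufferSize);
  final n = _readInto(buffer);
  if (n <= 0) return null;
  return Uint8List.sublistView(buffer, 0, n);
}

/// Reads all currently available bytes by calling [issueRead] until the
/// socket reports it would block (or a short read signals it is drained).
Uint8List readAllAvailable() {
  final builder = BytesBuilder(copy: false); // avoid the redundant copy
  for (var chunk = issueRead(); chunk != null; chunk = issueRead()) {
    builder.add(chunk);
    if (chunk.length < _bufferSize) break; // short read: nothing left for now
  }
  return builder.takeBytes();
}
```

Streaming callers would forward each chunk from issueRead directly into their sink, while readAllAvailable concatenates lazily with BytesBuilder(copy: false) to avoid the redundant setRange copy mentioned above.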
This however might hit some issues with memory consumption, because we are allocating relatively large buffers and only a small portion of each buffer will be used for small socket messages. This will be especially bad given the current implementation of Socket_Read:

- IOBuffers are allocated using calloc, which zeroes the memory unnecessarily.
- If read returns fewer bytes than requested, Socket_Read reallocates the IOBuffer into a smaller one by allocating a new (smaller) buffer and copying the bytes from the initially allocated large buffer into it.

This brings me to another topic: lifetime management for IO buffers.
IO buffers lifetime
There are three different types of code that consumes data coming from the socket:
Some consumers are clearly in one of these three categories; some consumers are a combination of different categories, e.g. HttpParser will consume some bytes synchronously (e.g. headers), while the body will be forwarded further through HttpRequest/HttpResponse objects for the user of those objects to consume.

Different styles of consumption lead to different buffer lifetimes; e.g. it would be beneficial for the socket reading code to reuse the same buffer again and again if possible. Unfortunately the current Stream<Uint8List> based API does not allow us to optimise for this: once the socket has filled a buffer and pushed it into a sink, we have to assume that the sink's consumer might keep the buffer around indefinitely. This also makes it much harder to optimise.

It would be interesting to consider how these APIs could be redesigned to allow better control over the lifetime and reuse of the memory. This might be a prerequisite for rewriting _NativeSocket.read in a more efficient way.

I have experimented with explicitly releasing socket buffers, so that the socket can reuse them, by adding a public function which tells dart:io that the given buffer is no longer needed by the consumer:
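As a shape-only sketch of such an API (all names here, including releaseSocketBuffer, are hypothetical and not the actual experiment):

```dart
import 'dart:typed_data';

/// Hypothetical free list the socket implementation could draw from instead
/// of allocating a fresh buffer for every read.
final List<Uint8List> _freeBuffers = <Uint8List>[];

/// Tells the socket layer that the consumer is done with [buffer], so it can
/// be handed out again on the next read instead of waiting to be garbage
/// collected.
void releaseSocketBuffer(Uint8List buffer) {
  _freeBuffers.add(buffer);
}

/// Consumer side: parse the chunk synchronously, then give the buffer back.
void onData(Uint8List chunk) {
  parseHeaders(chunk);         // e.g. HttpParser-style synchronous consumption
  releaseSocketBuffer(chunk);  // promise not to touch the bytes afterwards
}

void parseHeaders(Uint8List chunk) {
  // Placeholder for synchronous consumption of the bytes.
}
```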
This had some positive impact on the throughput/latency of simple HTTP/Socket IO benchmarks. However, as-is this API looks rather user-unfriendly and probably could only be used internally in the dart:io implementation (e.g. HttpParser could release buffers it receives from the socket when we know that the HttpParser is attached to one of dart:io's native sockets).

/cc @a-siva @brianquinlan @kevmoo @mkustermann @lrhn