dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

async sockets in linux (windows is fine) use 5-8x expected cpu vs synchronous sockets or comparables in other languages #72153

Open andrewaggb opened 2 years ago

andrewaggb commented 2 years ago

I'm finding that any form of async tcp or unix socket in c#/.net 6 is using 5-8x the cpu usage compared to using synchronous sockets in c# or doing essentially the same thing in python3, nodejs, or netcat in listen mode. Async sockets in c# on windows perform fine. I've tried several different forms of async with sockets in c# and they all seem to behave the same.

In my test scenario I'm saturating a gigabit ethernet link via netcat because it's very easy to reproduce. My production scenario is a video surveillance application receiving many simultaneous hd video feeds.

I have experienced this issue in my production application with and without docker, on physical machines and in hyper-v virtual machines, on centos 7.9, rocky 8.6, and ubuntu 20.04.

My examples below are all ubuntu 20.04 with dotnet 6 installed on the host, running in a hyper-v virtual machine on windows 11 with 4 hyper-v cpus on a ryzen 5600g.

I'm using cpu percentages from the top command, and these are all processing the same gigabit traffic (no example here is processing more or less MB/s - they are all saturating the full gigabit link, verified with application counters and tools like iftop).

Using a standard TcpListener -> NetworkStream with sync reads and a ~1MB buffer: ~6-8% cpu usage. I have no problem with this, and as you'll see below this result is consistent with other tools and programming languages in my test environment.

```csharp
byte[] buffer = new byte[1048576];
while ((nBytes = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    totalBytes += nBytes;
}
```

Using a standard TcpListener -> NetworkStream with async reads and a ~1MB buffer, where the only difference is await ReadAsync vs Read: ~50-60% cpu usage, 5-8x as much cpu as the synchronous version.

```csharp
byte[] buffer = new byte[1048576];
while ((nBytes = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    totalBytes += nBytes;
}
```

I have also tried socket.BeginReceive, with approximately the same 50-60% cpu usage. I have also tried socket.ReceiveAsync with SocketAsyncEventArgs, with approximately the same 50-60% cpu usage; roughly the shape sketched below.
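
A simplified single-connection version of the ReceiveAsync + SocketAsyncEventArgs loop (illustrative sketch, not the exact code from the attached repro):

```csharp
using System.Net;
using System.Net.Sockets;

// Simplified single-connection sketch; names and structure are illustrative.
long totalBytes = 0;
using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listener.Bind(new IPEndPoint(IPAddress.Any, 9989));
listener.Listen(1);
Socket socket = listener.Accept();

var args = new SocketAsyncEventArgs();
args.SetBuffer(new byte[1048576], 0, 1048576);
args.Completed += OnCompleted;

if (!socket.ReceiveAsync(args))
    OnCompleted(socket, args); // first receive completed synchronously

Console.ReadLine(); // keep the process alive; completions run on pool threads

void OnCompleted(object? sender, SocketAsyncEventArgs e)
{
    do
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
            return; // error or connection closed
        totalBytes += e.BytesTransferred;
    }
    while (!socket.ReceiveAsync(e)); // false => completed synchronously, loop again
}
```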

I have also tried unix sockets in c#/.net 6. With unix sockets the cpu usage for a synchronous read loop is around 6% of cpu (vs 6-8% for tcp sockets). I piped a netcat receiver into the unix socket to ensure the same amount of data was being processed.

With unix sockets the cpu usage for an asynchronous read loop is around 50-60% of cpu (~ the same as async tcp sockets). I piped a netcat receiver into the unix socket to ensure the same amount of data was being processed.
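
For reference, the unix-socket variant is the same read loop bound to a UnixDomainSocketEndPoint, roughly like this (illustrative sketch; the socket path is a placeholder, not the attached code):

```csharp
using System.IO;
using System.Net.Sockets;

// Illustrative sketch; /tmp/perftest.sock is a placeholder path.
const string path = "/tmp/perftest.sock";
File.Delete(path); // Bind throws if the socket file already exists

using var listener = new Socket(AddressFamily.Unix, SocketType.Stream, ProtocolType.Unspecified);
listener.Bind(new UnixDomainSocketEndPoint(path));
listener.Listen(1);

using Socket socket = listener.Accept();
long totalBytes = 0;
byte[] buffer = new byte[1048576];
int nBytes;
// Same loop as above; swap Receive for an awaited ReceiveAsync to get the async case.
while ((nBytes = socket.Receive(buffer)) > 0)
{
    totalBytes += nBytes;
}
```

The tcp test's netcat sender can be mirrored with something like `nc -l 9989 | nc -U /tmp/perftest.sock` (assuming a netcat build with unix-socket support).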

### I ran some comparables

using netcat as a listener (`nc -l 9989 >/dev/null`) uses about 8% cpu (same as synchronous sockets in c#)

using nodejs and the socket.on('data') callback uses about 10% cpu (slightly more than synchronous c#, but 5-6x less than async c#, and this is async and not re-using a receive buffer, so it's less efficient)

using python3 with a synchronous tcp socket receiving into a buffer (very comparable to my c# synchronous example) uses about 8% cpu (same as synchronous sockets in c#)

using golang with a goroutine per connection receiving into a buffer (comparable to my c# synchronous example) uses about 14% cpu (approximately double synchronous sockets in c#)

My basic test methodology: from another machine, run netcat to generate a full gigabit/s of ethernet traffic with `cat /dev/zero | nc 192.168.1.93 9989`.

On the destination machine 192.168.1.93 (the ubuntu 20.04 vm all the tests were run on):

High CPU usage:

```
dotnet run --configuration Release PerfTestTCPListenerAsync
```

Low CPU usage:

```
dotnet run --configuration Release PerfTestTCPListener
nc -l 9989 >/dev/null
nodejs jsserver.js
python3 pyserver.py
```

I've included the most basic c# example Program.cs using the exact same code with Read vs ReadAsync, a very basic python example pyserver.py, and a very basic nodejs example jsserver.js. All are configured to listen on port 9989, so the same netcat client command can be run against any of the 5 examples.
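
For anyone who doesn't want to download the zip, the listener around the read loops above looks approximately like this (a reconstruction of the shape of Program.cs, not the exact attached file):

```csharp
using System.Net;
using System.Net.Sockets;

// Approximate shape of the attached Program.cs (sync variant shown; the async
// variant only swaps stream.Read for await stream.ReadAsync).
var listener = new TcpListener(IPAddress.Any, 9989);
listener.Start();
using TcpClient client = listener.AcceptTcpClient();
using NetworkStream stream = client.GetStream();

long totalBytes = 0;
byte[] buffer = new byte[1048576];
int nBytes;
while ((nBytes = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    totalBytes += nBytes;
}
Console.WriteLine($"Received {totalBytes} bytes");
```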

I've been monitoring the cpu usage with top and network traffic with iftop.

[data.zip](https://github.com/dotnet/core/files/9106126/data.zip)

Async version, dotnet-counters monitor (I'm not sure if the cpu usage here counts all cores, but the top value shows 52% for this async socket process):

```
[System.Runtime]
    % Time in GC since last GC (%)                          0
    Allocation Rate (B / 1 sec)                       244,648
    CPU Usage (%)                                          16
    Exception Count (Count / 1 sec)                         0
    GC Committed Bytes (MB)                                 9
    GC Fragmentation (%)                                0.233
    GC Heap Size (MB)                                       6
    Gen 0 GC Count (Count / 1 sec)                          0
    Gen 0 Size (B)                                         24
    Gen 1 GC Count (Count / 1 sec)                          0
    Gen 1 Size (B)                                    245,448
    Gen 2 GC Count (Count / 1 sec)                          0
    Gen 2 Size (B)                                         24
    IL Bytes Jitted (B)                                48,860
    LOH Size (B)                                    1,048,656
    Monitor Lock Contention Count (Count / 1 sec)           0
    Number of Active Timers                                 0
    Number of Assemblies Loaded                            18
    Number of Methods Jitted                              392
    POH (Pinned Object Heap) Size (B)                  39,976
    ThreadPool Completed Work Item Count (Count / 1 sec) 6,132
    ThreadPool Queue Length                                 0
    ThreadPool Thread Count                                 5
    Time spent in JIT (ms / 1 sec)                          0
    Working Set (MB)                                       47
```

Synchronous version (top shows cpu usage as around 6%):

```
[System.Runtime]
    % Time in GC since last GC (%)                          0
    Allocation Rate (B / 1 sec)                        32,672
    CPU Usage (%)                                           0
    Exception Count (Count / 1 sec)                         0
    GC Committed Bytes (MB)                                26
    GC Fragmentation (%)                                1.105
    GC Heap Size (MB)                                      25
    Gen 0 GC Count (Count / 1 sec)                          0
    Gen 0 Size (B)                                         24
    Gen 1 GC Count (Count / 1 sec)                          0
    Gen 1 Size (B)                                  7,806,528
    Gen 2 GC Count (Count / 1 sec)                          0
    Gen 2 Size (B)                                 12,179,944
    IL Bytes Jitted (B)                               295,457
    LOH Size (B)                                      287,848
    Monitor Lock Contention Count (Count / 1 sec)           0
    Number of Active Timers                                 4
    Number of Assemblies Loaded                           121
    Number of Methods Jitted                            3,737
    POH (Pinned Object Heap) Size (B)                 195,360
    ThreadPool Completed Work Item Count (Count / 1 sec)    0
    ThreadPool Queue Length                                 0
    ThreadPool Thread Count                                 4
    Time spent in JIT (ms / 1 sec)                          0
    Working Set (MB)                                      140
```

Linux Traces: I've attached 20-second dotnet-trace results from both Synchronous and Asynchronous on Linux. It appears to me all (or most) of the extra time is spent in LowLevelLifoSemaphore, but I'm not sure if that's expected.

[Traces.zip](https://github.com/dotnet/core/files/9106253/Traces.zip)

Versus Windows (0.1% cpu in async mode, 12700k): ![image](https://user-images.githubusercontent.com/1630971/178843125-d2c47a9f-e7cc-40d7-94b4-0ff6ae614b33.png)

[TraceAsyncWindows.zip](https://github.com/dotnet/core/files/9106362/TraceAsyncWindows.zip) Of course the windows trace looks very different as it uses IOCompletion.

I ran some tests using a similarly spec'd windows 10 hyper-v VM and got pretty lousy numbers. Both synchronous and async used about 50% of the cpu (4 hyper-v cpus on a 5700g) and only got up to about 600mbps. While that doesn't invalidate the enormous synchronous-vs-asynchronous performance discrepancy I observed on linux, it may invalidate the windows-vs-linux argument. These aren't heavily loaded hypervisors, so I'm surprised, to be honest. Python3 got 35% cpu usage and 800mbps on the same windows 10 machine.

Using a 5700g with windows 11 (not in a hyper-v vm) shows similar 0.3%-ish cpu usage and full gigabit traffic for .net 6 async, and slightly higher cpu usage for python3 (0.4%-ish) at full gigabit traffic.

Using an intel 12400 (non-virtualized) with rocky linux 8.5, I saw 40% cpu usage for async and 1% for synchronous on a 100mbit interface, and 47% async vs 1% synchronous on a gigabit interface on the same machine.

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Author: andrewaggb
Assignees: -
Labels: `area-System.Net`, `untriaged`
Milestone: -
davidfowl commented 2 years ago

Can you try setting the environment variable `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS` to `1` and re-running your test?

andrewaggb commented 2 years ago

> Can you try setting the environment variable `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS` to `1` and re-running your test?

I tried setting `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1` and running the async example on my intel 12400 bare-metal machine running rocky linux 8.5. The async cpu usage went from 47% cpu to 2% cpu. So that's pretty huge and puts the performance right where I'd expect.

I also tried the ubuntu VM that I did my earlier testing in and the cpu usage dropped from about 52% to 7% which is comparable to the synchronous results and better than nodejs.

It sounds like that flag is experimental and potentially unstable according to https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/, I'm guessing that's still true for .net 6?

andrewaggb commented 2 years ago

I tried my actual application with the flag enabled, and while I do see a significant drop in cpu usage for the application as a whole (4-5x), it seems to be unstable as-is, with some video feeds reconnecting and whatnot, so I'd have to experiment some more and see what works and what doesn't. It's too early to say without further investigation what the actual cpu reduction will be, as some feeds might not be getting processed, in which case some of that reduction is because it's doing less.

wfurt commented 2 years ago

triage: we should investigate and perhaps get the INLINE_COMPLETIONS mode to a usable state.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Author: andrewaggb
Assignees: -
Labels: `area-System.Net`, `area-System.Net.Sockets`
Milestone: -
davidfowl commented 2 years ago

> or doing essentially the same thing in python3, nodejs, or netcat in listen mode

I'd be interested in the results using golang if you have time to test that. Python3 and nodejs are both single threaded.

andrewaggb commented 2 years ago

> or doing essentially the same thing in python3, nodejs, or netcat in listen mode
>
> I'd be interested in the results using golang if you have time to test that. Python3 and nodejs are both single threaded.

I put a simple golang example together (attached). I don't believe it's equivalent to the c# async/await, as it runs a goroutine per connection, so it's probably more akin to synchronous sockets in c# with a thread per client. I looked a bit for an async go socket example and didn't find anything mainstream, but I might not be searching for the right thing.

On the latest stable go version (go1.18.4 linux/amd64) on the same ubuntu VM, I observed 14-15% cpu usage for an example comparable to synchronous c# or synchronous python3 (code attached). Synchronous sockets in c# were ~6-8% cpu, async c# was ~52% cpu, and async c# with DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 was ~7% cpu. In all cases I'm using default socket settings (not adjusting buffers, etc). It looks to me like go uses twice the cpu (with the exception of async c# on linux, which was much higher) using the default settings.

GoServer.zip

I know the expectation in c# is to use async/await for I/O operations, and on linux the performance appears to be much worse, so I personally think it should be prioritized. It makes me curious what kestrel is doing socket-wise, as I see web benchmarks that appear to run just fine on linux with many simultaneous connections and presumably pretty high throughput as it scales.

stephentoub commented 2 years ago

We'd need traces to confirm, but I expect a significant portion of the extra CPU is due to the thread pool looking for work and not finding any, in which case these cycles are basically spare that wouldn't otherwise be used (by this process). It'd be interesting to change your test around to be one of scale / load, maxing out CPU in all the environments, and then using a different measure of perf to compare, e.g. overall throughput of the system. I'm in no way saying this isn't an issue we should improve in the future (we want to), but rather that it may not actually be as impactful as the repro suggests.

andrewaggb commented 2 years ago

> We'd need traces to confirm, but I expect a significant portion of the extra CPU is due to the thread pool looking for work and not finding any, in which case these cycles are basically spare that wouldn't otherwise be used (by this process). It'd be interesting to change your test around to be one of scale / load, maxing out CPU in all the environments, and then using a different measure of perf to compare, e.g. overall throughput of the system. I'm in no way saying this isn't an issue we should improve in the future (we want to), but rather that it may not actually be as impactful as the repro suggests.

I suspect most people aren't getting anywhere near 5-8x higher cpu, but I have two very different applications, video surveillance processing (~50 concurrent video feeds in the 0.5-8 Mbps range) and electric motor power monitoring (sort of a network oscilloscope), that both show higher cpu than expected. As many workloads are in VMs/docker/cloud etc. and essentially everything is networked these days, I think it could add up to substantial cpu cycles even if my scenarios are closer to the worst case.

I'm going to try modifying my applications to use synchronous sockets in the short term and see how things work out as I don't need massive scalability (thread per client is fine).
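
Concretely, the short-term model I have in mind is a blocking read loop on a dedicated thread per accepted connection, roughly like this (sketch only, not my production code; the processing step is a placeholder):

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading;

// Sketch of the thread-per-connection model with synchronous receives.
var listener = new TcpListener(IPAddress.Any, 9989);
listener.Start();
while (true)
{
    TcpClient client = listener.AcceptTcpClient();
    new Thread(() =>
    {
        using (client)
        using (NetworkStream stream = client.GetStream())
        {
            byte[] buffer = new byte[1048576];
            int nBytes;
            while ((nBytes = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // hand nBytes of buffer to the application here (placeholder for real processing)
            }
        }
    }) { IsBackground = true }.Start();
}
```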

I did attach some traces in my original post if you are interested.

andrewaggb commented 2 years ago

I replaced the async socket receives with synchronous receives and a new thread per client in my video application and updated a few instances to review the difference.

Summary: 2.7x-6x lower cpu usage for the entire application across the 3 instances I updated. For me it's a no-brainer to switch over to separate threads plus synchronous sockets. I've been moving a few hundred systems from an nginx+rtmp proxy to an integrated c# mpegts receiver (the rest of the app was c#/mvc/websockets/etc) and noticed much higher cpu usage. This change brings the cpu usage down to about where I expected it to be.

I'm not sure if async await on sockets in httpclient, mvc websockets/signalr, npgsql, etc are also affected but I have to think any app using async sockets might be affected to some degree or another.

~27% cpu to ~10% cpu processing ~80mbps across 72 connections. Core i5 7500. Running in docker on Rocky Linux 8.6 (bare metal).

~30% cpu to ~5% cpu processing ~34mbps across 51 connections. Core i5-12400. Running in docker on Rocky Linux 8.5 (bare metal).

~12% cpu to ~4% cpu processing ~12.5 Mbps across 12 connections. Running in docker on Ubuntu 20.04 in a VM with 4 virtual cpus.

chris-stolarz commented 2 years ago

> We'd need traces to confirm, but I expect a significant portion of the extra CPU is due to the thread pool looking for work and not finding any, in which case these cycles are basically spare that wouldn't otherwise be used (by this process).

You could try to limit the CPU spent by the thread pool by changing the spin limit: `DOTNET_ThreadPool_UnfairSemaphoreSpinLimit=0` (or some other number lower than the default).

It drastically improved the CPU usage of our system, but it does have a negative effect on the max throughput.

andrewaggb commented 2 years ago

That makes a large difference for me as well. It doesn't bring my results down to near-synchronous levels, but in my example above (running a single-threaded async receive loop) the one case drops from around 50-60% to 35%-ish. I didn't observe a decrease in throughput, but I'm sure there are cases where it could. I don't suppose there's a way at runtime to scale this based on load?

chris-stolarz commented 2 years ago

I don't think there is: https://github.com/dotnet/runtime/blob/080f708e7018f6c0529b6c875a44d84fc4d74419/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs#L34 We are running this setting in production on a couple of our services; only one showed throughput issues, and it could be caused by something else.

davidfowl commented 2 years ago

cc @kouvel

kouvel commented 2 years ago

Another option that may work for you on Linux in .NET 6 is `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1`. It bypasses the worker thread pool and processes IO completions on as many threads as the processor count. There are some downsides to using it (for instance, it doesn't handle stalling). On Windows, from .NET 7, IO completions are handled more similarly to Linux, and I see similarly higher CPU usage because they go through the worker thread pool. The above config can also be used on Windows from .NET 7.

I agree the worker thread pool should be using less CPU time in these types of cases. There are likely several opportunities for tuning it for using less CPU time in various types of cases, esp. in cases where the worker thread pool is not fully loaded.

ghost commented 2 years ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Author: andrewaggb
Assignees: -
Labels: `area-System.Threading`
Milestone: Future
andrewaggb commented 2 years ago

I tried that earlier in the thread at David Fowler's request. It reduced cpu usage to nearly synchronous-socket levels in my simple example (an async read loop), but in my actual application I found it wasn't usable as-is: connections appeared to be stalling or not responding. My actual application involves web api, fleck websockets, raw sockets of various kinds, etc., so it's significantly more complex and varied than my basic example that illustrates the cpu usage difference.

kouvel commented 2 years ago

Ah ok, missed that. What kind of CPU usage difference do you see in the actual application?

kouvel commented 2 years ago

Found those numbers above. In .NET 7 the thread pool requests fewer threads, so that may help reduce the CPU usage a bit in the actual application. In the test case, .NET 7 combined with `DOTNET_ThreadPool_UnfairSemaphoreSpinLimit=0` still uses ~3x the CPU on my Linux VM, but that's because there's only one connection, and it's kind of a worst case for the thread pool, where for each IO completion two extra threads spin up unnecessarily and look for work. In a real app the two extra threads spinning up may not contribute as much by comparison.

wfurt commented 2 years ago

I'm planning to take a look and (hopefully) get it working for 8.0. It seems like the benefits may be interesting.

andrewaggb commented 2 years ago

I saw about a 3x cpu usage reduction in my actual application (in various hardware configurations) when switching from async receiving sockets to a thread per connection with synchronous receives model. The application receives video surveillance feeds over a local network (typically 20-60 connections) and almost always totaling less than 100mbps total bandwidth.

I didn't try `DOTNET_ThreadPool_UnfairSemaphoreSpinLimit=0` on my actual app, but I could deploy a build with async sockets and try it out in the next few days.

davidfowl commented 1 year ago

@andrewaggb did you ever try it?

AachoLoya commented 1 year ago

Any update on this? I'm considering using .NET sockets on linux for a high-load/throughput application and wondering if it's production ready.

stephentoub commented 1 year ago

> wondering if it's production ready

It is.

> for a high-load/throughput application

The increased CPU utilization cited in this issue is when not under high load.

AachoLoya commented 1 year ago

So it's just idle spin-checking of some sort taking up the cpu?

I saw David Fowler's twitter post, from a couple of years ago, about looking into io_uring and aio. Are these integrated in .NET 6?

stephentoub commented 1 year ago

> So it's just idle spin-checking of some sort taking up the cpu?

Effectively, yes: it's primarily the thread pool searching for more work to do when it hasn't found any.

> Are these integrated in .NET 6?

No.

AachoLoya commented 1 year ago

What's the current plan for integrating aio or io_uring? Sorry to ask here, but I googled a lot and other than a few open issues on github there isn't anything definitive out there.

stephentoub commented 1 year ago

> What's the current plan for integrating aio or io_uring?

There's no "plan" currently; we experimented with doing so in the .NET 5.0 time frame, and there were both challenges with adapting io_uring at that time to the multithreaded nature of .NET and limited benefits seen from using it at that time. https://github.com/dotnet/runtime/issues/753 is still open and tracks experimenting with it again in the future.

mangod9 commented 1 year ago

Can this issue be closed? It doesn't look like there is anything actionable here.

wfurt commented 1 year ago

There are several options, @mangod9. One obvious one is to get `DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS` into a usable state, but that will probably not happen for 8.0 because of priorities and resource constraints.

mangod9 commented 1 year ago

Ok, moving out of 8 for now then. Thx!