SuaveIO / suave

Suave is a simple web development F# library providing a lightweight web server and a set of combinators to manipulate route flow and task composition.
https://suave.io

WebSocket performance issue with varying loads #711

Open haraldsteinlechner opened 5 years ago

haraldsteinlechner commented 5 years ago

I encountered two (most likely independent) performance problems with Suave WebSockets. A setup that demonstrates the problem can be achieved as follows:

In this setup the WebSocket example has been modified to show the problem in a concise manner. For demonstration, use the test repo, start the modified websocket example, and open Blub.html (available in the Websocket directory) in a web browser. In Chrome you will see the behaviour captured in the screenshot (blub).

Problem 1:

As you will see, a ping-pong round trip typically takes 0.21 ms, but suddenly the application starts printing long-iteration warnings showing more than 500 ms of round-trip time. In the output log you will see entries like these:

The thread 0x25f0 has exited with code 0 (0x0).
The thread 0x47c8 has exited with code 0 (0x0).
The thread 0x4e3c has exited with code 0 (0x0).
The thread 0x41e8 has exited with code 0 (0x0).
The thread 0xeb4 has exited with code 0 (0x0).
The thread 0x4fac has exited with code 0 (0x0).
The thread 0x868 has exited with code 0 (0x0).
The thread 0x4af8 has exited with code 0 (0x0).

On a machine with more CPUs, or with a higher MinThreads setting on the thread pool, the problem goes away. We will shortly submit an issue to coreclr. @krauthaufen

The problem is simply the way async operations work on sockets in combination with varying workloads as simulated by the two kinds of sockets.
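For reference, the MinThreads workaround mentioned above looks roughly like this (a sketch only; the value 64 is purely illustrative, not a recommendation from this report):

open System.Threading

let raiseMinThreads () =
    // read the current minimums so we never lower them
    let mutable workers = 0
    let mutable ios = 0
    ThreadPool.GetMinThreads(&workers, &ios)
    // bump both pools; 64 is an arbitrary illustrative value
    ThreadPool.SetMinThreads(max workers 64, max ios 64) |> ignore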

Problem 2:

With many WebSocket connections, from approximately 10 upwards, connecting takes forever. You can see the connection time for 32 clients (as reported by Blub.html) in the screenshot; it is about 5 to 20 seconds.

22305.31 available workers: 32766, io's: 1000
[18:24:20 DBG] 127.0.0.1 connected. Now has 23 connected
[18:24:20 VRB] Reserving buffer at 172032, with 77 segments from call from Suave.Tcp.job
[18:24:20 VRB] Processing request... -> processor
[18:24:20 VRB] reading first line of request
[18:24:20 VRB] Reserving buffer at 180224, with 76 segments from call from Suave.Web.readMoreData
[18:24:20 VRB] reading headers
[18:24:20 VRB] Freeing buffer at 180224 from Suave.Web.split. Free segments 77.
[18:24:20 VRB] parsing post data
[18:24:20 VRB] Processed request. <- processor
22356.27 available workers: 32765, io's: 1000
22407.22 available workers: 32766, io's: 100

This is most likely due to contention in the concurrent buffer management.

ademar commented 5 years ago

@haraldsteinlechner thanks for taking the time to file this bug report; I'll get to it as soon as I get back to civilization (in a couple of days) as I'm currently traveling with limited internet access. I think we will find the source of contention in the buffer manager.

haraldsteinlechner commented 5 years ago

Just a quick update on problem 1: we investigated further and rewrote Suave internals to use the new Socket.ReceiveAsync(buffer) (available in netstandard2.0 and net471) instead of the SocketAsyncEventArgs machinery it currently uses. However, we could not fix the problem in Suave yet; somehow thread pool threads still seem to get blocked during operation. In effect the problem looks like this one: http://joeduffyblog.com/2006/07/08/clr-thread-pool-injection-stuttering-problems/
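For illustration, the task-based receive path after that rewrite looks roughly like this (a sketch under our assumptions, not the actual patch; receiveInto is a made-up helper name):

open System
open System.Net.Sockets

// sketch: task-returning ReceiveAsync (SocketTaskExtensions) instead of SocketAsyncEventArgs
let receiveInto (socket : Socket) (buffer : byte[]) : Async<int> =
    async {
        let! read =
            socket.ReceiveAsync(ArraySegment<byte>(buffer), SocketFlags.None)
            |> Async.AwaitTask
        return read
    }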

I started a discussion here https://gitter.im/dotnet/coreclr although I'm not sure this is the right place.

ademar commented 5 years ago

Hi @haraldsteinlechner

I have been playing with your POC for a few days now and here is what I have found so far.

With regard to Problem 2

With many WebSocket connections, from approximately 10 upwards, connecting takes forever.

I do not see exactly this. Accepting 32 clients takes about 4.5 seconds, but after that it stays pretty linear as the number of clients increases.

startup for 9 clients took: 259.6000000194181ms.
startup for 16 clients took: 787.9999999713618ms.
startup for 18 clients took: 3755.299999989802ms.
startup for 32 clients took: 4803.599999984726ms
startup for 64 clients took: 4787.400000001071ms.
startup for 128 clients took: 4965.800000005402ms.
startup for 232 clients took: 5023.999999975786ms.

connection-time-plot

The reason it does this does not seem to reside within Suave; as far as I can see, the time is spent waiting for accept to return. I rewrote the accept loop on its own dedicated thread with no F# asyncs and it exhibited the same connection-time signature; it also did so with plain blocking accepts.
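For context, that dedicated-thread experiment was roughly of this shape (a sketch, not Suave's actual accept loop):

open System.Net.Sockets
open System.Threading

// sketch: accept on a dedicated background thread using plain blocking accepts
let startAcceptLoop (listener : TcpListener) (handle : Socket -> unit) =
    let loop () =
        while true do
            let socket = listener.AcceptSocket()   // blocking accept, no F# async involved
            handle socket
    Thread(ThreadStart loop, IsBackground = true).Start()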

As to Problem 1, I have managed to get rid of the long round trip in my tree, I am now figuring out which one of my changes did it. I'll get back to you on this.

haraldsteinlechner commented 5 years ago

Cool, hope you can figure out problem 1 :)

ademar commented 5 years ago

OK, the source of problem 1, the stuttering, is in the actual POC.

Changing the following line in the definition of observer:

System.Threading.Thread.Sleep 50

to

do! Async.Sleep 50

makes the periodic long round-trips go away (except for the first one), which makes sense because Thread.Sleep actually blocks a thread.
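Concretely, the observer loop then looks something like this (a sketch assuming the loop shape from the POC):

let observer =
    async {
        while true do
            // Thread.Sleep here would pin a pool thread for the whole delay;
            // Async.Sleep releases the thread back to the pool while waiting
            do! Async.Sleep 50
    }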

Does that work for you ?

krauthaufen commented 5 years ago

Hey, thanks for the update, but I couldn't find the code you're referring to. The only call to Thread.Sleep I could find is in aFewTimes which should only be relevant for startup. Did I miss anything?

haraldsteinlechner commented 5 years ago

@krauthaufen I think the reference is to this one:

  let observer =
    async {
      do! Async.SwitchToNewThread()

And indeed you're right, this observer thread (hidden behind a task) increases the likelihood of the stalls. I tried a lot of things, but if I recall correctly it still happened in long test runs overnight. In the absence of a better small test case, let us assume that a second, blocking task is the problem.

Your analysis and our investigations show that we kind of have a deeper problem here. Just to give you some more info:

If this is a deeper problem (which I suspect), I'm sorry we bothered you with it in this issue.

What do you think?

@stefanmaierhofer @krauthaufen @aszabo314 @thomasOrtner

EDIT: Quick update: there is definitely a bug in SwitchToNewThread on .NET Core; this caused the test to produce the stutters. In our real application we got rid of this, but we still suffer occasional stalls.

ademar commented 5 years ago

Yeah you are right that we may be hitting a deeper problem.

Following the pattern here, I rewrote observer like this:

async {
    while true do
        // two mutable locals to receive the out parameters ('let mutable' cannot bind a tuple pattern)
        let mutable workers = 0
        let mutable ios = 0
        System.Threading.ThreadPool.GetAvailableThreads(&workers, &ios)
        printfn "%03.2f available workers: %A, io's: %A" sw.Elapsed.TotalMilliseconds workers ios
        do! Async.SwitchToNewThread()
        System.Threading.Thread.Sleep 50
        do! Async.SwitchToThreadPool()
}

But it still does not solve the problem.

krauthaufen commented 5 years ago

Hi, we figured out that SwitchToNewThread does not actually create a new thread in the netstandard build of FSharp.Core; it just uses a regular thread pool thread instead. This is caused by FX_NO_THREAD being defined in the netstandard build of FSharp.Core (no idea why).
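A quick way to see this (an illustrative check, not part of the original repro):

open System.Threading

// does SwitchToNewThread really leave the thread pool?
let checkSwitch () =
    async {
        do! Async.SwitchToNewThread()
        // with the netstandard FSharp.Core build described above this still prints true,
        // i.e. execution never actually left the thread pool
        printfn "on a thread pool thread: %b" Thread.CurrentThread.IsThreadPoolThread
    }
    |> Async.RunSynchronously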

Nonetheless, tasks are not scheduled in a fair way, and therefore these delays cannot be completely avoided. As an example, we simply ran a Parallel.For loop which computes something while requesting something from Suave. That way we were able to essentially block it; requests took up to a minute to complete.

Shouldn't Suave use a real lightweight thread library instead of the TPL? The main goal of a web server is to serve its clients in a fair way, not (as in the TPL) to optimize throughput. I don't actually know whether this is even possible, since all the low-level things return tasks... However, it would be worth a thought, since something like image processing using parallel primitives shouldn't prevent the web server from answering requests.
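As a stopgap on the application side (my suggestion, not something proposed elsewhere in this thread), heavy Parallel.For work can at least be capped so it cannot occupy every pool thread:

open System
open System.Threading.Tasks

// sketch: leave a couple of pool threads free for the server's continuations
let runCapped (iterations : int) (body : int -> unit) =
    let options = ParallelOptions(MaxDegreeOfParallelism = max 1 (Environment.ProcessorCount - 2))
    Parallel.For(0, iterations, options, Action<int>(body)) |> ignore

This only limits the damage, of course; it does not make the underlying scheduling fair.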

Cheers

haraldsteinlechner commented 5 years ago

Some news: the accept performance is most likely the same problem. In this case Suave's client tasks steal worker pool threads from the system, starving new clients. This stems from the fact that scheduling is not fair here; existing ping-pong connections block their pool threads, and we have to wait until the accept continuation can be executed. The problem manifests more drastically because Task and ThreadPool machinery are used as the workhorse for two different purposes:

// opens needed for this snippet (assumed)
open System
open System.Net
open System.Net.Sockets
open System.Threading
open System.Threading.Tasks

let testAsyncMadness () =
  let tcp = TcpListener(IPAddress.Loopback, 1000)
  tcp.Start()

  let server =
    let f () =
      tcp.AcceptSocket() |> ignore
      printfn "accepted"
      ()
    Thread(ThreadStart f, IsBackground = true).Start()

  let ppi () =
    let f () =
      let mutable sum = 0.0
      let steps = 2000000000
      let step = 1.0 / float steps
      let o = obj()
      let sw = System.Diagnostics.Stopwatch.StartNew()
      // dotnet provides the TPL; somewhere I use a lib which uses the TPL for parallel computing
      Parallel.For(0,steps, Func<float>(fun () -> 0.0), Func<int,ParallelLoopState,float,float>(fun i state local ->
        let x = (float i + 0.5) * step
        local + 4.0 / (1.0 + x * x)
      ), Action<float>(fun local -> lock o (fun _ -> sum <- sum + local))) |> ignore
      printfn "computed pi: %A and took: %As" (step * sum) sw.Elapsed.TotalSeconds
    let t = System.Threading.Thread(ThreadStart f)
    t.Start()
    System.Threading.Thread.Sleep 100 // get it goin
    t

  let background = ppi()

  let sw = System.Diagnostics.Stopwatch.StartNew()

  let client = new TcpClient()

  // here I use a different facet of tasks, actually for concurrency (here on the main thread); the continuation
  // fights for thread pool threads with the parallel programming part
  client.ConnectAsync(IPAddress.Loopback,1000).ContinueWith(fun (t : Task) ->
    sw.Elapsed.TotalSeconds |> printfn "connecting tcp listener took: %A (%As)" t.Status
  ) |> ignore
  background.Join()
  tcp.Stop()

  (*
  oops, both library parts do not compose in terms of performance:
  > testAsyncMadness();;
    accepted
    connecting tcp listener took: RanToCompletion (took 1.0295347s) (!!!)
    computed pi: 3.141592654 and took: 3.0281443s
    val it : unit = ()
  *)

Clearly, both parts of the program do not compose. Please note that the low-level implementation using FromBeginEnd behaves fine.
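For reference, that kind of low-level FromBeginEnd-based connect is roughly of this shape (a sketch under my assumptions, not the exact code behind testAsyncMadness2):

open System.Net
open System.Net.Sockets

// sketch: APM-style connect wrapped with Async.FromBeginEnd
let connectViaBeginEnd (client : TcpClient) (port : int) : Async<unit> =
    Async.FromBeginEnd(
        (fun (callback, state) -> client.BeginConnect(IPAddress.Loopback, port, callback, state)),
        (fun ar -> client.EndConnect ar))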

Also, with real threads or proper lightweight threads the expected performance would be:

> testAsyncMadness2();;
accepted
connecting tcp listener took: 0.0329048s
computed pi: 3.141592654 and took: 3.0007865s
val it : unit = ()

So in conclusion, if a thread pool thread is blocked for whatever reason, continuations may be delayed for an arbitrary amount of time (typically a multiple of 500 ms, the thread pool's thread injection interval).