cloudflare / workerd

The JavaScript / Wasm runtime that powers Cloudflare Workers
https://blog.cloudflare.com/workerd-open-source-workers-runtime/
Apache License 2.0

Expected Performance/Throughput #72

Open billywhizz opened 2 years ago

billywhizz commented 2 years ago

I have been doing some benchmarking of workerd against other JS runtimes and I am not seeing very good results. Do you have any recommendations for config for peak performance for benchmarking, or an expectation of what kind of numbers we should see?

RPS using wrk for a simple hello world benchmark is currently showing only one tenth of what I see with Deno or Bun.sh, and tail latencies are also very high.
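
For reference, the kind of Worker being measured here is essentially the hello world from the workerd samples. A minimal sketch (file names, port, and compatibility date are illustrative, not the exact sample):

  // hello.js - service worker syntax; replies to every request with a fixed body
  addEventListener('fetch', event => {
    event.respondWith(new Response('Hello World'));
  });

  # hello.capnp - binds the script to an HTTP socket on port 3000
  using Workerd = import "/workerd/workerd.capnp";
  const config :Workerd.Config = (
    services = [ (name = "main", worker = .mainWorker) ],
    sockets = [ (name = "http", address = "*:3000", http = (), service = "main") ]
  );
  const mainWorker :Workerd.Worker = (
    serviceWorkerScript = embed "hello.js",
    compatibilityDate = "2022-10-01",
  );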

Memory usage is also very high compared to Deno - ~220 MB for a simple hello world.

kentonv commented 2 years ago

Hi @billywhizz,

A couple questions:

  • What platform are you testing on? (Mac/Linux? Intel/ARM?)
  • Are Deno and Bun configured to use multiple threads in your setup?
  • Did you build the binary yourself or did you use one from npm? If you built it, what flags did you use?

As noted in the readme, workerd really isn't ready for benchmarking yet, as we know there is a bunch of low-hanging fruit in terms of performance tuning. Our internal build (which is old, not bazel-based, and a huge mess, but does a lot more tuning) actually produces much faster binaries right now. Some of the things we need to do here include:

  • Enable distributing load across multiple threads/cores. Currently workerd uses only a single thread, so to utilize multiple cores you would need to run multiple instances of workerd (see the sketch after this list).
  • Use a better memory allocator. Currently workerd uses the system allocator, but we know tcmalloc or jemalloc is likely to produce much better results.
  • Tune compiler flags like LTO (link-time optimization).
  • On Mac, use kqueue instead of poll. I wrote capnproto/capnproto#1555 (Add support for kqueue in UnixEventPort) a few days ago, but this is not integrated into workerd yet.
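
Until those land, a rough way to approximate the first two items from the command line is something like this (a sketch, not an officially supported setup; the jemalloc path and the per-port config file names are assumptions that vary by system):

  # one workerd instance per core, each config bound to a different port,
  # with jemalloc preloaded in place of the system allocator
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 workerd serve config-3000.capnp &
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 workerd serve config-3001.capnp &
  # ...then spread load across the ports with whatever reverse proxy or load balancer you already use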

billywhizz commented 2 years ago

Thanks for the detailed response, Kenton! I didn't include details as I wanted to see if there were any recommendations first. The benchmarks I have run are on a Core i5, with Ubuntu 22 in Docker in privileged mode running on an Ubuntu 18 host (i.e. my laptop!).

They are all single-process, serving a very basic hello world response. I used your example for workerd.

I used the workerd from npm as I was having issues getting it to build from scratch, but I can try that too once I have it building.

AFAIK Deno uses a separate thread for IO and Bun is all on one thread. Node.js 16 and 18 also show around 3-4x better throughput on a single thread. I'll see if I can share full results later today.

Congrats on the release - I'm looking forward to diving into it in more detail.

kentonv commented 2 years ago

Hmm, your results seem significantly worse than my own tests, though I'm not sure how much it's worth digging in until we've done some more tuning. I wonder if we accidentally published an unoptimized binary to npm. Since we don't have CI set up to do the publishing yet, there could have been some human error here.

kentonv commented 2 years ago

BTW, note that there's some inherent overhead from managing multiple isolates and having to enter/exit specific isolates, which single-isolate runtimes don't have to deal with. So we shouldn't expect parity on this kind of benchmark, but it should be much closer.

The memory usage is a separate issue, but it is something we've noticed and are working on. Basically, V8 isn't garbage-collecting aggressively enough by default. You can tune this with certain V8 flags, but we should make it work better out-of-the-box.

kentonv commented 2 years ago

Also, as always, note that benchmarks like this may not be telling you anything useful when it comes to real-world use. A "hello world" benchmark is essentially benchmarking the HTTP implementation, but in a real application the HTTP implementation is likely a tiny fraction of overall CPU usage, so having a slightly faster or slower HTTP stack isn't going to make a big difference. In real apps you're going to spend most of your time executing JavaScript, and V8 is what ultimately matters there.
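
To make the same point concrete: a handler that does nontrivial work per request, like the sketch below, shifts the benchmark from the HTTP stack to V8 itself (the hash loop is just a hypothetical stand-in for real application logic):

  // work.js - spends its per-request time in JavaScript rather than in the HTTP layer
  addEventListener('fetch', event => {
    const payload = 'x'.repeat(16 * 1024);   // pretend input
    let hash = 0;
    for (let i = 0; i < payload.length; i++) {
      hash = (hash * 31 + payload.charCodeAt(i)) | 0;  // trivial rolling hash
    }
    event.respondWith(new Response('hash: ' + hash));
  });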

billywhizz commented 2 years ago

Thanks for the responses. Yes, it is early days, and these microbenchmarks are not really applicable to real-world scenarios as you say, but they do tend to flag up overhead when done comparatively, and for some use cases that extra latency can be important, especially when you are being billed by the second for it.

When benching with this:

wrk -c 256 -t 2 -d 30 http://127.0.0.1:3000/

I get ~12k RPS and ~40ms P99 latency. That's about 0.35x of Node.js throughput and 20x Node.js P99 latency. I am on a pretty old kernel, so I'll try to test on a more recent setup.

Wallacy commented 1 year ago

Also, regarding the allocator point: it is worth considering https://github.com/microsoft/mimalloc as well, as it provides guard pages, randomized allocation, and encrypted free lists.