DanSimon / scalene

Fast, lightweight Scala Http framework
MIT License

Taking part in TechEmpower's benchmarks #7

Closed by plokhotnyuk 4 years ago

plokhotnyuk commented 4 years ago

https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Scala/scalene

I've tried the /json endpoint locally with wrk2 and built flame graphs that compare scalene with wizzardo at fixed rates of 230K and 295K msg/sec respectively (50 connections, 2 requesting threads) on my notebook with an Intel® Core™ i7-7700HQ (~90% CPU usage).

Scalene

image

Wizzardo

image

Here is an archive of both graphs, attached for interactive browsing:

flamegraphs.zip

DanSimon commented 4 years ago

The interactive graphs are really cool. And actually they pointed out something I think I can improve. Two of the skinny spikes in the middle are calls to change the interest ops on a connection's selection key, which requires also waking the selector if it's blocked. It gets called both when a connection has data to write and when it's finished writing, but 99% of the time these calls aren't actually needed, only when the underlying write buffer can't hold everything the connection wants to write. I'm pretty sure I can refactor things a little to eliminate those calls when they're not needed.
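A minimal Java NIO sketch of the idea described above, not Scalene's actual code: try the write first and touch the key's interest ops only when the channel could not take everything. The class and method names (`LazyWriteInterest`, `writeOrRegister`, `flushPending`) are hypothetical, and the sketch assumes it runs on the event-loop thread, so no `Selector.wakeup()` is needed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, for illustration only.
final class LazyWriteInterest {

    // Writes directly to the non-blocking channel. OP_WRITE is registered
    // only if the channel could not accept all of the data.
    // Returns true if the interest ops had to be changed.
    static boolean writeOrRegister(SelectionKey key, ByteBuffer data) throws IOException {
        WritableByteChannel channel = (WritableByteChannel) key.channel();
        channel.write(data);
        if (data.hasRemaining()) {
            // Slow path: the underlying write buffer is full, so ask the
            // selector to report when the channel becomes writable again.
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
            return true;
        }
        // Fast path: everything was written, the key is left untouched.
        return false;
    }

    // Called when the selector reports the key as writable.
    // Clears OP_WRITE once the pending data has been fully flushed.
    // Returns true if nothing is left to write.
    static boolean flushPending(SelectionKey key, ByteBuffer pending) throws IOException {
        WritableByteChannel channel = (WritableByteChannel) key.channel();
        channel.write(pending);
        if (!pending.hasRemaining()) {
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
            return true;
        }
        return false;
    }
}
```

With small responses such as the /json payload, the fast path should be taken almost every time, so the interest-ops calls disappear from the hot path.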

DanSimon commented 4 years ago

The fix I mentioned has now been merged in #9 along with a few other small improvements.

plokhotnyuk commented 4 years ago

Great work, Dan!

Now the JSON benchmark for Scalene is able to sustain 250K requests per second (with less than 10ms latency at the 99th percentile and less than 50ms max latency) in the same environment.

Also, it is now much easier to spot that more than 13% of CPU cycles are spent on parking and unparking threads when working with LinkedBlockingQueue. It is highlighted in pink here:

image

Have you considered using non-blocking concurrent queues like here, together with a back-off strategy based on Thread.onSpinWait(), which is available since JDK 9 (with a possible fallback to Thread.yield() for earlier versions of the JDK using the sbt-multi-release-jar plugin)?
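A rough sketch of what such a back-off could look like, assuming JDK 9+ and using the JDK's own `ConcurrentLinkedQueue` as a stand-in for the linked queue implementation. The class name, the spin/yield limits, and the 50µs park interval are arbitrary illustrative choices, not tuned values.

```java
import java.util.Queue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical helper, for illustration only.
final class BackoffPoll {

    // Polls a non-blocking queue, escalating from busy-spinning to yielding
    // to short timed parks while the queue stays empty.
    static <T> T take(Queue<T> queue, int spins, int yields) throws InterruptedException {
        int idle = 0;
        for (;;) {
            T item = queue.poll();
            if (item != null) {
                return item;
            }
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
            if (idle < spins) {
                Thread.onSpinWait();            // cheapest: hint to the CPU that we are spinning
                idle++;
            } else if (idle < spins + yields) {
                Thread.yield();                 // give other runnable threads a chance
                idle++;
            } else {
                LockSupport.parkNanos(50_000L); // last resort: short timed park
            }
        }
    }
}
```

A consumer would call something like `BackoffPoll.take(queue, 100, 10)` in its loop. Under sustained load the queue is rarely empty, so the park branch should seldom be reached.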

An archive with scalene-json.svg: scalene-json.zip

DanSimon commented 4 years ago

It's funny that you mention that since I was just looking into thread parking optimizations.

I have a branch that adds a simple spin-wait back-off before parking the event-loop thread: https://github.com/DanSimon/scalene/compare/0.1.3...exp/ev-backoff

Still doing some testing, but I think it will mostly solve the issue. It should also greatly reduce parking caused by the LinkedBlockingQueue. I'll take a look at Thread.onSpinWait though; it may help provide a better solution.

Also, for now I'll avoid pulling in any external non-blocking queue, since reading the LinkedBlockingQueue is only done once per event-loop iteration and even if it's not the most efficient implementation, the overhead shouldn't really matter once we've eliminated unnecessary blocking.
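To illustrate the "once per event-loop iteration" point, here is a small sketch that assumes the loop empties its inbox with the non-blocking `drainTo` call. How Scalene actually reads its queue is not shown in this thread, and `DrainOncePerIteration` and `drainAndHandle` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
final class DrainOncePerIteration {

    // Moves every queued message into `batch` with a single non-blocking
    // call, then handles them outside the queue's lock.
    // Returns the number of messages handled in this iteration.
    static <T> int drainAndHandle(LinkedBlockingQueue<T> inbox,
                                  List<T> batch,
                                  Consumer<T> handler) {
        batch.clear();
        int drained = inbox.drainTo(batch); // never blocks, so never parks the thread
        for (T message : batch) {
            handler.accept(message);
        }
        return drained;
    }
}
```

Because `drainTo` returns immediately when the queue is empty, the consumer side never parks in this pattern. Any remaining parking would come from wherever the loop itself blocks, for example in the selector.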

plokhotnyuk commented 4 years ago

Would you like to try this lightweight and efficient MPSC queue, embedded directly in your actor's sources and scheduled to handle messages on the java.util.concurrent.ForkJoinPool?

plokhotnyuk commented 4 years ago

ICYMI: Backward-Compatible Thread#onSpinWait with MethodHandles
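A sketch of the general technique the title refers to, not a copy of the linked article: resolve `Thread.onSpinWait` once through `MethodHandles`, and fall back to `Thread.yield` when running on a JDK older than 9. `SpinWaitCompat` is a hypothetical name.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical helper, for illustration only.
final class SpinWaitCompat {

    // Resolved once at class initialization. A static final MethodHandle
    // can be treated as a constant by the JIT compiler.
    private static final MethodHandle SPIN = resolve();

    private static MethodHandle resolve() {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType noArgsVoid = MethodType.methodType(void.class);
        try {
            // Present on JDK 9 and later.
            return lookup.findStatic(Thread.class, "onSpinWait", noArgsVoid);
        } catch (ReflectiveOperationException missing) {
            try {
                // Fallback for older JDKs.
                return lookup.findStatic(Thread.class, "yield", noArgsVoid);
            } catch (ReflectiveOperationException impossible) {
                throw new ExceptionInInitializerError(impossible);
            }
        }
    }

    // Call this inside a busy-wait loop instead of Thread.onSpinWait().
    static void spin() {
        try {
            SPIN.invokeExact();
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

This keeps a single artifact that compiles against JDK 8, which is an alternative to building a multi-release JAR.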

DanSimon commented 4 years ago

Hey, sorry, I took a bit of a break from this for a while. I did a bunch of testing and was able to get some improvements in the JSON test by adding some spin-waiting, but otherwise I didn't see much of a difference. I've actually had a lot of trouble getting reliable results: I noticed that on some low-core-count CPUs it's difficult to properly max out the server threads, which then causes more thread parking.

When I've tested on a GCP 64-core machine and limited Scalene to only a few threads, it looks like thread-parking becomes a much smaller percentage of CPU time. Adding spin-waiting had almost no effect in those tests. I'd still be open to trying out other implementations but at least for now I probably won't work on that.

Also, using a ForkJoinPool for Scalene's actors won't work. The actors are intentionally designed to be locked to one thread, and Scalene relies on the fact that two actors sharing a single thread will never run in parallel.
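The thread-affinity guarantee can be illustrated with a plain single-threaded executor standing in for an event-loop thread. This is only an analogy, not Scalene's actor implementation, and `PinnedActors` and `runDemo` are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo, for illustration only.
final class PinnedActors {

    // Two "actors" mutate the same unsynchronized counter. Because both are
    // pinned to one thread, their handlers can never interleave, so no
    // update is lost. Returns the final counter value.
    static int runDemo(int messagesPerActor) throws Exception {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        try {
            int[] shared = {0};                  // plain state, no locks or atomics
            Runnable actorA = () -> shared[0]++;
            Runnable actorB = () -> shared[0]++;
            for (int i = 0; i < messagesPerActor; i++) {
                loop.execute(actorA);
                loop.execute(actorB);
            }
            // Read the result on the loop thread itself, after all messages.
            return loop.submit(() -> shared[0]).get();
        } finally {
            loop.shutdown();
        }
    }
}
```

On a multi-threaded pool such as ForkJoinPool, the same two handlers could run at the same time on different workers, and the unsynchronized increments could race.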