plokhotnyuk closed this issue 4 years ago
The interactive graphs are really cool. And actually they pointed out something I think I can improve. Two of the skinny spikes in the middle are calls to change the interest ops on a connection's selection key, which also requires waking the selector if it's blocked. This gets called both when a connection has data to write and when it's finished writing, but 99% of the time these calls aren't actually needed; they only matter when the underlying write buffer can't hold everything the connection wants to write. I'm pretty sure I can refactor things a little to eliminate those calls when they're not needed.
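To make the idea concrete, here is a rough NIO sketch of that kind of refactoring (this is not Scalene's actual code, and the helper name is made up): only touch the key's interest ops, and wake the selector, when a write actually leaves data behind.

```scala
import java.nio.ByteBuffer
import java.nio.channels.{SelectionKey, Selector, SocketChannel}

// Hypothetical helper, not Scalene's API: register interest in OP_WRITE (and
// wake the selector) only when the socket could not accept the whole payload.
object WriteInterest {

  def writeAndMaybeRegister(
      channel: SocketChannel,
      key: SelectionKey,
      selector: Selector,
      data: ByteBuffer
  ): Unit = {
    channel.write(data)
    if (data.hasRemaining()) {
      // Partial write: the OS buffer is full, so we genuinely need OP_WRITE.
      if ((key.interestOps() & SelectionKey.OP_WRITE) == 0) {
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE)
        selector.wakeup() // wake the selector only when the ops actually changed
      }
    } else if ((key.interestOps() & SelectionKey.OP_WRITE) != 0) {
      // Fully flushed: clear OP_WRITE so the selector stops reporting writability.
      key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE)
    }
  }
}
```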
The fix I mentioned has now been merged in #9 along with a few other small improvements.
Great work, Dan!
Now the Scalene benchmark for JSON is able to sustain 250K requests per second (with less than 10ms at the 99th percentile and less than 50ms max latency) on the same environment.
Also, it is now much easier to spot that more than 13% of CPU cycles are spent on parking and unparking threads when working with LinkedBlockingQueue. It is highlighted in pink here:
Have you considered an option to use non-blocking concurrent queues like here, with some back-off strategy based on Thread.onSpinWait(), which is available since JDK 9 (and with a possible fallback to Thread.yield() for earlier JDK versions, using the sbt-multi-release-jar plugin)?
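A minimal sketch of what that back-off could look like, assuming compilation against JDK 9+ so Thread.onSpinWait() resolves directly (with sbt-multi-release-jar, a JDK 8 variant of the same class could call Thread.yield() instead); the class and method names are illustrative, not from Scalene:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Illustrative names only: poll a non-blocking queue and back off with
// Thread.onSpinWait() while it stays empty, yielding the CPU after a while.
final class SpinningConsumer[A <: AnyRef](queue: ConcurrentLinkedQueue[A], handle: A => Unit) {

  private val MaxSpins = 1000

  def drainLoop(): Unit = {
    var idleSpins = 0
    while (!Thread.currentThread().isInterrupted()) {
      val msg = queue.poll()
      if (msg ne null) {
        idleSpins = 0
        handle(msg)
      } else if (idleSpins < MaxSpins) {
        idleSpins += 1
        Thread.onSpinWait() // JDK 9+ hint that we are in a busy-wait loop
      } else {
        Thread.`yield`()    // after spinning for a while, give up the time slice
      }
    }
  }
}
```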
An archive with scalene-json.svg: scalene-json.zip
It's funny that you mention that since I was just looking into thread parking optimizations.
I have a branch that adds a simple spin-wait back-off before parking the event-loop thread: https://github.com/DanSimon/scalene/compare/0.1.3...exp/ev-backoff
Still doing some testing but I think it will mostly solve the issue. It should also greatly reduce parking caused by the LinkedBlockingQueue. I'll take a look at Thread.onSpinWait though; it may help provide a better solution.
Also, for now I'll avoid pulling in any external non-blocking queue, since reading the LinkedBlockingQueue is only done once per event-loop iteration and, even if it's not the most efficient implementation, the overhead shouldn't really matter once we've eliminated unnecessary blocking.
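For reference, a rough sketch of that spin-before-park idea (this is not the actual code from the branch; names and thresholds are made up): try non-blocking polls of the LinkedBlockingQueue for a bounded number of spins, and only then fall back to the blocking take, which is where the thread actually gets parked.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Not Scalene's actual code: a spin-wait back-off before letting the
// event-loop thread park on the blocking queue.
final class EventLoopInbox[A <: AnyRef](queue: LinkedBlockingQueue[A]) {

  private val MaxSpins = 200

  /** Returns the next message, spinning briefly before blocking. */
  def next(): A = {
    var spins = 0
    while (spins < MaxSpins) {
      val msg = queue.poll() // non-blocking
      if (msg ne null) return msg
      Thread.onSpinWait()    // JDK 9+ busy-wait hint
      spins += 1
    }
    // Nothing arrived while spinning: block, letting the thread be parked.
    queue.take()
  }
}
```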
Would you like to try this lightweight and efficient MPSC queue, embedded directly in the sources of your actor and scheduled to handle messages on a java.util.concurrent.ForkJoinPool?
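For context, a hedged sketch of the pattern being proposed (none of these names come from Scalene, and a plain ConcurrentLinkedQueue stands in for the dedicated MPSC queue): the actor keeps its mailbox in the queue and, whenever it goes from idle to non-empty, schedules a drain task on a ForkJoinPool.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.{ConcurrentLinkedQueue, ForkJoinPool}

// Illustrative names only: messages are still handled sequentially because at
// most one drain task is scheduled at a time.
final class PoolActor[A <: AnyRef](handler: A => Unit, pool: ForkJoinPool) {

  private val mailbox   = new ConcurrentLinkedQueue[A]
  private val scheduled = new AtomicInteger(0) // 0 = idle, 1 = drain task pending

  def send(msg: A): Unit = {
    mailbox.offer(msg)
    if (scheduled.compareAndSet(0, 1)) schedule()
  }

  private def schedule(): Unit =
    pool.execute(new Runnable { def run(): Unit = drain() })

  private def drain(): Unit = {
    var msg = mailbox.poll()
    while (msg ne null) {
      handler(msg)
      msg = mailbox.poll()
    }
    scheduled.set(0)
    // A message may have arrived after the last poll but before we went idle.
    if (!mailbox.isEmpty && scheduled.compareAndSet(0, 1)) schedule()
  }
}
```

Note that with a shared pool, drain tasks of different actors may run on different threads in parallel, which is the constraint raised in the reply below.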
Hey, sorry I took a bit of a break from this for a while. I did a bunch of testing and was able to get some improvements in the JSON test by adding some spin-waiting, but otherwise I didn't see much of a difference. I've actually had a lot of trouble getting reliable results, as I noticed that on some low-core CPUs it's difficult to properly max out the server threads, which then causes more thread parking.
When I've tested on a GCP 64-core machine and limited Scalene to only a few threads, it looks like thread-parking becomes a much smaller percentage of CPU time. Adding spin-waiting had almost no effect in those tests. I'd still be open to trying out other implementations but at least for now I probably won't work on that.
Also, using a ForkJoinPool in Scalene's actors won't work. The actors are intentionally designed to be locked to one thread, and Scalene relies on the fact that two actors sharing a single thread will never run in parallel.
https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Scala/scalene
I've tried the /json end-point locally with wrk2 and built flame graphs comparing scalene with wizzardo at fixed rates of 230K and 295K msg/sec respectively (50 connections, 2 requesting threads) on my notebook with an Intel® Core™ i7-7700HQ (~90% CPU usage).

Scalene:

Wizzardo:
Here is their archive attached for interactive browsing:
flamegraphs.zip