Discussion point - Buffer recycling on readBuffer but not writeBuffer (maybe)

rbygrave commented 2 years ago

I could be wrong or low on caffeine but it looks to me like there is effective buffer recycling with readBuffer but not with writeBuffer at the moment? Am I reading that incorrectly or maybe it doesn't matter?

That is, it looks like with writeBuffer there is a decent amount of copying in Response.merge() and then wrap(). I get a sense we could avoid the byte[] result = new byte[size]; and look to recycle writeBuffer instead?

rbygrave commented 2 years ago

Along the lines of https://github.com/ebarlas/microhttp/pull/5

ebarlas commented 2 years ago

Great question and thanks for the explanatory PR.

There is subtle asymmetry between reading and writing in this regard.

With reading, there is one shared off-heap buffer that serves all connections. It can be quite large, since there is only one. As a result, Microhttp can pull large buffers out of the network efficiently. And every time a connection has data available, it is loaded into that buffer and then copied into a targeted ByteTokenizer, which grows only as needed, for the given connection.
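A minimal sketch of the read path described above, assuming one direct buffer shared across connections and a small per-connection accumulator. The names `SharedReadBufferSketch`, `ConnectionBytes`, and `onReadable` are hypothetical stand-ins, not Microhttp's actual classes, and the socket read is simulated with a plain `put`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical names throughout; this is not Microhttp's actual code.
final class SharedReadBufferSketch {
    // One large off-heap buffer shared by every connection on the event loop thread.
    static final ByteBuffer SHARED = ByteBuffer.allocateDirect(64 * 1024);

    // Per-connection accumulator that starts empty and grows only as needed.
    static final class ConnectionBytes {
        private byte[] data = new byte[0];
        private int size;

        void append(ByteBuffer src) {
            int n = src.remaining();
            if (size + n > data.length) {
                data = Arrays.copyOf(data, Math.max(size + n, data.length * 2));
            }
            src.get(data, size, n); // copy out of the shared buffer
            size += n;
        }

        int size() { return size; }

        byte[] toArray() { return Arrays.copyOf(data, size); }
    }

    // Simulates one read event. A real server would fill SHARED with
    // SocketChannel.read(SHARED); here the bytes are put directly.
    static void onReadable(ConnectionBytes conn, byte[] incoming) {
        SHARED.clear();
        SHARED.put(incoming); // incoming must fit in the shared buffer
        SHARED.flip();
        conn.append(SHARED);
    }
}
```

Because the shared buffer is drained into the connection's own storage before the next read event, a single large buffer can serve any number of connections on one thread.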

With writing, there is no easy way to do the equivalent. With your change, every connection has a 64K buffer just sitting idle at all times. Recall one of my benchmarks has 50,000 persistent connections. That's a lot of unused heap space! The goal is to strike the right balance between performance, efficiency, and scalability. In addition, since responses are effectively prepared in memory by the application, I'd like to avoid putting a hard limit on the size in Microhttp if possible.
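For a rough sense of scale, here is the arithmetic behind that concern, assuming "64K" means 65,536 bytes reserved per connection and nothing else is counted. The class and method names are illustrative only.

```java
import java.util.Locale;

// Back-of-the-envelope cost of idle per-connection write buffers.
final class IdleBufferCost {
    static long idleBytes(long connections, long bufferBytes) {
        return connections * bufferBytes;
    }

    public static void main(String[] args) {
        long bytes = idleBytes(50_000, 64 * 1024); // 3,276,800,000 bytes
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        // Prints: 3276800000 bytes (3.05 GiB)
        System.out.println(String.format(Locale.ROOT, "%d bytes (%.2f GiB)", bytes, gib));
    }
}
```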

I'm not sure if that was the best explanation. I'm curious to hear your thoughts.

rbygrave commented 2 years ago

Very good explanation.

every connection has a 64K buffer

Yes I agree that is massive !!!! ... and I'd prefer something different too.

With respect to response body and writeBuffer I have something more like ...

writeBuffer is split into writeHeaderBuffer and writeBodyBuffer (this stays per connection)
The writeBodyBuffer is exposed to handler code in the form of OutputStream (and maybe Writer)
Handler code (say serialising a json response) then gets to write directly (via the "response OutputStream") to the writeBodyBuffer (so it doesn't need to allocate a buffer itself and doesn't need the extra copy of response bytes to writeBodyBuffer)
The writeBodyBuffer is "smallish" but can maybe grow on demand if needed
When the handler code has finished writing the response it determines content length, writes headers to writeHeaderBuffer, writes both writeHeaderBuffer and writeBodyBuffer to the channel
Both writeHeaderBuffer and writeBodyBuffer are recycled (effectively per request)

As an example, let's say the writeBodyBuffer is 4 KB and my handler code generally returns 2 KB or less of json content. This should mean it gets to write that response content without needing to do any extra memory allocation or extra copying of the response bytes (per request).
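The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RecycledResponseBuffers`, `BodyBuffer`, `bodyStream`, `finish`, and `recycle` are hypothetical names, the buffers are plain `ByteArrayOutputStream` subclasses, and the header is still built from a `String` for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical per-connection response buffers, reused across requests.
final class RecycledResponseBuffers {
    // Exposes the internal array so it can be wrapped without an extra copy.
    static final class BodyBuffer extends ByteArrayOutputStream {
        BodyBuffer(int initialCapacity) { super(initialCapacity); }
        ByteBuffer asByteBuffer() { return ByteBuffer.wrap(buf, 0, count); }
        int capacity() { return buf.length; }
    }

    private final BodyBuffer header = new BodyBuffer(256);
    private final BodyBuffer body = new BodyBuffer(4 * 1024); // "smallish", grows on demand

    // Handler code (e.g. a json serialiser) writes the body straight into this stream.
    OutputStream bodyStream() { return body; }

    int bodyCapacity() { return body.capacity(); }

    // Called once the handler is done: content length is now known, so the
    // header can be written. Both buffers can go to the channel in one
    // gathering write, e.g. SocketChannel.write(ByteBuffer[]).
    ByteBuffer[] finish(int status, String reason) {
        String head = "HTTP/1.1 " + status + " " + reason + "\r\n"
                + "Content-Length: " + body.size() + "\r\n\r\n";
        byte[] h = head.getBytes(StandardCharsets.US_ASCII);
        header.write(h, 0, h.length);
        return new ByteBuffer[] { header.asByteBuffer(), body.asByteBuffer() };
    }

    // reset() keeps the backing arrays, so the next request reuses them.
    void recycle() {
        header.reset();
        body.reset();
    }
}
```

With a 4 KB initial body buffer and responses of 2 KB or less, the backing array is never reallocated; a larger response grows it once and the larger array is then reused.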

ebarlas commented 2 years ago

That's a reasonable idea.

However, I'm skeptical about the performance gains that would bring. Since Microhttp is limited to discrete requests and responses, the current approach seems like a good balance. Yes, there is an extra copy, but the cost is relatively low for small or medium sized payloads.

I'm worried that providing a stream-oriented abstraction would send a confusing message since fundamentally Microhttp does not support streaming. It will parse and buffer chunked transfer requests, but it never chunks responses.

rbygrave commented 2 years ago

skeptical about the performance gains

That is good and proper. It is kind of up to me to prove it via performance testing.

Performance testing with/without the change
Performance testing against jdk httpserver, jetty, undertow, netty, grizzly

Background: I'm just publishing what I believe will be the 2nd fastest java json parsing/generation library. Well, a chunk of it is a refactoring of the fastest library - so some folks will say I'm taking the fastest library and making something slower :) Anyway, the relevant part is that "buffer recycling" has a major performance impact at the top end / for the fastest ones. "Encoding cost" is obviously the other interesting part.

So my theory is that we should be able to see a measurable performance difference with:

Effective buffer recycling for the "buffer that handlers write their output to"
Reducing response body copy operations

... given a "decent" amount of response body, say initially testing at 1kb, 2kb, 4kb and going from there

My gut says this should relatively outperform jdk httpserver because I know it's "not fast" and I know it doesn't do buffer recycling or optimise encoding/decoding headers and the other http servers do those things. The interesting bit is to performance compare against Jetty, Undertow, Netty. The interesting (for me) experimental part is that with so little code and so few moving parts it will be reasonably easy to experiment with things like "header encoding/decoding" costs. As in, how fast can this go if we already have the known/common headers already encoded to byte[] and things like writing headers become mostly byte[] copy etc.
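To make the "known/common headers already encoded to byte[]" idea concrete, here is a small sketch. It assumes a fixed set of headers for a json response; the class, constants, and methods are hypothetical and not part of Microhttp.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: common header bytes are encoded once, then copied per response.
final class PreEncodedHeaders {
    static final byte[] STATUS_200 = ascii("HTTP/1.1 200 OK\r\n");
    static final byte[] CONTENT_TYPE_JSON = ascii("Content-Type: application/json\r\n");
    static final byte[] CONTENT_LENGTH = ascii("Content-Length: ");
    static final byte[] CRLF = ascii("\r\n");

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Writing the head becomes mostly byte[] copies into the output buffer.
    static void writeJsonHead(ByteBuffer out, int contentLength) {
        out.put(STATUS_200).put(CONTENT_TYPE_JSON).put(CONTENT_LENGTH);
        putDecimal(out, contentLength);
        out.put(CRLF).put(CRLF);
    }

    // Writes a non-negative int as ASCII digits without allocating a String.
    static void putDecimal(ByteBuffer out, int value) {
        int divisor = 1;
        while (value / divisor >= 10) {
            divisor *= 10;
        }
        for (; divisor > 0; divisor /= 10) {
            out.put((byte) ('0' + (value / divisor) % 10));
        }
    }
}
```

Whether this is measurably faster than encoding header strings per response is exactly the kind of thing the proposed testing would need to show.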

worried that providing a stream-oriented abstraction would send a confusing message

Yeah I buy that. Additionally I know that for example all the fastest json libraries have their own buffer recycling so it could distill down to ... use buffer recycling of the json library and we just have the extra copy cost. So I'll try and see if I can measure that.

That said, a bunch of existing libraries will have support for "serialise to OutputStream/Writer".

... but firstly need to see what can be measured here.

ebarlas commented 2 years ago

Thanks for the detailed explanation. I'm curious to see what you come up with.

ebarlas commented 2 years ago

I've already been doing some profiling, so I decided to turn my attention to the question of response serialization cost.

I profiled this Profile.java application using the Async Profiler in IntelliJ.

It starts a Microhttp event loop, establishes a persistent connection, and exchanges 50,000 requests and responses.

I tried different response sizes ranging from 1K to 16K. The results are all roughly the same on my machine: about 6-7% of the event loop thread CPU time.

Here are two Async Profiler flame graphs, the first looking at the event loop thread broadly and the second looking at Response.serialize().

The universal law of profiling surprise seems to hold here. Apparently ArrayList.iterator() is taking 55% of the CPU time in Response.serialize(). Very strange!

System.arraycopy() is only 1.7% of Response.serialize().

It seems to be confirming my suspicion that small buffer copies are very fast.

[Two screenshots: Async Profiler flame graphs, Screen Shot 2022-02-17 at 8:46:13 AM and Screen Shot 2022-02-17 at 8:55:39 AM]

rbygrave commented 2 years ago

I did a quick comparison to Jetty using a "hello" plain text response, 100 concurrent clients, and 1,000,000 requests. Microhttp at 60k rps, Jetty at 90k rps [Edit: jdk httpserver at 27k rps]. Need to do more investigation in a few directions.

Edit: Initial tweaks to writing response have not moved the needle, next step would be to look at the request side.

ebarlas commented 2 years ago

I see a comparable split when I run the plain-text hello world benchmarks from TechEmpower FrameworkBenchmarks (I submitted a PR for Microhttp).

Microhttp:

[ec2-user@ip-10-39-196-99 wrk]$ ./wrk -t1 -c100 -d10s http://10.39.196.164:8080/plaintext
Running 10s test @ http://10.39.196.164:8080/plaintext
  1 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   815.49us  356.19us  27.11ms   92.45%
    Req/Sec   118.29k     3.97k  122.84k    81.00%
  1176838 requests in 10.00s, 150.39MB read
Requests/sec: 117669.56
Transfer/sec:     15.04MB

Jetty:

[ec2-user@ip-10-39-196-99 wrk]$ ./wrk -t1 -c100 -d10s http://10.39.196.164:8080/plaintext
Running 10s test @ http://10.39.196.164:8080/plaintext
  1 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   417.24us    1.69ms 204.08ms   99.83%
    Req/Sec   166.76k     2.19k  169.39k    91.00%
  1659062 requests in 10.00s, 234.17MB read
Requests/sec: 165891.54
Transfer/sec:     23.41MB

ebarlas commented 2 years ago

A couple of other useful observations from the test above. The Jetty process used about 5x more CPU time as measured by ps (00:01:03 vs 00:00:13 after wrk completes).

My current theory is simply that Jetty is multi-threaded. From profiling, it's clear that multiple Jetty threads perform socket I/O concurrently.

I'm curious to see a performance comparison in a more constrained environment with fewer CPU cores.

Microhttp:

COMMAND                                                                                              %CPU  C  CP     TIME     TIME   PID %MEM   RSS   DRS  TRS    VSZ
./jdk-17.0.2/bin/java -cp microhttp-example-0.1-jar-with-dependencies.jar hello.HelloWebServer 8080  75.2 75 752 00:00:13 00:00:13 15872  1.5 243096 7403900 3 7403904

Jetty:

COMMAND                                                                                              %CPU  C  CP     TIME     TIME   PID %MEM   RSS   DRS  TRS    VSZ
./jdk-17.0.2/bin/java -jar jetty-example-0.1-jar-with-dependencies.jar                                354 99 999 00:01:03 00:01:03 16199  1.5 237244 9989876 3 9989880

ebarlas commented 2 years ago

I committed a small change to RequestParser to use a couple raw loops rather than regular expression patterns. ByteTokenizer already does most of the splitting and the regex pattern utility wasn't providing much.
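As an illustration of the raw-loop style, here is a minimal sketch that splits a request line into its three tokens without a regex. It is not the actual RequestParser code; `RequestLineSketch` and `parseRequestLine` are hypothetical names, and the input is assumed to be a line already isolated by the tokenizer.

```java
// Hypothetical sketch of regex-free request line parsing.
final class RequestLineSketch {
    // Splits "METHOD SP URI SP VERSION" on single spaces with one pass over the line.
    static String[] parseRequestLine(String line) {
        String[] parts = new String[3];
        int count = 0;
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == ' ') {
                // Reject empty tokens and anything beyond three tokens.
                if (i == start || count == 3) {
                    throw new IllegalArgumentException("malformed request line");
                }
                parts[count++] = line.substring(start, i);
                start = i + 1;
            }
        }
        if (count != 3) {
            throw new IllegalArgumentException("malformed request line");
        }
        return parts;
    }
}
```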

I also committed a change to avoid invoking the Selector.wakeup() system call if not strictly required.
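One way such a check could look is sketched below. This is an assumption about the general technique, not the committed change: it skips `Selector.wakeup()` when the task is submitted from the event loop thread itself, since that thread is by definition not blocked in `select()`. `WakeupSketch` and its members are hypothetical.

```java
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: only wake the selector when another thread submits work.
final class WakeupSketch {
    private final Selector selector;
    private final Thread loopThread;
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicInteger wakeups = new AtomicInteger(); // counted for demonstration only

    WakeupSketch(Selector selector, Thread loopThread) {
        this.selector = selector;
        this.loopThread = loopThread;
    }

    void submit(Runnable task) {
        tasks.add(task);
        // The loop thread will drain the queue on its next pass anyway,
        // so the wakeup is only needed for submissions from other threads.
        if (Thread.currentThread() != loopThread) {
            selector.wakeup();
            wakeups.incrementAndGet();
        }
    }

    int wakeups() { return wakeups.get(); }

    int pending() { return tasks.size(); }
}
```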

Here are the resulting profiling results on my machine. Again, using Profile.java.

This is fast approaching an optimal distribution of CPU time.

80% of CPU is spent on these 3 system calls:

43% sun.nio.ch.SocketChannelImpl.read(ByteBuffer)
30% sun.nio.ch.SelectorImpl.select(long)
7% sun.nio.ch.SocketChannelImpl.write(ByteBuffer)

[Screenshot: profiling results, Screen Shot 2022-02-21 at 12:17:43 PM]

rbygrave commented 2 years ago

Nice.

performance comparison in a more constrained environment with fewer CPU cores

We could use docker to locally run load with resource constraints - especially CPU e.g. 0.5, 1, 1.5, 2 ... (maybe there is an inflection point in there for RPS)

rbygrave commented 2 years ago

FYI: I have a hack / prototype hitting 100K rps.

It has:

A single ["Listener"] thread doing accept - onAcceptable would "assign a connection" to one of the workers (round robin).
Fixed number of "Worker threads" that do the onReadable/onWritable [the number of workers I'd expect to be based on Runtime.getRuntime().availableProcessors() when done properly; a Worker has its own Selector].
"Connections" are owned/assigned to a worker
Once the request has been read and parsed the Handler is executed (submitted to a separate Executor* which after execution puts the response back to the associated worker's selector)

Executor* - ultimately I expect this Executor to be Loom based.
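The Listener to Worker hand-off described above might be sketched like this. It only models the round-robin assignment, with connection ids standing in for accepted channels; the real prototype's Selectors, threads, and Executor are left out, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the Listener -> Worker hand-off; not the prototype's code.
final class ListenerWorkerSketch {
    // Stand-in for a worker thread that would own a Selector and its connections.
    static final class Worker {
        final int id;
        final Queue<String> assigned = new ArrayDeque<>(); // connection ids, for illustration

        Worker(int id) { this.id = id; }
    }

    private final List<Worker> workers = new ArrayList<>();
    private int next; // only touched by the single listener thread

    ListenerWorkerSketch(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            workers.add(new Worker(i));
        }
    }

    // One possible sizing rule, based on the available processors.
    static int defaultWorkerCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // What onAcceptable would do: pick the next worker in round-robin order.
    Worker assign(String connectionId) {
        Worker w = workers.get(next);
        next = (next + 1) % workers.size();
        w.assigned.add(connectionId);
        return w;
    }
}
```

Since a connection stays with the worker it was assigned to, each worker's read and write handling needs no locking against other workers.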

This approach might be good IF it scales down nicely in low resource environments (less than 1 core). For example, it would use 1 Worker when there is 1 processor (or Listener & Worker are in fact the same in that case). It should also scale when there are LOTS of connections with the use of Loom virtual threads.

Note: This Listener + Worker pool type setup is pretty common, I'm thinking you will be aware of it. On 8 processors, max rps was reached with 2 worker threads. Next step is to tidy up this code and remove some hacks etc.

Also, pretty sure you'd be interested in: https://github.com/bbeaupain/nio_uring

ebarlas commented 2 years ago

Thanks for sharing that. I've thought about doing something similar just to see how Microhttp measures up given the freedom of multi-threading.

However, I'm fairly committed to the single-threaded approach in Microhttp. It's a large part of what makes it simple, intuitive, and lightweight.

The single-threaded approach makes Microhttp fall short in benchmarks with no application workload. But in realistic environments with an application workload, the single-threaded approach of Microhttp ought to leave more spare cycles for the application to do its work. It would be interesting to see a comparison with a bit of computational work, like a cryptographic hash, scatter-gather networking, etc.

ebarlas / microhttp

Discussion point - Buffer recycling on readBuffer but not writeBuffer (maybe) #4