denoland / deno

A modern runtime for JavaScript and TypeScript.

[Feature Request] Optimize `op_http_set_response_body_text` for large Latin-1 bodies #20409

Open josephrocca opened 1 year ago

josephrocca commented 1 year ago

I've been trying to optimize a server that is handling a very heavy load (hundreds of requests per second, >100kb response for about 20% of those requests, and the rest are tiny), and it seems like it's due to op_http_set_response_body_text:

[screenshot: profiler trace of the production server, dominated by op_http_set_response_body_text]

Even given the volume of requests and the large responses, I was surprised that this server is actually ~bottlenecked by HTTP (edit: maybe I shouldn't have been surprised - this issue originally reported a 5x difference, before disabling Deno's default compression). Comparing to Bun, it seems like Deno.serve could be at least ~2x faster than this.

I want to emphasise that what I'm investigating here is a real-world, production server, not a benchmark. The code below looks like the sort of irrelevant micro-benchmark that you'd see people on HN arguing about, but in my case I think it replicates the issue that I'm seeing with my actual production server.

// deno run --allow-net deno.js
Deno.serve({port: 3006}, request => {
  return new Response("a".repeat(1024*1024));
});

The equivalent Bun server is 2.5x faster:

// bun run bun.js
const server = Bun.serve({
  port: 3006,
  fetch(request) {
    return new Response("a".repeat(1024*1024));
  },
});

And here's the Deno client code I'm using to test both of those servers:

// deno run --allow-net client.js
for(let i = 0; i < 1000; i++) {
  let t = performance.now();
  // set Accept-Encoding header to disable Deno compression:
  await Promise.all(new Array(200).fill(0).map(_ => fetch("http://localhost:3006/", {headers:{"Accept-Encoding":"identity"}})));
  console.log(i*200, performance.now() - t); // requests completed so far; each batch of 200 takes ~110ms with Bun, ~270ms with Deno.
}

I keep the client the same, and switch between Deno and Bun, and Bun is ~2x faster.

Version info:

deno 1.36.4 (release, x86_64-unknown-linux-gnu)
v8 11.6.189.12
typescript 5.1.6

Top is Deno server, bottom is Bun server, Ubuntu 22.04:

https://github.com/denoland/deno/assets/1167575/736677f4-ca02-45d8-9f04-fe68863ab303

littledivy commented 1 year ago

What are you benchmarking here? The client or the server? The slowdown could be the client not able to keep up with the server. Use a tool like wrk or oha to stress the server.

josephrocca commented 1 year ago

@littledivy I'm using the same Deno client code in both cases - I think that answers your question? I.e. the client code stays fixed, and I only switch between the Deno server and the Bun server. The Bun server is ~2.5x faster (edit: originally measured as 5x, before disabling compression). IIUC, with the client held constant, the only way to explain that is that the Bun server itself is ~2.5x faster.

littledivy commented 1 year ago

Deno supports automatic response compression, Bun does not. Notice the size/request is much lower for the Deno server. Here are the benchmark results with compression disabled (using oha):

# Deno
Requests/sec:  2459.0733
Total data:    12.01 GiB
Size/request:  1024.00 KiB
Size/sec:      2.40 GiB

# Bun
Requests/sec:  4906.2981
Total data:    23.96 GiB
Size/request:  1024.00 KiB
Size/sec:      4.79 GiB

Still a lot slower but not 5x.

littledivy commented 1 year ago

Almost the same difference as a simple Hello World server.

The reason why op_http_set_response_body_text shows up in the trace here is because of compression. It is quite fast for uncompressed bodies.

josephrocca commented 1 year ago

@littledivy Ah, yep, I already have compression disabled on my production server (I actually needed to do that for a different reason), so op_http_set_response_body_text still seems to be the bottleneck.

Here's the bottom-up tree of the simple Deno benchmark with client code updated to disable compression:

[screenshot: bottom-up profile of the simple benchmark with compression disabled; op_http_set_response_body_text still dominates]

Maybe the above is to be expected (not sure where the bottleneck "should" be for a simple benchmark like this), but a 2x-2.5x speed difference is definitely not to be expected imo.

I've updated the original post with the new code/results.

mmastrac commented 1 year ago

Unfortunately returning string bodies is not an optimal path for us right now as we need to do a UTF-16 -> UTF-8 conversion on the strings which involves a large copy all at once.

We'd definitely like to address this, but in the meantime are you able to test your application using Uint8Array bodies instead? We are much faster in that path as we can avoid copies altogether.
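
For illustration, here's a minimal sketch of that suggestion, assuming the body (or large parts of it) can be encoded to bytes once and reused rather than rebuilt per request:

// Sketch only: encode the body to a Uint8Array once, outside the handler,
// so each request hands Deno bytes rather than a string.
const body = new TextEncoder().encode("a".repeat(1024 * 1024));

Deno.serve({ port: 3006 }, (_request) => {
  return new Response(body);
});

(If the body has to be built dynamically per request, the encode step still has to happen somewhere in user code.)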

josephrocca commented 1 year ago

@mmastrac Ah I see, thanks for the explanation! Potentially dumb question: If I convert my strings to Uint8Array then wouldn't I incur the same overhead that you do on your end? In my case my strings are sometimes HTML (constructed dynamically from DB queries), sometimes JSON that's the result of JSON.stringify on large objects, and sometimes image data URLs.

As a guess, maybe you mean that I should try to go straight to Uint8Array where possible? I think that should be possible with the data URLs, but not sure how that'd work for the other types of responses.

mmastrac commented 1 year ago

If it's possible, it'll definitely be faster.

I'm thinking through some ways we can improve our performance on large strings, but it's not a trivial fix due to the APIs we have available from v8.

I don't have a timeline for when we can fully address this, but it's definitely on our radar. We may need to go upstream to v8 to get the APIs we need.

One thing to note is that you're also testing a very "happy path" in your example here -- strings that are purely Latin-1 may have optimization paths in Bun that we don't have. If I change your example to use multi-byte UTF-8, I get very different results, with Deno about 3x faster than Bun:

Deno: 201 req/s (M2)

// deno run --allow-net deno.js
Deno.serve({port: 3006}, request => {
  return new Response("😀".repeat(1024*1024));
});

Bun: 67 req/s (M2)

// bun run bun.js
const server = Bun.serve({
  port: 3006,
  fetch(request) {
    return new Response("😀".repeat(1024*1024));
  },
});

I'm sure we can optimize the Latin-1 path a bit, but if you're using UTF-8 in your stack we'll unfortunately end up falling back to our slower paths for now.

josephrocca commented 1 year ago

Thanks @mmastrac! I'll see what I can optimise on my end according to your advice, and will keenly look forward to any little tricks you have to speed up op_http_set_response_body_text to help with the rest. I think my production server is basically serving almost exclusively Latin-1 (html, base64 data URLs, JSON w/ English text, numbers, arrays, etc.), which is probably why op_http_set_response_body_text seems to dominate on my server too (shown in the first screenshot).

Thanks again for your help! :pray:

mmastrac commented 11 months ago

Here's the likely plan of attack:

josephrocca commented 11 months ago

@mmastrac Thank you! :pray: Excited for this.

I don't know enough to comment on your plan, but RE 64kB - is this just a rough guess, or based on V8 internals? In case it's helpful, here's the distribution of response lengths behind the bottom-up "self times" in the original post in this thread. The x-axis shows the string lengths of the responses, in thousands of characters:

[histogram: distribution of response string lengths, x-axis in thousands of characters]

So about half of the responses are under 60k characters. Mentioning this just in case the 1MB responses I used in the minimal reproduction misled you into thinking that op_http_set_response_body_text only becomes an issue at the 1MB scale.

mmastrac commented 11 months ago

Thanks for that note! I'll modify the proposed plan to do some perf testing around this.

billywhizz commented 11 months ago

FYI, I ran some tests on this - comparing latest Deno, Bun, and a custom C++ runtime on the same V8 minor release as Deno.

From what I can see, the main issue here is a difference in how JSC and V8 garbage collection work, especially when it comes to mostly ASCII/Latin-1 strings.

If I move the creation of the string out of the request handler, then I see roughly the same performance for Bun and a very minimal V8 runtime, while Deno improves to 50% of the throughput of Bun.

const str = "a".repeat(1024*1024)

Deno.serve({port: 3000}, request => {
  return new Response(str);
});
deno:           1230 rps
bun:            2450 rps
custom v8/c++:  2600 rps

Interestingly, if I run Bun with the --smol command-line switch, Bun's throughput decreases from 2450 rps to 760 rps (with much reduced overall memory usage).

This is the only info I could find about this command-line switch for Bun/JSC without digging into the JSC source, but it seems the default (larger) JSC heap handles this particular scenario better than V8?

I think there are still some improvements that could be made inside the Deno Response handler, but it's also possible Bun is not conforming to spec here while Deno is - something to keep in mind.

I will also have a look later today at whether this remains an issue in a more realistic scenario. I am suspicious there is some kind of caching/re-use of these static strings under the hood on the Bun/JSC side.

PS - I also tried pretty much all of the V8 GC-related flags and couldn't find anything that made a difference for this particular scenario.

billywhizz commented 11 months ago

I've done some further testing on this and, as far as I can see, this is down to differences in UTF-8 encoding of strings between Bun/JSC and Deno/Node/V8.

I posted some benchmarks here, which you should be able to run without much difficulty. There is a google sheet here with results of a run I did on my laptop. It's quite interesting I think.

The code being run is essentially the same as the hot path of the web-server benchmark above:

// `size` and `runs` are the parameters swept by the benchmark
const encoder = new TextEncoder()
const dest = new Uint8Array(size)
for (let i = 0; i < runs; i++) {
  encoder.encodeInto('h'.repeat(size), dest)
}

These are the results from a run with the latest Bun, Deno, and Node.js, plus a custom V8 runtime for comparison.

[chart: bun-deno-node-spin UTF-8 encodeInto throughput comparison across string sizes]

This is basically the code the C++ runtime uses - it's not very safe of course, but it shows what I think would be the current practical limit for any V8-based runtime.

void spin::Utf8EncodeInto(const FunctionCallbackInfo<Value> &args) {
  Isolate *isolate = args.GetIsolate();
  Local<String> str = args[0].As<String>();
  if (str->IsOneByte()) {
    int size = str->Length();
    Local<Uint8Array> u8 = args[1].As<Uint8Array>();
    uint8_t* dest = (uint8_t*)u8->Buffer()->Data() + u8->ByteOffset();
    int written = str->WriteOneByte(isolate, dest, 0, size, 
      String::NO_NULL_TERMINATION);
    args.GetReturnValue().Set(Integer::New(isolate, written));
    return;
  }
  int written = 0;
  int size = str->Utf8Length(isolate);
  Local<Uint8Array> u8 = args[1].As<Uint8Array>();
  char* dest = (char*)u8->Buffer()->Data() + u8->ByteOffset();
  str->WriteUtf8(isolate, dest, size, &written, 
    String::NO_NULL_TERMINATION | String::REPLACE_INVALID_UTF8);
  args.GetReturnValue().Set(Integer::New(isolate, written));
}

int32_t spin::fastUtf8EncodeInto(void* p, struct FastOneByteString* const p_str,
  struct FastApiTypedArray* const p_buf) {
  memcpy(p_buf->data, p_str->data, p_str->length);
  return p_str->length;
}

As you can see, on the fastcall path, because we know we are dealing with a latin1 string and don't have to validate, I am just doing a memcpy.

Some things jump out from the results.

So, while I think there are probably some tweaks that could be made to Deno (where the spec allows) to improve perf at the lower sizes, it seems there is something about how V8 does the UTF-8 encoding that causes a big bottleneck for the 16 - 1024 (inclusive) and 131072 - 1MB (inclusive) ranges.
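
As a rough way to reproduce the shape of this locally, here's a small sketch of the kind of size sweep described above (the sizes and iteration counts are illustrative, not the original benchmark):

// Sketch: sweep string sizes to see where encodeInto throughput drops off.
// The string is recreated per iteration, matching the hot path above.
const encoder = new TextEncoder();
for (const size of [16, 256, 1024, 65536, 131072, 1048576]) {
  const dest = new Uint8Array(size);
  const runs = Math.max(16, Math.floor(2 ** 26 / size));
  const t = performance.now();
  for (let i = 0; i < runs; i++) encoder.encodeInto("h".repeat(size), dest);
  const secs = (performance.now() - t) / 1000;
  console.log(size, ((size * runs) / 1e6 / secs).toFixed(1), "MB/s");
}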

Maybe @littledivy's recent patches to V8 can help here once they land in rusty-v8 and Deno? I should be able to do some testing with this in the coming days.

This was the max memory usage of the runtimes when I ran the benchmark

runtime    max-rss
bun         511284
bun-smol    116532
deno         85724
node         72808
spin         53940
just         33044

The bun-smol run is from running Bun with the --smol flag, which uses a different heap designed to reduce memory usage. We can see that Bun with the standard JSC heap uses a lot of memory compared to the other runtimes, and that when we use the --smol flag to reduce memory usage, it performs much worse almost across the board.

Hopefully this is useful. I'll do some further digging into the V8 and JSC/Bun code to see if I can find any obvious difference in how UTF-8 encoding is done; maybe the V8 builtins from @littledivy might be useful here.

If anything looks off in my methods or results please let me know - I did all this in quite a hurry. :sweat_smile:

@lemire you might be interested in this.

lemire commented 11 months ago

I love this issue, thanks for pinging me @billywhizz

Surely payloads of a few bytes must be uncommon?

billywhizz commented 11 months ago

Shortly after posting this I realised (again) that it must be a GC issue - the fact that I am only doing a memcpy in the fastcall but am still seeing the slowdown versus Bun/JSC in those two ranges must point to something in GC/heap allocation, which is what I had assumed originally? :thinking:

Would it be worth testing with a different allocator?

@lemire Yes - the bench is artificial for sure, but I just wanted to see if there were different behaviours at different sizes, and it seems there are.

josephrocca commented 1 month ago

> Unfortunately returning string bodies is not an optimal path for us right now as we need to do a UTF-16 -> UTF-8 conversion on the strings which involves a large copy all at once. [...] I'm thinking through some ways we can improve our performance on large strings, but it's not a trivial fix due to the APIs we have available from v8. I don't have a timeline for when we can fully address this, but it's definitely on our radar. We may need to go upstream to v8 to get the APIs we need.

Guessing/hoping this is relevant to this issue:

> New v8::String::ValueView API in V8 12.8 allows zero-copy access to JS string contents. Deno and Node will likely see HTTP server throughput improvements from this.

https://x.com/rough__sea/status/1815768577300889921