Closed — ajusa closed this 2 years ago
Nice! I've been doing some benchmarking and it looks like the mark and sweep GC performs significantly better. I was able to get current HEAD to 1.5 million QPS. With this PR this goes up to 1.65 million QPS! This is definitely a hot path.
I think there is even more room for improvement here; we can get rid of allocating this string completely. I'm currently playing with this: https://gist.github.com/dom96/a041fabecd346579744c3b78ba599ec9, with the following results:
```
name ............................... min time      avg time    std dv   runs
% formatting ...................... 39.835 ms     40.321 ms    ±0.164    x124
concating ......................... 15.964 ms     16.314 ms    ±0.139    x307
smart concating ................... 11.044 ms     11.285 ms    ±0.069    x443
pre-alloc concating ................ 5.884 ms      5.983 ms    ±0.043    x835
```
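The gist isn't reproduced here, but the winning "pre-alloc concating" variant presumably comes down to reserving the final capacity before appending, so no `add` ever triggers a reallocation. A minimal sketch (the proc name and header layout are illustrative, not taken from the gist):

```nim
# Hypothetical sketch: build a response string with its capacity reserved
# up front, so the subsequent `add` calls never reallocate.
proc buildResponse(statusLine, headers, body: string): string =
  # Reserve exactly the bytes we'll write: the three parts plus two CRLFs.
  result = newStringOfCap(statusLine.len + headers.len + body.len + 4)
  result.add statusLine
  result.add "\c\L"
  result.add headers
  result.add "\c\L"
  result.add body
```

The same appends on a default-initialized string would grow the buffer in steps instead, which is where the reallocation cost in the slower variants comes from.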
This will require some special casing in httpbeast for small responses, but I'm excited to see how much faster it will make it.
Yeah, the tricky bit is finding the things on the hot path - after that, we can just extract the bit of code that we need and use benchy to write a faster version. The only exception to this is OS/system calls; we don't have as much insight into those (such as their performance metrics).
If there are other hot paths anyone is able to find, opening an issue would be a good first step to getting others to optimize the code!
Inspired by #63. I noticed that a decent chunk of time is spent in string allocation, probably because each time we append to the string past its current capacity, Nim needs to reallocate it into a bigger buffer. In this case, since we know the sizes of most of the things we are allocating, we can simply preallocate the approximate size of the buffer we'll need.
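To illustrate the reallocation point, a small self-contained sketch (the sizes are made up): appending past the current capacity forces a reallocation, while `newStringOfCap` reserves the whole buffer in a single allocation up front.

```nim
# Both strings end up with identical contents; only the allocation
# behavior differs.
var naive = ""                     # capacity grows step by step as we append
var sized = newStringOfCap(1024)   # one up-front allocation (64 * 16 bytes)
for i in 0 ..< 64:
  naive.add "0123456789ABCDEF"     # may reallocate when capacity is exceeded
  sized.add "0123456789ABCDEF"     # never reallocates: capacity was reserved
doAssert naive == sized
```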
I used ~30 to refer to the number of bytes inside of the quotes. There's a bit of headroom there due to the size in bytes of the HTTP status code, but it shouldn't actually affect performance all that much. This probably improved httpbeast performance on my machine by a few percent?
This method of preallocating is about 25% faster than what's there right now from that last PR, though: