caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

High GC pressure causing slowdowns (was: gzip awfully slow) #735

Closed kennylevinsen closed 7 years ago

kennylevinsen commented 8 years ago

1. What version of Caddy are you running (caddy -version)?

Caddy 0.8.2

2. What are you trying to do?

Simply running a comparison between my own, simple Go web server and Caddy in various modes.

My tests were run with https://github.com/rakyll/boom:

boom -n 10000 -c 100 http://mysite/index.html

3. What is your entire Caddyfile?

0.0.0.0:8080
root /somewhere
log /var/log/caddy.log
gzip

4. How did you run Caddy (give the full command and describe the execution environment)?

./caddy -conf Caddyfile

5. What did you expect to see?

A total execution time for the test of around 12s ±5%, which is what it takes my own Go server to finish the test when set up to serve from disk with gzip enabled (it SHA256-hashes and gzips the content on every request).

6. What did you see instead (give full error messages and/or log)?

A total execution time of 39.6s ±5%. That's a lot. The same run against Caddy without gzip enabled takes 6.8s ±5% in comparison. With gzip, Caddy is about 3 times slower than my server, and while I suspected my server would be faster, I had expected a few percent, not a factor of 3.

I don't use Caddy myself, but I thought I'd give you a heads-up about my findings. I have not investigated why Caddy is so slow in this case, but it should be possible to reach speeds similar to my server, which isn't doing anything magic when serving data from disk.

mholt commented 8 years ago

Can you run your test again without the log directive? I feel like that's going to add noise to your tests. (Also you don't need -conf Caddyfile as that is the default.)

kennylevinsen commented 8 years ago

A single run with gzip, but without access logs: 37.72s. Note that the non-gzip run at ~6.5s also included access logs.

kennylevinsen commented 8 years ago

(But as a performance test against my own server, I wanted it to include access logs. Otherwise, it wasn't comparing real setups.)

kennylevinsen commented 8 years ago

Something I just noticed: Caddy uses level 9 by default, not the gzip package's default. For the 10KB input file, Caddy outputs 4498 bytes, while the default level my own server uses outputs 4563 bytes, an almost unnoticeable difference. I'll try running Caddy at a lower level, although the docs don't seem to state exactly what the default level is...

UPDATE: Tried with level 6 using the following Caddyfile:

0.0.0.0:8080
root /somewhere
gzip {
   level 6
}

But the output size does not change, nor does the execution time. The setting does not seem to be taking effect.
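(For reference, a minimal sketch of how a compression level is chosen with Go's standard compress/gzip package; this is not Caddy's middleware code, and the helper below is made up for illustration. Valid levels range from gzip.BestSpeed (1) to gzip.BestCompression (9), and gzip.DefaultCompression (-1) resolves to level 6 internally.)

package gziplevel

import (
	"compress/gzip"
	"io"
	"net/http"
	"os"
)

// serveFileGzipped compresses a file at an explicit level, e.g. 6 instead of 9.
func serveFileGzipped(w http.ResponseWriter, path string, level int) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w.Header().Set("Content-Encoding", "gzip")
	gz, err := gzip.NewWriterLevel(w, level)
	if err != nil {
		return err // only fails for an out-of-range level
	}
	defer gz.Close()

	_, err = io.Copy(gz, f)
	return err
}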

mholt commented 8 years ago

Thanks for trying a few things. We'll look into this soon, unless somebody wants to dive deeper sooner. (I haven't spent a lot of time in optimizations yet since I'm still trying to get the code base stable.)

I agree we probably don't need the highest compression level; could someone do some profiling (pprof?) and see where the CPU time is spent?
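For anyone who wants to try, here is one generic way to expose profiling data from a Go program using only the standard library's net/http/pprof package (a sketch; Caddy may grow its own profiling flags, and the port here is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on a side port, then capture CPU data with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}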

weingart commented 8 years ago

ISTR that gzip level 5 or 6 was where it was tuned for speed vs. content size tradeoff. At level 9, it's going to be a lot slower.

-Toby.


xenolf commented 8 years ago

@joushou What was the payload you used to get those 30 seconds?

kennylevinsen commented 8 years ago

@xenolf: my index.html (from https://kl.wtf/). 10KB html, gzips to about 4.5KB. Nothing special. It's also ~40s, not ~30s, to make matters worse. :)

@weingart: I tried level 6 (config in the previous comment), which is the Go default and what I use in my own server that I compared against. I didn't see a change in time or file size, so maybe the setting is broken?

xenolf commented 8 years ago

@joushou I used the Caddyfile you provided as well as your index.html to run several tests using the boom settings you supplied. However, I failed to come anywhere near the numbers you are reporting, not even with CPU profiling enabled. The numbers are stable at around 12 seconds. The tests were executed using Go 1.6 on a 4-core x64 Debian 8.2 machine with 8 GB of RAM allocated, running Caddy 0.8.2 as well as the latest master.

mholt commented 8 years ago

@joushou Levels 6 and 9 are known to have roughly equivalent compression ratios for most content. Also, 10 KB is hardly a large enough file to see much difference in compression time in a benchmark run over the network.

(PS. gzip's default compression level is 6; that's what compress/flate's DefaultCompression resolves to.)

Is there any more information you could give us to help us reproduce the results you're seeing?

weingart commented 8 years ago

I think there is something else going on here. Possibly boom is not reading the whole body, and ends up doing a re-connect with gzip, but not without it?

kennylevinsen commented 8 years ago

@weingart boom does not show this behaviour when running against my own server while it serves gzipped content from memory (its normal operating mode), where the test completes in ~3.5s. That is not a fair comparison with Caddy, so to compare similar workloads I used a from-disk folder on my server, which makes it read the file and gzip it on every request. In that mode, my server completed the test in 12s yesterday.

@mholt Yes, but there is a slight difference in output between Caddy set to level 6 and my own server at Go's default, which is level 6. It's not big, but if both use level 6, shouldn't the downloaded file be exactly the same?

@xenolf now that is peculiar. The only notable difference is that the server I used to run the web server is a KVM instance with a single core and 1GB of RAM running Arch Linux.

I set up another Linode 1024 in Frankfurt, identical to the server I used initially. One peculiar thing is that they behave quite differently. Tested today against both boxes:

Old box:
   minihttp: 16.3s (was 12s over several runs yesterday)
   Caddy:    26.7s (was 37s over several runs yesterday)

New box:
   minihttp: 20s
   Caddy:    25s

Same VPS type and distribution, and apart from a bit of user configuration the servers are identical, with nothing else running on them. I suppose the loads on the hosts differ, but I cannot quite explain why the gap is consistently lower on the new box. It runs the same binaries, serving the same file, after all.

I'd like to provide you with the address for both, as well as login for the new box in case you would like to verify against my test conditions. Any place I can send you the info in private?

xenolf commented 8 years ago

@joushou Sure, you could message me on gitter if that's fine with you.

kennylevinsen commented 8 years ago

Ah, I made a small mistake in the new server configuration: I pointed Caddy at a non-existent directory. The 25s was therefore the time to serve a 13-byte error message, not the gzipped response. I'd say that's still a LONG time to serve an error message, but that's an issue for another day!

Pointing Caddy to the proper directory replicates the >30s response times.

@xenolf Yup, fine by me.

xenolf commented 8 years ago

I used the box provided by @joushou to run a profile. The profile showed that the problem has nothing to do with the gzip code at all; the gzip code performs comparably between the two servers. The underlying issue seems to be the high GC pressure Caddy generates under these circumstances, which shows especially on a low-end box. I will investigate further.

xenolf commented 8 years ago

The difference in behavior between the two servers comes from how each handles files internally. Caddy streams files from disk using FileServer. This causes many small reads and copies, as it reads chunks of data from disk and then gzips those chunks, which increases GC pressure due to the small allocations. minihttp, on the other hand, reads the whole file in one go and then gzips the whole buffer.

The second approach does perform much better overall for small files, especially in a low-performance environment, as it is less punishing on resources. However, the more you increase the file size, the more this approach suffers in raw performance as well as in memory consumption and response time. The minihttp approach has, by definition, a much higher time to first byte, while Caddy can respond right after the first chunk is gzipped.

To conclude my thoughts: the minihttp approach vastly outperforms chunk-based streaming for small files, while Caddy uses the better option for larger files. The optimal solution would be a middle way: implement our own "FileServer" with a size threshold, keeping chunk-based serving for large files while files smaller than the threshold get read into memory and gzipped as a whole. A sketch of that idea follows below.
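A bare sketch of that threshold idea, assuming a hypothetical 64 KB cutoff; this is not Caddy's actual file server, just an illustration of the two code paths:

package fileserve

import (
	"compress/gzip"
	"io"
	"io/ioutil"
	"net/http"
	"os"
)

// smallFileThreshold is a made-up cutoff; the right value would need benchmarking.
const smallFileThreshold = 64 << 10 // 64 KB

// serveGzipped reads small files in one shot and gzips the whole buffer,
// while larger files are streamed in chunks to keep memory use bounded.
func serveGzipped(w http.ResponseWriter, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}

	w.Header().Set("Content-Encoding", "gzip")
	gz := gzip.NewWriter(w)
	defer gz.Close()

	if info.Size() <= smallFileThreshold {
		// Small file: one read, one compress, few allocations.
		buf, err := ioutil.ReadAll(f)
		if err != nil {
			return err
		}
		_, err = gz.Write(buf)
		return err
	}

	// Large file: stream, so memory use stays roughly constant.
	_, err = io.Copy(gz, f)
	return err
}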

Now on to some charts...

Please keep in mind that the following graphs and numbers come from a microbenchmark, which is in no way representative of real application performance; they are here to illustrate the difference between the implementations.

The machine used was a dual-core x64 Arch Linux box with 16GB of RAM, with the following configurations for Caddy / minihttp:

Caddyfile

0.0.0.0:8080
root root/
log caddy.log
gzip

minihttp.toml

root = "./web"
logFile = "minihttp.log"
logLines = 16384

[http]
        address = ":9090"

For minihttp, the "fancy folder" feature was used to reload the file from disk on every request, to get a nearly even playing field.

1MB File

Increasing the file size to 1MB already shows Caddy outpacing minihttp in req/s as well as in memory consumption. Please keep in mind that req/s here means HTTP 200 responses from the server, not fully received files.

Caddy used ~150MB of RAM while doing ~300 req/s (chart: caddy_1mb_60s):

⇒  ./wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d 60s http://192.168.10.12:8080/index.html
Running 1m test @ http://192.168.10.12:8080/index.html
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   349.55ms  240.02ms   1.89s    73.74%
    Req/Sec     4.03      3.47    50.00     83.07%
  18142 requests in 1.00m, 101.62MB read
  Socket errors: connect 0, read 0, write 0, timeout 5
Requests/sec:    301.87
Transfer/sec:      1.69MB

while minihttp goes up to ~200MB while doing ~163 req/s (chart: minihttp_1mb_60s):

⇒  ./wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d 60s http://192.168.10.12:9090/f/index.html
Running 1m test @ http://192.168.10.12:9090/f/index.html
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   617.89ms  359.71ms   1.99s    67.67%
    Req/Sec     2.00      2.27    20.00     87.98%
  9805 requests in 1.00m, 55.45MB read
  Socket errors: connect 0, read 0, write 0, timeout 25
Requests/sec:    163.14
Transfer/sec:      0.92MB

2MB File:

Caddy: (chart: caddy_2mb_60s)

minihttp: (chart: minihttp_2mb_60s)

mholt commented 8 years ago

Excellent analysis, @xenolf. Thanks for diving into that while my local copy of Caddy is all gutted out :smile:

It seems the initial report applies only to one specific case: gzipping lots of tiny files in a server that's under very high load and has low-end specs.

I think our current approach is probably best for the general case: Caddy won't blow up with high memory usage, because memory use remains fairly constant thanks to streaming. As noted, it does increase allocations significantly with lots of little gzipping going on. A caching plugin could be written which stores small files in memory; there is discussion about that in #10 that we need not duplicate here.
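As an illustration only, a bare sketch of such a small-file cache (not an actual Caddy plugin; invalidation, eviction, and header handling are deliberately left out):

package smallcache

import "sync"

// cache keeps gzipped bodies of small files in memory, keyed by path, so
// repeat requests skip the disk read and re-compression.
type cache struct {
	mu    sync.RWMutex
	data  map[string][]byte // path -> gzipped body
	limit int               // per-file size cap in bytes
}

func newCache(limit int) *cache {
	return &cache{data: make(map[string][]byte), limit: limit}
}

func (c *cache) get(path string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	b, ok := c.data[path]
	return b, ok
}

func (c *cache) put(path string, body []byte) {
	if len(body) > c.limit {
		return // too large; leave it to the streaming path
	}
	c.mu.Lock()
	c.data[path] = body
	c.mu.Unlock()
}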

With that properly implemented, we can achieve a good balance of both scenarios: judicious memory use and higher speeds.

@joushou Thanks for bringing this up; we'd love your help and expertise in implementing such a middleware. Again, issue #10 is a good place to take that discussion.

kennylevinsen commented 8 years ago

@xenolf thanks for your analysis. I had not put much thought into the fact that I gzip in one go at all. I do so mainly because I'd like better ETags (SHA-256), and to partially reuse the mechanics for from-memory serving. The fancy-folder from-disk feature was a hack I made when I had to upload a 20MB file; I never intended to serve anything from disk. :)

One note: would the microbenchmark be affected by the additional work of the SHA-256 I run in those 1/2MB tests? The hashes are computed up front for in-memory payloads, but recalculated constantly for from-disk serving... If a benchmark of the file-reading technique rather than the whole server is of interest, the call to sha256.Sum256 in resource.updateAll could be removed. I suspect it noticeably inflates the time to first byte, and that for files a few MB in size the main issue becomes high memory consumption.

If I'm not mistaken, writing a file server that just used larger chunks would fix this without having multiple code paths. Say, reading into a 512K slice and sending it along the chain when full or at io.EOF. That is, of course, just an implicit threshold. Maybe a middleware that buffers writes to increase chunk sizes would even do the trick, albeit less optimally.

I take it that my server for testing is no longer needed?

kennylevinsen commented 8 years ago

@mholt I'll take a look. It won't be as simple as in minihttp, as minihttp is essentially a cache with infinite cache time, whereas a middleware might have to act more like a full caching proxy (just a very close one) to avoid messing with lower layers, which would require them to state their cacheability properly. Either that, or something more custom with a few knobs in the Caddyfile that can easily be set very wrong...

There are only 3 hard problems in computer science: Cache invalidation and naming things. :)

kennylevinsen commented 8 years ago

@mholt another note while on the train: I suspect this is not really isolated to gzip. I have not seen the profile, but I suspect that gzip just magnified the issue enough that my small test machine could not keep up. The smaller the chunks, the higher the overhead: more object allocations, more middleware round-trips that might allocate on their own, more net/http runs and syscalls. The larger the chunks, the better, up to some magic threshold. There's a big difference between 4K chunk writes and 256/512K chunk writes, and I believe this issue should be considered separate from the lack of a cache, even though a cache will hide it.

mholt commented 8 years ago

@joushou It's possible. Best I can tell so far, the buffer size is 32 KB (http.ServeContent calls io.CopyN, which calls io.Copy, which calls copyBuffer); although a custom-sized buffer can be used, I'm not sure how to change it at a level as high as gzip or net/http. Anyone know how to do this?

Also, not sure about the SHA performance; you'd have to profile/bench that.

kennylevinsen commented 8 years ago

Changing that chunk size would require opening an issue against http.ServeContent to make it use io.CopyBuffer directly, so callers can supply custom buffer sizes. The fix from Caddy's side is to not use http.ServeContent, which is the solution @xenolf mentioned: a custom file server. I believe the issue will show with any middleware that must process content, and is likely to cause problems with other middlewares and high-load scenarios in the future.
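A minimal sketch of that alternative, assuming a hand-rolled file server that bypasses http.ServeContent; the 512 KB figure is only an example:

package fileserve

import "io"

// copyWithLargeChunks copies with a caller-chosen buffer via io.CopyBuffer,
// instead of io.Copy's default 32 KB, so each request makes fewer, larger
// writes through the middleware chain and fewer small allocations.
func copyWithLargeChunks(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, 512<<10) // 512 KB per chunk
	return io.CopyBuffer(dst, src, buf)
}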

An important aspect to note, however, is that too big a chunk size will cause high memory consumption and CPU time for someone attempting a DoS, while too small a chunk size causes even greater CPU time through excessive GC. One must strike a balance. I would consider high GC pressure a problem, but the priority is entirely up to you.

As for SHA256, I was misreading some of @xenolf's numbers. I had erroneously seen the difference as being much smaller, so ignore that. SHA256 might not be the fastest hash in the world, but it's certainly not slow enough to compensate for Huffman-encoding 992KB more data than Caddy does before getting the first byte out the door. :)

Anyway, I'll experiment a bit with some techniques and chunk sizes inside minihttp for from-disk file delivery and report back with my findings. As for the cache, I might take a look at #10.

kennylevinsen commented 8 years ago

Okay, I have played with different things in minihttp. I suspect the only real difference in gzip handling was that mine was terrible, causing poor performance overall. I have now implemented from-disk gzip serving using an io.Copy from the bare os.File into a gzip.Writer, which is the same thing Caddy does inside http.ServeContent. However, I do not experience the same GC pressure as Caddy on the tiny test machine @xenolf used for testing; instead, it seems I have gained a performance boost.

Here are some tests with my new, Caddy-like handling (joushou/minihttp@73d471a5fa19a814dae080a5544ffdd35fddacdf, mholt/Caddy@c21ff8343c8bb5d290002057c9ec322df59c18cf). The test is performed over a home internet connection to a VPS, but the numbers are stable. One thing of note is that I see twice the maximum latency for some reason, yet still a third lower average latency. Not sure if that is of any importance.

minihttp (10KB input):

wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d30s http://.../f/index2.html
Running 30s test @ http://.../f/index2.html
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   219.80ms  125.27ms 790.48ms   72.60%
    Req/Sec     5.62      3.41    30.00     74.78%
  14040 requests in 30.10s, 56.77MB read
Requests/sec:    466.42
Transfer/sec:      1.89MB

Caddy (10KB input):

wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d30s http://...f/index2.html
Running 30s test @ http://.../f/index2.html
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   328.92ms   82.95ms 449.60ms   83.36%
    Req/Sec     3.06      1.86    20.00     92.27%
  9067 requests in 30.10s, 36.99MB read
Requests/sec:    301.26
Transfer/sec:      1.23MB

minihttp (1MB input):

wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d30s --timeout 20s  http://.../f/blob
Running 30s test @ http://.../f/blob
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.70s     1.85s   12.58s    67.26%
    Req/Sec     0.00      0.00     0.00    100.00%
  394 requests in 30.11s, 444.35MB read
Requests/sec:     13.09
Transfer/sec:     14.76MB

Caddy (1MB input):

wrk -H "Accept-Encoding: gzip" -t 100 -c 100 -d30s --timeout 20s  http://.../f/blob
Running 30s test @ http://.../f/blob
  100 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.36s   838.61ms  10.85s    66.23%
    Req/Sec     0.00      0.00     0.00    100.00%
  305 requests in 30.11s, 361.90MB read
Requests/sec:     10.13
Transfer/sec:     12.02MB

So, I don't think we have found the issue quite yet, but thanks for the hint to make my own from-disk file handling better. :)

SergeAx commented 7 years ago

I confirm a catastrophic performance loss (10x compared with gzip disabled) in a setup with small (0.5K) files on a very low-end machine (32-bit 4-core ARM Marvell Armada 370/XP, 2GB RAM), go version 1.7.4 linux/arm. I hope it will be fixed soon. Apart from that, thanks for the great work!

mholt commented 7 years ago

It seems this doesn't affect everyone -- if this is affecting you right now, a decent workaround would be to pre-compress your static files and put them in your site with .gz extensions; Caddy will serve them automatically (no gzip directive required; placing the files there is enough).
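For anyone wanting to script that workaround, here is a small sketch that writes a .gz sibling next to every file under a directory; the function name, root path, and skip rules are placeholders to adapt:

package pregzip

import (
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// precompress walks root and writes path + ".gz" next to each regular file,
// skipping files that are already compressed.
func precompress(root string) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || strings.HasSuffix(path, ".gz") {
			return err
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()

		out, err := os.Create(path + ".gz")
		if err != nil {
			return err
		}
		defer out.Close()

		gz, err := gzip.NewWriterLevel(out, gzip.BestCompression)
		if err != nil {
			return err
		}
		if _, err := io.Copy(gz, in); err != nil {
			return err
		}
		return gz.Close()
	})
}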

commarla commented 7 years ago

Hey,

We are encountering the same issue. With Caddy 0.9.5, we tried to serve a simple JavaScript file (17 KB).

Caddyfile:

*:80 {
  root /space/iadvize.com/livechat/
  gzip {
    level 6
  }
  log off
  tls off
}

With gzip level 6 enabled:

root@iadvizeweb01:~# hey -n 10000 -c 10  "http://localhost/iadvize.js"
684 requests done.
1370 requests done.
2068 requests done.
2776 requests done.
3488 requests done.
4185 requests done.
4881 requests done.
5567 requests done.
6259 requests done.
6954 requests done.
7657 requests done.
8352 requests done.
9040 requests done.
9737 requests done.
All requests done.

Summary:
  Total:    7.2252 secs
  Slowest:  0.0225 secs
  Fastest:  0.0016 secs
  Average:  0.0071 secs
  Requests/sec: 1384.0541

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.002 [1] |
  0.004 [653]   |∎∎∎∎∎∎∎∎
  0.006 [2779]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.008 [3076]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.010 [2041]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.012 [975]   |∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.014 [341]   |∎∎∎∎
  0.016 [106]   |∎
  0.018 [20]    |
  0.020 [7] |
  0.022 [1] |

Latency distribution:
  10% in 0.0041 secs
  25% in 0.0052 secs
  50% in 0.0068 secs
  75% in 0.0087 secs
  90% in 0.0107 secs
  95% in 0.0120 secs
  99% in 0.0147 secs

Without gzip:

root@iadvizeweb01:~# hey -n 10000 -c 10  "http://localhost/iadvize.js"
6714 requests done.
All requests done.

Summary:
  Total:    0.7422 secs
  Slowest:  0.0107 secs
  Fastest:  0.0002 secs
  Average:  0.0007 secs
  Requests/sec: 13473.0568
  Total data:   168580000 bytes
  Size/request: 16858 bytes

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.000 [1] |
  0.001 [8919]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.002 [974]   |∎∎∎∎
  0.003 [65]    |
  0.004 [17]    |
  0.005 [14]    |
  0.006 [0] |
  0.008 [1] |
  0.009 [3] |
  0.010 [3] |
  0.011 [3] |

Latency distribution:
  10% in 0.0003 secs
  25% in 0.0004 secs
  50% in 0.0006 secs
  75% in 0.0009 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0023 secs

The ratio is huge: 10x between gzip enabled and disabled.

wendigo commented 7 years ago

@commarla I suggest using pre-gzipped / pre-Brotli-compressed files (https://github.com/mholt/caddy/pull/1289)

mholt commented 7 years ago

See #1618; my best guess is that it will resolve most of the GC issue.