caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

gzip middleware - downloads pages when the server errors #15

Closed · thomas4019 closed this 9 years ago

thomas4019 commented 9 years ago

When fastcgi has a problem and returns a 500 Internal Server Error, the browser downloads a .gz file whose contents are "Internal Server Error". Someone reported that this was happening on caddyserver.com as well, so I think the issue is related to gzip. https://news.ycombinator.com/item?id=9455468

mholt commented 9 years ago

Even though -- as you said -- it doesn't appear to be a problem exclusive to fastcgi, I did just notice that fastcgi returns the status code even though it wrote the entire response already. In other words, if the fastcgi responder gives back a status >= 400, Caddy returns that up the middleware chain, indicating that it hasn't handled the error, but it actually has. I think the return should be changed to return 0, nil.
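
For context (as described above), a middleware handler returns a status code and an error, and a non-zero status tells the layer above that the response still needs to be written. A rough sketch of that convention - the handler below is hypothetical, not the actual fastcgi code:

```go
package middleware

import "net/http"

// Handler mirrors the (status, error) convention discussed above: a handler
// returns a non-zero status only when it has NOT written the response itself,
// so an outer layer knows it still has to.
type Handler interface {
	ServeHTTP(w http.ResponseWriter, r *http.Request) (int, error)
}

// proxyToFastCGI is a hypothetical stand-in for the fastcgi handler. By the
// time we look at the upstream status, the upstream response (status,
// headers, body) has already been copied to w.
func proxyToFastCGI(w http.ResponseWriter, r *http.Request, upstreamStatus int) (int, error) {
	// ... upstream response already written to w here ...

	// Buggy behavior described above, for upstreamStatus >= 400:
	//   return upstreamStatus, nil // outer handler then writes a second error page
	// Proposed fix: the response is already written, so report it as handled.
	return 0, nil
}
```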

As for the rest of the gz problem, still looking into that...

mholt commented 9 years ago

Possible suspect, fileserver.go:

// Note: Errors generated by ServeContent are written immediately
// to the response. This usually only happens if seeking fails (rare).
http.ServeContent(w, r, d.Name(), d.ModTime(), f)

Haven't verified this theory yet, but with this critical call I can only hope an error doesn't occur; if it fails, there's no way to know... that I know of.

Again - not sure if this is the culprit. I'm not sure this would explain the gz downloads. More probing is needed...

mholt commented 9 years ago

It seems that after extended durations of heavy use, this bug appears. I saw it again this morning after the site got a lot of traffic from the other half of the world... restarting the server fixes it. I'll keep probing but others are welcome to look into this too.

mholt commented 9 years ago

If it is because we're out of file descriptors, I know a way to reproduce the problem (on Unix systems):

  1. Set ulimit -n 512 to lower the limit for open file descriptors
  2. Run caddy from the same shell
  3. In a different shell, run wrk -t256 -c512 -d10s http://localhost:2015 - this should use, on average, 2 connections per thread, and 512 connections is just enough to overtake the file descriptors; using -c511 doesn't cause "too many open files" in my testing
  4. wrk might segfault :smile:

Unfortunately, when I modified the code so it didn't open any files (it just served a byte array with the file contents already in it), I didn't see any improvement; as in, I still saw "too many open files" with the same ulimit -n value and wrk parameters. Apparently it isn't opening files that uses up the file descriptors, just the connections? (Honestly I'm so confused; I'll have to do some more searching around and reading.)
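
If sockets count against the same open-file-descriptor limit (which would explain hitting it with no files open), then at least the limit itself is easy to check from Go. A quick standalone check, on Unix - just an illustration, not part of Caddy:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Report the process's open-file-descriptor limit, which on Unix
	// covers sockets as well as regular files.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("open file descriptor limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
}
```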

mholt commented 9 years ago

Okay! Getting close.

When running the experiment above on the server which opens the file and serves it, the output is something like this:

  468907 requests in 10.00s, 1.07GB read
  Socket errors: connect 0, read 847, write 0, timeout 1629
  Non-2xx or 3xx responses: 389549

When repeating the experiment on the server which serves from memory only, and always writes a 200 status, the output is this:

  310931 requests in 9.99s, 3.98GB read
  Socket errors: connect 0, read 860, write 0, timeout 1695

In both cases, the server runs out of file descriptors, outputting errors like this:

2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 5ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 10ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 20ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 40ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 80ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 160ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 320ms
2015/04/30 05:53:03 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 640ms
2015/04/30 05:53:04 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 1s
2015/04/30 05:53:05 http: Accept error: accept tcp 127.0.0.1:2015: too many open files; retrying in 1s

But notice that in the first experiment, the vast majority of the responses were non-2xx. My logs (from the experiments run locally) showed a bunch of 404s. This would be consistent with comments like this, which report the same thing.

My hypothesis: when the server ran out of file descriptors, the connections that were already open (due to keep-alive, probably?) were able to request another page, but the server wasn't able to open the file. That explains the 404s, which I confirmed originated on this line.

The gz files that were downloaded are probably because the gzipped content was corrupted one way or another. The only place the server would have returned a 500 Internal Server Error (which I also saw when inspecting the downloaded gz files in a text editor) is in http.ServeContent(), where I lose control of error handling because it's kind of a black-box function. I suspect that one of the file operations in there was bombing (that function does write 500 errors under some conditions). I can't explain that part yet, but I'm almost sure this is what is going on.

Best solution: raise ulimit -n to > 1024. :smile:

Caddy solution: need to handle all errors properly, meaning inline http.ServeContent() and modify it so errors are handled our own way. We also need to handle the out-of-file-descriptors case more gracefully. Working on that...
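
In the meantime, one lighter-weight way to at least see what ServeContent writes, without inlining it, is to wrap the ResponseWriter and record the status code. This is only an illustrative sketch with hypothetical names (statusRecorder, serveFile), not the change Caddy actually made:

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// statusRecorder records the status code that http.ServeContent writes,
// since ServeContent handles its own errors internally and never returns them.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (sr *statusRecorder) WriteHeader(code int) {
	sr.status = code
	sr.ResponseWriter.WriteHeader(code)
}

func serveFile(w http.ResponseWriter, r *http.Request, path string) {
	f, err := os.Open(path)
	if err != nil {
		http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
		return
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
		return
	}

	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	http.ServeContent(rec, r, path, fi.ModTime(), f)
	if rec.status >= 400 {
		// ServeContent already wrote an error page; at least it's visible now.
		log.Printf("ServeContent wrote %d for %s", rec.status, path)
	}
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		serveFile(w, r, "index.html") // hypothetical file to serve
	})
	log.Fatal(http.ListenAndServe(":2015", nil))
}
```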

mholt commented 9 years ago

Figured it out. I was able to reproduce all reported symptoms.

Although http.ServeContent() is a potential problem because it does its own error handling, it was not the source of the errors this time - the Internal Server Error actually came from a defer-recover in the main handler that saves the server in case of a panic. I've fixed the error handling middleware so it doesn't panic, and I've fixed the fileServer so it reports a 404 only if the file is actually not found (before, it assumed that any error opening the file meant the file didn't exist); otherwise it now returns a 503, which is the status we should have been seeing.
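
The 404-versus-503 distinction comes down to checking why the open failed instead of assuming the file is missing. A minimal sketch of that check (simplified, with a hypothetical helper name, not the exact fileServer code):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// statusForOpenError maps an os.Open failure to an HTTP status: only a
// genuinely missing file is a 404; anything else (such as "too many open
// files") is a server-side problem, hence 503.
func statusForOpenError(err error) int {
	if os.IsNotExist(err) {
		return http.StatusNotFound
	}
	return http.StatusServiceUnavailable
}

func main() {
	if _, err := os.Open("does-not-exist.txt"); err != nil {
		fmt.Println(statusForOpenError(err)) // prints 404
	}
}
```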

The gz file downloads were caused by gzip output that was corrupted by multiple calls to rw.WriteHeader. The extra call came from the top-level handler, which was recovering from a panic and wrote again after the gzipWriter was already closed.
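
A common way to defend a wrapping writer against exactly this failure mode is to make duplicate WriteHeader calls a no-op. The sketch below is just an illustration of that idea with hypothetical names, not necessarily the fix that was pushed:

```go
package main

import (
	"compress/gzip"
	"net/http"
)

// gzipResponseWriter wraps a ResponseWriter with a gzip.Writer and ignores
// duplicate WriteHeader calls, so a late write (e.g. from a panic-recovery
// handler) can't corrupt the already-gzipped response.
type gzipResponseWriter struct {
	http.ResponseWriter
	gz          *gzip.Writer
	wroteHeader bool
}

func (g *gzipResponseWriter) WriteHeader(code int) {
	if g.wroteHeader {
		return // header (and gzipped body) already sent; drop the duplicate
	}
	g.wroteHeader = true
	g.ResponseWriter.WriteHeader(code)
}

func (g *gzipResponseWriter) Write(b []byte) (int, error) {
	if !g.wroteHeader {
		g.WriteHeader(http.StatusOK)
	}
	return g.gz.Write(b)
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Encoding", "gzip")
		gw := &gzipResponseWriter{ResponseWriter: w, gz: gzip.NewWriter(w)}
		defer gw.gz.Close()

		gw.Write([]byte("hello, gzipped world\n"))
		gw.WriteHeader(http.StatusInternalServerError) // duplicate: safely ignored
	})
	http.ListenAndServe(":2015", nil)
}
```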

I'm pushing fixes now. I've not been able to reproduce any of the symptoms after these fixes were implemented. (Boy, I can't wait to get around to refactoring some things and getting all these crazy edge cases under test.)