caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Caddy 2 stops answering requests after a few hours #3725

Closed AtjonTV closed 4 years ago

AtjonTV commented 4 years ago

I have been running Caddy 2 in production for three weeks and have observed a bug where Caddy stops answering any requests after about 16 to 18 hours, during which 30,000 to 40,000 requests have been handled.

I am using Caddy as a reverse proxy for my Docker system, and I use the Cloudflare certificate for nearly all domains; the ones with LE are managed manually.

The Caddy process is still running, though it takes about a minute to respond to ANY command that has to change Caddy's state.

I don't have error logs that give any information; when this happens, there is simply no log output anymore.

I compiled Caddy myself from the 2.1.1 tag with the Server header removed, but with no other modifications, so I don't see why that would be the cause.

The requests given to Caddy just time out, regardless of whether they hit reverse proxies or file servers. The proxy targets are still alive and still give valid responses.

mholt commented 4 years ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```

### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Helpful tips

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:

```
{ ... }
```

Open terminal and run Caddy:

```
$ caddy ...
```

Make an HTTP request:

```
$ curl ...
```

Notice that the result is ___ but it should be ___.

AtjonTV commented 4 years ago

1. Environment

1a. Operating system and version

Ubuntu 18.04.5 LTS
Linux 4.15.0-112-generic

1b. Caddy version (run caddy version or paste commit SHA)

(Custom compilation with Server header removed)

v2.1.4 h1:h49IjkBhLS3fOLA6HpHvLepinUK9DdeAbm/yHIWUKyA=

1c. Go version (if building Caddy from source; run go version)

Caddy was built on my local Gentoo system; the required shared objects all resolve without issues.

go version go1.14.7 linux/amd64

2. Description

2a. What happens (briefly explain what is wrong)

I have a Caddy 2 reverse proxy server running, and after a random number of hours or requests the Caddy server stops answering, so that all requests time out.

2b. Why it's a bug (if it's not obvious)

2c. Log output

Running it manually in a terminal is hard as the issue is random, but I have now started Caddy inside a screen session to see what happens. I will have to wait until the issue reappears to provide whatever Caddy spits out, if anything.

2d. Workaround(s)

2e. Relevant links

3. Tutorial (minimal steps to reproduce the bug)

I don't have steps to reproduce, but I suppose it's a good idea to post a sanitized version of my Caddyfile (no password hashes, and hiding which services are not behind Cloudflare) in case any issues are with it:

caddy2_sanetized.txt

AtjonTV commented 4 years ago

Sadly, I am unable to just "remove" stuff from the Caddyfile or run only a small subset, as this happens on a production system that is required to be available :(

mholt commented 4 years ago

> (Custom compilation with Server header removed)

Why are you doing that instead of simply removing it with config? What else has changed? It makes it difficult to prove that it's a bug in our code base versus your custom build.

Where are the logs? You should at least have them from your system service, I imagine? Make sure to enable debug mode in your config.
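
In Caddyfile terms, debug mode is the `debug` global option; a minimal sketch:

```
{
	debug
}
```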

We'll need to be able to reproduce the problem if we are to debug it. If not, I can only suggest things you can try to narrow it down, probably involving adding logs or prints.

AtjonTV commented 4 years ago

I wasn't able to remove the header by config for some reason, so I removed it in code. If you tell me how to do it in the Caddyfile, I will move back to the official binary.

I just looked into the screen session and found out what happens, though I don't know why:

(screenshot: Caddy log output showing "too many open files" errors)

The funny thing is that, at first glance, it looks like an upstream issue, but restarting Caddy fixes whatever is broken here.

I got similar messages with "too many open files" when running caddy reload too often after making changes to the Caddyfile.

magiruuvelvet commented 4 years ago

@AtjonTV It looks like the file descriptor limit on your server is painfully low. Try to increase the limit by creating a file at /etc/sysctl.d/99-files.conf with the following contents:

```
# Allowed file descriptors to be open at the same time
fs.file-max = 2097152

# inotify user watches
fs.inotify.max_user_watches = 65536

# inotify user instances
fs.inotify.max_user_instances = 1024
```

Adjust the limits as needed. Then run `sysctl -p /etc/sysctl.d/99-files.conf` to apply the new rules instantly. Changes apply automatically on reboot.

EDIT:

Those are my production server values:

```
fs.file-max=18446744073709551615
fs.inotify.max_user_watches=524288
fs.inotify.max_user_instances=2048
```
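
To see where the system currently stands, you can check the global limit and how many descriptors are actually in use; a quick sketch using standard Linux proc files:

```
# system-wide maximum number of open file handles
cat /proc/sys/fs/file-max

# currently allocated, free, and maximum file handles
cat /proc/sys/fs/file-nr
```
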
AtjonTV commented 4 years ago

I have the Ubuntu 16.04 default of 1,630,270. Also, this makes it seem to me like some files are not closed properly by Caddy?

I couldn't set it to the 64-bit max value you use in prod, as I am on Ubuntu and Ubuntu 16.04 is still mostly 32-bit™, so I now have the 32-bit max positive int minus 2 set.

mholt commented 4 years ago

@AtjonTV Thanks for the logs, we're a little closer now.

> I wasn't able to remove the header by config for some reason, so I removed it in code. If you tell me how to do it in the Caddyfile, I will move back to the official binary.

Using the Caddyfile:

```
header -Server
```
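
For example, inside a site block it would sit alongside your other directives; a minimal sketch (the domain and upstream are placeholders):

```
example.com {
	header -Server
	reverse_proxy 127.0.0.1:8080
}
```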

@magiruuvelvet's suggestions are good; raising that limit will probably solve your problem! We are pretty certain by this point that Caddy does close its resources when it is done with them; something we have seen before, though, is that people's proxy backends sometimes leave connections open for an unnecessarily long time.

> I got similar messages with "too many open files" when running caddy reload too often after making changes to the Caddyfile.

About how many times?

To try to narrow this down, one thing to try might be to disable the access logs (or at least direct them to stdout instead of a file).
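
In Caddyfile terms, sending the access log to stdout instead of a file would look roughly like this; a sketch (the site address and upstream are placeholders):

```
example.com {
	log {
		output stdout
	}
	reverse_proxy 127.0.0.1:8080
}
```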

Also, this is unrelated, but I noticed you have a typo in your Caddyfile: look for `htto.`, which should be `http.`.

(PS. @magiruuvelvet your avatar is best girl. And I'm casually learning Japanese a little bit every day! Are you a native speaker, or how did you learn it?)

mholt commented 4 years ago

Also, note that the file descriptor limits are not necessarily per-process; I think they are scoped to the user or the system (depending on config; maybe if one is higher than the other, I dunno how that works). It could be another process from the same user (or maybe not) that is using the file descriptors! Is there anything else you're running that's doing I/O? (For example, this? Is it closing all the files it opens?)

AtjonTV commented 4 years ago

I have moved to the official Caddy build and confirmed that `header -Server` works.

The issue with `caddy reload` involves the certificate files specified under the tls directive. I would guess it took about 15 to 20 reloads until I got that error.

Caddy is the only application that accesses the Cloudflare certificate files.

mholt commented 4 years ago

> The issue with `caddy reload` involves the certificate files specified under the tls directive.

How do you know that, exactly? Are we talking about the same error ('too many open files')?

> Caddy is the only application that accesses the Cloudflare certificate files.

That doesn't matter; which files are being opened is irrelevant, it's just that something on the system is opening or keeping open too many files/sockets.

In case you didn't see it, check my second reply ^

AtjonTV commented 4 years ago

I know it's about the certificates because the `caddy reload` command said it was about cloudflare.crt, and that it was opened too often. So my guess is that when reloading Caddy from the command line, it doesn't correctly close and reopen TLS certificates.

Though I would guess that with the change to the max file descriptors, this will either take longer to happen or won't happen anymore, but I don't know that yet.

magiruuvelvet commented 4 years ago

fs.file-max is system-global on Linux and applies to all processes at once. File descriptors are closed, though, when the owning process exits or dies, so if you forget to close a file in an application, it is closed at termination (including on SIGKILL).
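
If you want to see how many descriptors a single process is holding at a given moment, something like this works on Linux (a sketch; it assumes a single caddy process so pidof returns one PID):

```
ls /proc/$(pidof caddy)/fd | wc -l
```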

(PS. @mholt my native language is German, I'm teaching myself Japanese and keep getting better)

mholt commented 4 years ago

> I know it's about the certificates because the `caddy reload` command said it was about cloudflare.crt, and that it was opened too often.

Are you sure? What was the actual error message? If we're talking about "too many open files" then that doesn't mean that particular file was opened too many times.

@AtjonTV Is there anything else running on your machine that does I/O?

AtjonTV commented 4 years ago

I am trying to get that error to reoccur, as I forgot to save the error message when it happened last time :(

I have a few Docker containers running .. like 62 :sweat_smile: .. as I said, it's a production machine.

In terms of public I/O, I only have Caddy, SSH, and CIFS (for remote backups, only mounted and used at 00:05 every day), though.

For now, I think it would be best if I just let it run and check occasionally whether something has happened; I am running Caddy in a screen session now, so I have all the stdout. I don't want to, and don't have to, reload Caddy as often now, since I am not making any changes to the routing except for the small spelling fix and the Server header, so I shouldn't be triggering the error I was getting.

May I ask if it is possible to explicitly tell Caddy to use LE for specific domains? I have two that need to use LE as they can't be proxied through Cloudflare, but the Cloudflare certificate is a wildcard that also matches those two domains :(

francislavoie commented 4 years ago

> I have a few Docker containers running .. like 62 :sweat_smile: .. as I said, it's a production machine.

There's your issue then. The file descriptor limit is shared across everything on the machine. You'll probably need to increase the limit to support all those things running.

AtjonTV commented 4 years ago

As I wrote earlier, I have set it to 2,147,483,645 now, which is way more than the original Ubuntu value of 1,630,270.

I would guess that the HTTP response issue is fixed now; I will only know for sure if it doesn't break within the next 24 hours.

For the certificate issue, I will wait for it to happen again and open a separate ticket with as much information as possible, as it only happened randomly when running `caddy reload` too often, and it doesn't seem to be closely related to this issue here.

mholt commented 4 years ago

Okay, sounds like a plan. Thanks.

> May I ask if it is possible to explicitly tell Caddy to use LE for specific domains?

Definitely; that is Caddy's implicit default. Just don't specify your own certificate for those sites, and Caddy will use LE certs.

AtjonTV commented 4 years ago

It does not seem like that .. I removed the TLS entry for matrix and it still uses the Cloudflare certificate, which I haven't specified for that site.

This is the Caddyfile:

```
matrix.atvg-studios.com {
    import default_config
    import unavailable
    reverse_proxy 172.30.1.101:80
    reverse_proxy /_matrix/* 172.30.1.102:8008
}

matrix.atvg-studios.com:8448 {
    import default_config
    import unavailable
    reverse_proxy 172.30.1.102:8008
}
```

If you look at the one I posted earlier, you can see that for using Cloudflare I have a snippet named cloudflare, and I am not importing it for matrix; still, it uses the Cloudflare certificate, as seen in the log:

(screenshot: log output showing the Cloudflare certificate being served for matrix.atvg-studios.com)

And it's actually live, too:

(screenshot: the Cloudflare certificate served on the live site)

mholt commented 4 years ago

If Caddy sees a certificate already loaded for a domain name, it won't duplicate that and also manage a separate certificate for the domain. (If so, then which one would it serve? It would have to choose between them, but either one is equally qualified. The Caddyfile maps manually-loaded certs to each site, but for the other one, it is ambiguous.)

You can override this behavior using the "automatic_https" property of the JSON config: https://caddyserver.com/docs/json/apps/http/servers/automatic_https/ - specifically, the "ignore_loaded_certificates" setting: https://caddyserver.com/docs/json/apps/http/servers/automatic_https/#ignore_loaded_certificates

> By default, automatic HTTPS will obtain and renew certificates for qualifying hostnames. However, if a certificate with a matching SAN is already loaded into the cache, certificate management will not be enabled. To force automated certificate management regardless of loaded certificates, set this to true.
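
Trimmed down, the relevant piece of the JSON would look roughly like this (a sketch; the server name "srv0" and the listener are placeholders, and the rest of your adapted config stays as-is):

```
{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "automatic_https": {
            "ignore_loaded_certificates": true
          }
        }
      }
    }
  }
}
```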

You can use caddy adapt to turn your Caddyfile into JSON.
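
For example, assuming the Caddyfile is in the current directory:

```
$ caddy adapt --config Caddyfile --pretty
```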

AtjonTV commented 4 years ago

As I wrote in another issue, I have a 300-line Caddyfile that turns into a 2,000-line JSON mess, which I don't want to administer manually. So sadly, that is not an option for me :(

mholt commented 4 years ago

And as I said before, if you want full control over your server, you can use the JSON. The Caddyfile isn't powerful enough to express all the flexibility of the JSON.

Or write a proposal and implement what you want.

AtjonTV commented 4 years ago

After changing the max file descriptor limit to the maximum possible for my server, it is now working without issues. Thus, I am going to close this issue.

Thanks for all the help!