Admin Endpoint constantly crashing

chris-windscribe commented 2 years ago

Hoping someone can shed some light on this. We have Caddy deployed in a reverse proxy configuration with the admin endpoint exposed to a localhost. We have a script which hits the admin endpoint every 10 minutes to update a list of permitted IP addresses that it retrieves from an external API.

We are observing that every few hours the admin endpoint in Caddy will stop responding, any attempted to curl localhost:2019 will result in timeouts and log files indicate Caddy is experiencing an 'io timeout,' When the issue occurs we observe the following in our log:

Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1752336,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48012","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["9300"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1816225,"logger":"admin","msg":"admin endpoint started","address":"tcp/127.0.0.1:2019","enforce_origin":false,"origins":["//127.0.0.1:2019","//localhost:2019","//[::1]:2019"]}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1836116,"logger":"http","msg":"skipping automatic certificate management because one or more matching certificates are already loaded","domain":"pki.int.windscribe.com","server_name":"srv0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.184285,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1869717,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000153ab0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1875331,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/1/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48014","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["180"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1887658,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 5ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.194842,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 10ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2056587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 20ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2267587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 40ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.267665,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 80ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.348986,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 160ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.5105457,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 320ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.8328366,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 640ms"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.249503,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc0000eed20"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.251662,"msg":"config is unchanged"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.27559,"logger":"admin","msg":"stopped previous server","address":"tcp/127.0.0.1:2019"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.4745073,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug  8 17:50:03 pki-do caddy[174943]: {"level":"info","ts":1659995403.4750612,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug  8 17:50:04 pki-do caddy[174943]: {"level":"info","ts":1659995404.4761431,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}

At the moment our solution is to simply restart caddy and I've implemented a cron job to do this every hour, but it seems like a horribly hacky solution. Anyone experienced this before or have any ideas?

mholt commented 2 years ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```

### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Instructions -- please heed otherwise we cannot help you (help us help you!)

Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.
Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.
- 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
- 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
- 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:
- Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
- Do not run Caddy inside containers.
- Run Caddy manually in your terminal; do not use systemd or other init systems.
- If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
- Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
- Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file: ``` { ... } ``` Open terminal and run Caddy: ``` $ caddy ... ``` Make an HTTP request: ``` $ curl ... ``` Notice that the result is ___ but it should be ___.

chris-windscribe commented 2 years ago

Hello,

Thanks for the response, here's the template filled out to the best of my ability

1. Environment

1a. Operating system and version

Ubuntu 20.04.4 LTS \n \l

1b. Caddy version (run `caddy version` or paste commit SHA)

v2.5.2 h1:eCJdLyEyAGzuQTa5Mh3gETnYWDClo1LjtQm2q9RNZrs=

1c. Go version (if building Caddy from source; run `go version`)

N/A Installed from apt

2. Description

2a. What happens (briefly explain what is wrong)

Admin endpoint stops responding after an indeterminate amount of time. Admin endpoint is being hit by a script every 10 minutes which populates an allow-list of IP addresses

2b. Why it's a bug (if it's not obvious)

2c. Log output

Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1752336,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48012","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["9300"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1816225,"logger":"admin","msg":"admin endpoint started","address":"tcp/127.0.0.1:2019","enforce_origin":false,"origins":["//127.0.0.1:2019","//localhost:2019","//[::1]:2019"]}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1836116,"logger":"http","msg":"skipping automatic certificate management because one or more matching certificates are already loaded","domain":"pki.int.windscribe.com","server_name":"srv0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.184285,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1869717,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000153ab0"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1875331,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/1/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48014","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["180"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1887658,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 5ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.194842,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 10ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2056587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 20ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2267587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 40ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.267665,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 80ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.348986,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 160ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.5105457,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 320ms"}
Aug  8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.8328366,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 640ms"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.249503,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc0000eed20"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.251662,"msg":"config is unchanged"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.27559,"logger":"admin","msg":"stopped previous server","address":"tcp/127.0.0.1:2019"}
Aug  8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.4745073,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug  8 17:50:03 pki-do caddy[174943]: {"level":"info","ts":1659995403.4750612,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug  8 17:50:04 pki-do caddy[174943]: {"level":"info","ts":1659995404.4761431,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}

2d. Workaround(s)

Restart Caddy

2e. Relevant links

3. Tutorial (minimal steps to reproduce the bug)

Install Caddy via apt
Use the following Caddyfile (/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile)
```
{
order error after reverse_proxy
}
```

pki.int.windscribe.com:8200 {

tls /etc/ssl/pki.int.windscribe.com/fullchain.pem /etc/ssl/pki.int.windscribe.com/privkey.pem @nodes_prd { path "/v1/Windscribe Core CA X2/issue/node_server_selfmanaged_prd" remote_ip 149.56.10.85 }

@nodes_stg { path "/v1/Windscribe Core CA X2/issue/node_server_selfmanaged_stg" remote_ip 149.56.10.85 }

reverse_proxy @nodes_prd https://pki.int.windscribe.com:8200 reverse_proxy @nodes_stg https://pki.int.windscribe.com:8200 reverse_proxy /v1/auth/approle/login https://pki.int.windscribe.com:8200

error 403 { message "Access is Denied" }

}

3. Run the following script every 10 minutes (eventually the admin endpoint will hang)

curl --fail -X PATCH -d ["192.168.0.1","192.168.0.2","192.168.0.3"] -H "Content-Type: application/json" http://127.0.0.1:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/1/match/0/remote_ip/ranges/

mholt commented 2 years ago

Great, thanks for the info. If you run Caddy built from the reuseport branch, does it still have the issue? See: https://github.com/caddyserver/caddy/pull/4705

This isn't the first time I've seen this error but I was certain I'd fixed it. And what I've observed hasn't been nearly so consistent or frequent as you say it is. Anyway, my backup plan is that PR. I have one other company testing that as we speak.

chris-windscribe commented 2 years ago

Sure.

I've gone ahead and compiled a binary from that branch and have it running in our environment. Will advise how it looks after a couple of days, should know pretty quickly if there is still an issue.

mholt commented 2 years ago

@chris-windscribe Awesome thanks, keep me posted! I'm very interested in this.

chris-windscribe commented 2 years ago

@mholt I haven't seen the issue re-occur in the past week (previously we would be unable to go more then 48 hours without the issue), so this is looking quite promising!

Any idea when the changes will be merged into Master?

mholt commented 2 years ago

@chris-windscribe That's great news! Thanks for verifying. That's the first production results I've heard of so far; I'm waiting on one other company to verify as well. I'm hoping to merge within a couple weeks at this point.

chris-windscribe commented 2 years ago

Thanks for the update! Will advise if we have any issues in the meantime.

JeDaYoshi commented 2 years ago

Can also comment that I've noticed the admin endpoint crash issue after 6 hours-2 days of uptime.

However, since applying the reuseport branch's patches, it has not happened so far.

mholt commented 2 years ago

@JeDaYoshi That's great news, thanks for the information! Will likely merge this very soon.

JeDaYoshi commented 2 years ago

An update - it was working well so far, however, it just crashed a few minutes ago in my case despite having the patch :(

I compiled from commit 4c282e86dae16359370f410e29dc96dc5aeeb449 + manually merging the branch in, using Go v1.19 on Linux (Fedora 36).

mholt commented 2 years ago

@JeDaYoshi What are your full logs? And can you paste the result of caddy version?

This error should be impossible since we aren't even using any of the same code for reloads with regards to sockets on that branch.

JeDaYoshi commented 2 years ago

I checked again and looks like it was a mistake on my alarms. I apologise - it is actually still working!

mholt commented 2 years ago

Whew. Don't give me a heart attack like that! :laughing: Thanks for confirming quickly!

mholt commented 2 years ago

Should be resolved now that #4705 is merged

caddyserver / caddy