Closed chris-windscribe closed 2 years ago
Thanks for opening an issue! We'll look into this.
It's not immediately clear to me what is going on, so I'll need your help to understand it better.
Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.
I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.
This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!
Thank you again for your report, we look forward to resolving it!
## 1. Environment
### 1a. Operating system and version
```
paste here
```
### 1b. Caddy version (run `caddy version` or paste commit SHA)
```
paste here
```
### 1c. Go version (if building Caddy from source; run `go version`)
```
paste here
```
## 2. Description
### 2a. What happens (briefly explain what is wrong)
### 2b. Why it's a bug (if it's not obvious)
### 2c. Log output
```
paste terminal output or logs here
```
### 2d. Workaround(s)
### 2e. Relevant links
## 3. Tutorial (minimal steps to reproduce the bug)
Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.
Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.
Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:
curl
.Example of a tutorial:
Create a config file: ``` { ... } ``` Open terminal and run Caddy: ``` $ caddy ... ``` Make an HTTP request: ``` $ curl ... ``` Notice that the result is ___ but it should be ___.
Hello,
Thanks for the response, here's the template filled out to the best of my ability
Ubuntu 20.04.4 LTS \n \l
caddy version
or paste commit SHA)v2.5.2 h1:eCJdLyEyAGzuQTa5Mh3gETnYWDClo1LjtQm2q9RNZrs=
go version
)N/A Installed from apt
Admin endpoint stops responding after an indeterminate amount of time. Admin endpoint is being hit by a script every 10 minutes which populates an allow-list of IP addresses
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1752336,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/0/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48012","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["9300"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1816225,"logger":"admin","msg":"admin endpoint started","address":"tcp/127.0.0.1:2019","enforce_origin":false,"origins":["//127.0.0.1:2019","//localhost:2019","//[::1]:2019"]}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1836116,"logger":"http","msg":"skipping automatic certificate management because one or more matching certificates are already loaded","domain":"pki.int.windscribe.com","server_name":"srv0"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.184285,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1869717,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000153ab0"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1875331,"logger":"admin.api","msg":"received request","method":"PATCH","host":"127.0.0.1:2019","uri":"/config/apps/http/servers/srv0/routes/0/handle/0/routes/1/match/0/remote_ip/ranges/","remote_ip":"127.0.0.1","remote_port":"48014","headers":{"Accept":["*/*"],"Connection":["close"],"Content-Length":["180"],"Content-Type":["application/json"],"User-Agent":["curl/7.68.0"]}}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.1887658,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 5ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.194842,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 10ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2056587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 20ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.2267587,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 40ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.267665,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 80ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.348986,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 160ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.5105457,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 320ms"}
Aug 8 17:50:01 pki-do caddy[174943]: {"level":"info","ts":1659995401.8328366,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 640ms"}
Aug 8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.249503,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc0000eed20"}
Aug 8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.251662,"msg":"config is unchanged"}
Aug 8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.27559,"logger":"admin","msg":"stopped previous server","address":"tcp/127.0.0.1:2019"}
Aug 8 17:50:02 pki-do caddy[174943]: {"level":"info","ts":1659995402.4745073,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug 8 17:50:03 pki-do caddy[174943]: {"level":"info","ts":1659995403.4750612,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Aug 8 17:50:04 pki-do caddy[174943]: {"level":"info","ts":1659995404.4761431,"msg":"http: Accept error: accept tcp 127.0.0.1:2019: i/o timeout; retrying in 1s"}
Restart Caddy
/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
)
{
order error after reverse_proxy
}
pki.int.windscribe.com:8200 {
tls /etc/ssl/pki.int.windscribe.com/fullchain.pem /etc/ssl/pki.int.windscribe.com/privkey.pem @nodes_prd { path "/v1/Windscribe Core CA X2/issue/node_server_selfmanaged_prd" remote_ip 149.56.10.85 }
@nodes_stg { path "/v1/Windscribe Core CA X2/issue/node_server_selfmanaged_stg" remote_ip 149.56.10.85 }
reverse_proxy @nodes_prd https://pki.int.windscribe.com:8200 reverse_proxy @nodes_stg https://pki.int.windscribe.com:8200 reverse_proxy /v1/auth/approle/login https://pki.int.windscribe.com:8200
error 403 { message "Access is Denied" }
}
3. Run the following script every 10 minutes (eventually the admin endpoint will hang)
curl --fail -X PATCH -d ["192.168.0.1","192.168.0.2","192.168.0.3"] -H "Content-Type: application/json" http://127.0.0.1:2019/config/apps/http/servers/srv0/routes/0/handle/0/routes/1/match/0/remote_ip/ranges/
Great, thanks for the info. If you run Caddy built from the reuseport
branch, does it still have the issue? See: https://github.com/caddyserver/caddy/pull/4705
This isn't the first time I've seen this error but I was certain I'd fixed it. And what I've observed hasn't been nearly so consistent or frequent as you say it is. Anyway, my backup plan is that PR. I have one other company testing that as we speak.
Sure.
I've gone ahead and compiled a binary from that branch and have it running in our environment. Will advise how it looks after a couple of days, should know pretty quickly if there is still an issue.
@chris-windscribe Awesome thanks, keep me posted! I'm very interested in this.
@mholt I haven't seen the issue re-occur in the past week (previously we would be unable to go more then 48 hours without the issue), so this is looking quite promising!
Any idea when the changes will be merged into Master?
@chris-windscribe That's great news! Thanks for verifying. That's the first production results I've heard of so far; I'm waiting on one other company to verify as well. I'm hoping to merge within a couple weeks at this point.
Thanks for the update! Will advise if we have any issues in the meantime.
Can also comment that I've noticed the admin endpoint crash issue after 6 hours-2 days of uptime.
However, since applying the reuseport
branch's patches, it has not happened so far.
@JeDaYoshi That's great news, thanks for the information! Will likely merge this very soon.
An update - it was working well so far, however, it just crashed a few minutes ago in my case despite having the patch :(
I compiled from commit 4c282e86dae16359370f410e29dc96dc5aeeb449
+ manually merging the branch in, using Go v1.19 on Linux (Fedora 36).
@JeDaYoshi What are your full logs? And can you paste the result of caddy version
?
This error should be impossible since we aren't even using any of the same code for reloads with regards to sockets on that branch.
I checked again and looks like it was a mistake on my alarms. I apologise - it is actually still working!
Whew. Don't give me a heart attack like that! :laughing: Thanks for confirming quickly!
Should be resolved now that #4705 is merged
Hoping someone can shed some light on this. We have Caddy deployed in a reverse proxy configuration with the admin endpoint exposed to a localhost. We have a script which hits the admin endpoint every 10 minutes to update a list of permitted IP addresses that it retrieves from an external API.
We are observing that every few hours the admin endpoint in Caddy will stop responding, any attempted to curl
localhost:2019
will result in timeouts and log files indicate Caddy is experiencing an 'io timeout,' When the issue occurs we observe the following in our log:At the moment our solution is to simply restart caddy and I've implemented a cron job to do this every hour, but it seems like a horribly hacky solution. Anyone experienced this before or have any ideas?