caddyserver / caddy-docker

Source for the official Caddy v2 Docker Image
https://hub.docker.com/_/caddy
Apache License 2.0

Caddy sometimes locks up when running inside docker #374

Open deltamualpha opened 1 day ago

deltamualpha commented 1 day ago

I have a problem where my Caddy (version 2.8.4), running inside the official docker image, locks up and hangs on all network input, refuses to respond to kill signals from docker compose, and basically turns into a useless lump. There's no runaway memory or CPU usage or anything.

What logs or metrics can I pull to help debug what's going on here?

francislavoie commented 1 day ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to us what is going on, so we'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! It will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍

I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```

### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Helpful tips

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:

```
{ ... }
```

Open terminal and run Caddy:

```
$ caddy ...
```

Make an HTTP request:

```
$ curl ...
```

Notice that the result is ___ but it should be ___.

deltamualpha commented 1 day ago

Thanks. This is Ubuntu 22.04.5, fully patched, running in AWS. The trigger for this behavior is random, and getting telemetry from the broken container is more or less impossible once it's started.

I'm digging in some more, and the container:

  1. cannot be inspected using `docker inspect` (hangs)
  2. makes `docker compose ps` hang (presumably because it's trying to run `docker inspect` under the hood)
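Since a wedged container makes `docker inspect` block indefinitely, one way to detect the state without hanging a shell is to run the command under a timeout. The following is only a minimal sketch: the `probe` and `inspect_container` helpers and the 10-second limit are hypothetical choices, and it assumes the `docker` CLI is on the `PATH`.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome without blocking longer than timeout_s.

    Returns "ok" (exit code 0), "failed" (non-zero exit),
    "timeout" (still running at the deadline, then killed),
    or "missing" (executable not found).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


def inspect_container(name, timeout_s=10.0):
    """Probe `docker inspect <name>`; a "timeout" result suggests a wedged container."""
    return probe(["docker", "inspect", name], timeout_s)
```

Calling `inspect_container("caddy")` (the container name here is a placeholder) from a cron job or monitoring check would return `"timeout"` instead of hanging, which at least gives a timestamp for when the container stopped responding.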

Running `ps aux | grep caddy` turns up the other two caddy processes I have running in containers, but not the wedged one, which makes me think that it somehow exited, but the container is still alive. From the Dockerfiles in this repo, it looks like there isn't a process supervisor or anything in the Dockerfile that would be running separately from caddy itself, so I'm not sure how Docker is keeping the process alive if Caddy is dead...

In the end, the only fix is to restart the entire Docker daemon or reboot the server, both of which are far more disruptive than I'd like.

deltamualpha commented 1 day ago

Ah, and a correction to what I said about "no runaway memory usage":

[Image: chart of container memory usage over time]

Each of those domed spikes correlates with Caddy going unavailable by the end. (You can see what a steady-state container looks like to the left of the chart.)

mholt commented 1 day ago

Is that the Caddy process blowing up specifically?

If possible, please provide a profile: https://caddyserver.com/docs/profiling

A heap and goroutine profile would both be most useful!

deltamualpha commented 1 day ago

Next time it happens, I'll see if I can get it to spit one out, but as I said, once it gets into this state I'm not sure there's a process left to profile.

jjlin commented 1 day ago

If `docker inspect` isn't even working, that sounds like more of a Docker or OS issue than a Caddy issue. You didn't mention which version of Docker you're running, but you could try upgrading that to the latest.

mholt commented 1 day ago

Some users just poll those endpoints every few seconds to grab a quick profile; then, if the process runs into a problem, they can look at the last profile that was captured.
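A rough sketch of that polling approach, for illustration only. It assumes Caddy's admin endpoint is reachable from wherever the script runs at the default `http://localhost:2019` (inside a container, that usually means running the script in the same network namespace or exposing the admin address deliberately). The `profiles` output directory, the 10-second interval, and the retention count are hypothetical choices.

```python
import http.client
import time
import urllib.request
from pathlib import Path

ADMIN = "http://localhost:2019"  # assumption: default admin address, reachable from here
PROFILES = ("heap", "goroutine")  # the two profiles requested in this thread
KEEP = 20  # hypothetical retention: newest snapshots kept per profile


def profile_url(admin, name):
    """Build the pprof URL for one profile under the admin endpoint."""
    return f"{admin.rstrip('/')}/debug/pprof/{name}"


def prune(out_dir, name, keep):
    """Delete all but the newest `keep` snapshots of a profile; return removed paths."""
    files = sorted(out_dir.glob(f"{name}-*.pprof"))  # timestamped names sort by age
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        path.unlink()
    return stale


def snapshot(admin, out_dir, timeout_s=5.0, keep=KEEP):
    """Fetch each profile once; return the paths that were written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    saved = []
    for name in PROFILES:
        try:
            with urllib.request.urlopen(profile_url(admin, name), timeout=timeout_s) as resp:
                data = resp.read()
        except (OSError, http.client.HTTPException) as err:
            # A failed or timed-out fetch is itself a useful signal; log it and move on.
            print(f"{stamp} {name}: {err}")
            continue
        path = out_dir / f"{name}-{stamp}.pprof"
        path.write_bytes(data)
        saved.append(path)
        prune(out_dir, name, keep)
    return saved


def poll(admin=ADMIN, out_dir=Path("profiles"), interval_s=10.0):
    """Loop forever, taking one snapshot per interval."""
    while True:
        snapshot(admin, out_dir)
        time.sleep(interval_s)
```

Running `poll()` in the background leaves the most recent snapshots on disk, so the last files written before a lockup can be opened afterwards with `go tool pprof`.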