abiosoft / caddy-docker

Docker container for Caddy
https://hub.docker.com/r/abiosoft/caddy/
MIT License
768 stars 315 forks

Problem when using a restart policy if Caddy errors out #65

Open francislavoie opened 7 years ago

francislavoie commented 7 years ago

I want to use restart: unless-stopped on my containers, but I realized that doing so with Caddy would be a bad idea on a first deploy where DNS/ports/etc. are incorrect, because automatic TLS will fail. If Caddy exits because of TLS errors, the restart policy starts it right back up, and a rate limit will quickly be hit.

I'm not totally sure how to fix this (which is why I'm opening an issue; maybe you have a good idea for this @abiosoft), but here's my theory: I think we should have a bash script as the entrypoint, which writes to a file when Caddy exits with an error, before exiting itself. If the container boots back up, the entrypoint script should check for the existence of that file and bail out early, to prevent Caddy from trying TLS again and likely failing again.

Basically, the idea is to add an extra layer on top of the exponential back-off that Docker's restart policies already have, to be safer from hitting rate limits.

I don't think Caddy currently has different exit codes for different categories of errors; having it return codes other than just 1 might help differentiate for something like this.

Does this make sense?

mholt commented 7 years ago

We could have Caddy return a different exit code for pre-start errors.

francislavoie commented 7 years ago

That would be a good start. :smile:

mholt commented 7 years ago

Anyone know how to change the exit code of log.Fatal? :confused:

mholt commented 7 years ago

Eh, never mind: quitting before the servers are fully started is the most common case, I think (especially for unexpected stops, when the exit code is of interest), so I'll leave that as exit code 1. The only other place I can find where Caddy exits (either with log.Fatal* or os.Exit) is when processing signals. I can change those to values > 1 easily enough. Will that help?

francislavoie commented 6 years ago

I'm finally getting around to trying to implement this. This is my current revision of the docker-entrypoint.sh script.

#!/bin/ash

# The file that tracks if this container has been started with errors
errorfile="/root/.caddy/haserror"

# Implement a reset command to remove the errorfile
if [ "$1" = 'reset' ]; then
    rm -f "$errorfile"
    exit 0
fi

# Caddy is being run normally
if [ "$1" = '/usr/bin/caddy' ] || [ "$1" = 'caddy' ]; then

    # Quit early if the error file exists
    # We do this so that we don't hit Let's Encrypt rate limits
    # if we have startup errors
    if [ -e "$errorfile" ]; then
        echo "Exiting, we previously ran with an error."
        exit 1
    fi

    # Run caddy
    "$@"

    # Get caddy exit code
    cmdoutput="$?"

    # Error 1 is a startup error, other exit codes don't concern us
    # See https://caddyserver.com/docs/cli#exit-codes
    if [ "$cmdoutput" -eq 1 ]; then
        echo "Exited with error $cmdoutput."
        touch "$errorfile"
    fi

    # Exit with the same status code as caddy did
    exit "$cmdoutput"
fi

# If the command is not caddy, we'll just run it normally
exec "$@"

So when this is run, the idea is that it will keep restarting if a restart policy is specified with exponential backoff (I think 1.5s -> 3s -> 6s -> 12s and so on), but this script will make all subsequent restarts exit early until the user gets rid of the error file. I set it up with a built-in reset command that removes the file, used like this:

docker-compose run --rm --no-deps caddy reset

I have the /root/.caddy folder as a volume shared between containers, so running a one-off instance has access to the same volume and can remove that file.
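For context, a minimal docker-compose setup matching this approach might look something like the sketch below (the service name, ports, and volume name here are my assumptions, not from this repo):

```yaml
version: "3"

services:
  caddy:
    image: abiosoft/caddy
    restart: unless-stopped  # restarts with Docker's exponential backoff
    ports:
      - "80:80"
      - "443:443"
    volumes:
      # Shared volume so the error file survives restarts and is also
      # visible to one-off `docker-compose run` containers (for `reset`)
      - caddy-data:/root/.caddy

volumes:
  caddy-data:
```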

I'm considering adding some logic that checks the file modification time with stat -c %Y $errorfile to see whether it's been there for more than something like 30 minutes, after which it should attempt to start normally again: long enough that it won't hit the rate limit anymore, but short enough that it stops being an encumbrance to users who come back to it later. I'm not sure whether it's actually beneficial or not.
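A check along those lines might look roughly like this (just a sketch; the 30-minute threshold and the default path are my placeholders):

```shell
#!/bin/sh
# Sketch: decide whether a previous-error file should still block startup.
# Prints "blocked" if the file exists and is recent, "ok" otherwise.

errorfile="${1:-/root/.caddy/haserror}"
maxage=1800   # 30 minutes, in seconds

if [ -e "$errorfile" ]; then
    now=$(date +%s)
    mtime=$(stat -c %Y "$errorfile")   # file modification time (epoch seconds)
    if [ $((now - mtime)) -ge "$maxage" ]; then
        # Error file is stale: clear it and allow a normal startup attempt
        rm -f "$errorfile"
        echo "ok"
    else
        echo "blocked"
    fi
else
    echo "ok"
fi
```

The nice part of this variant is that the `reset` command stays useful for clearing the file immediately, while an old error file eventually expires on its own.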

This entire thing is a stupid hack for something that I think should be supported by Docker anyway.