envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.92k stars 4.8k forks source link

Figure out a minimal healthcheck pattern #31619

Open phlax opened 9 months ago

phlax commented 9 months ago

Some of our examples use a healthcheck which relies on the admin endpoint, and uses curl to check readiness

Ideally we want a really minimal healthcheck that requires little/no container/system tools to check, that we could include in the published images - adding curl/nc in the distroless container is not trivial/desirable

One way of doing this would be if there were a filesystem file that could be checked for. This would definitely be the ideal

Another possibility - which would still require admin and a util - would be to expose admin only on a unix socket and connect to that with socat

I have a current need for something like this - but there is also a recent ticket related to this - #31614

phlax commented 9 months ago

cc @bruegth

phlax commented 9 months ago

the other minimal option would be sending a signal i guess

phlax commented 9 months ago

thinking about this further and realizing that signals wont help as they are one way

bruegth commented 9 months ago

I'm using the file based http healthcheck on other distroless images too, currently without any problems.

dockerfile: HEALTHCHECK CMD ["/docker-healthcheck.sh"]

docker-healthcheck.sh.txt

But this requires /bin/bash as shell and it depends that envoy admin endpoint is unsecure (not using https)

phlax commented 9 months ago

nice, that gets rid of the need for adding a tool other than bash (defo better than my socket suggestion)

i guess the next question is a reliable readiness check

when i looked before the admin endpoint was the most reliable i could see but its generally not desirable to expose admin unless you need it

there is also some question about its reliability for health checking - cc @ravenblackx

i guess there is a more general question about what readiness means in the Envoy context (putting aside healthiness) - eg does it mean the server is running, or specific endpoints are listening

phlax commented 9 months ago

@bruegth if you were up for PRing to replace the curl/healthcheck in the examples with your one i think that would be a good advance on what we have (happy to follow up myself if you prefer)

phlax commented 9 months ago

based on your suggestion and a bit of experimentaion - the best ive got so far is this for distroless

FROM debian:bookworm-slim as debian-static
RUN apt-get -qq update -y \
    && apt-get install bash-static

FROM scratch as binaries
COPY --from=busybox /bin/cat /bin/cat
COPY --from=busybox /bin/echo /bin/echo
COPY --from=busybox /bin/sh /bin/sh
COPY --from=busybox /bin/timeout /bin/timeout
COPY --from=busybox /bin/grep /bin/grep
COPY --from=debian-static /usr/bin/bash-static /usr/bin/bash

FROM envoyproxy/envoy:distroless-dev
COPY --from=binaries \
    /bin/cat \
    /bin/echo \
    /bin/grep \
    /bin/sh \
    /bin/timeout \
    /bin/
COPY --from=binaries \
    /usr/bin/bash \
    /usr/bin/
ENV ENVOY_ADMIN_PORT="${ENVOY_ADMIN_PORT:-9901}" \
    HEALTHCHECK_PORT="${HEALTHCHECK_PORT:-${ENVOY_ADMIN_PORT}}"
COPY --chmod=755 <<EOF /usr/local/bin/healthcheck
#!/usr/bin/bash

HC_HOST="localhost"
HC_PORT="\${HEALTHCHECK_PORT:-9901}"
HC_PATH="/server_info"
HC_EXPECTED="LIVE"
HC_ENDPOINT="/dev/tcp/\${HC_HOST}/\${HC_PORT}"

exec 3<>"\$HC_ENDPOINT"
echo -e "GET "\${HC_PATH}" HTTP/1.1\\nhost: \${HC_HOST}:\${HC_PORT}\\n" >&3
timeout 1 cat <&3 | grep -m 1 state | grep -m 1 "\$HC_EXPECTED"
ERROR=\$?
exec 3<&-
exec 3>&-
exit \$ERROR
EOF

HEALTHCHECK \
    --interval=1s \
    --timeout=2s \
    --start-period=1s \
    --retries=3 \
    CMD /usr/local/bin/healthcheck

for distroless it would need these bins added , but the ubuntu container should be able to just use the hc directly

i think we dont want to put anyting in the distroless image itself so probs we should add another image target :distroless-hc or similar

bruegth commented 9 months ago

Your solution looks good to me.

@bruegth if you were up for PRing to replace the curl/healthcheck in the examples with your one i think that would be a good advance on what we have (happy to follow up myself if you prefer)

Sorry, but at moment I'm too busy to create an PR.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.