TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Container Pilot process hangs and cannot recover when health check timeouts continue for more than an hour #590

Open kapilraju opened 3 years ago

kapilraju commented 3 years ago

We recently hit an issue in a container running two Container Pilot jobs: one starts a Spring Boot Java process and the other an NGINX process. Each job has its own health check endpoint, configured as -

            health: {
                exec: "/usr/bin/curl --fail -s -o <HEALTH CHECK ENDPOINTS>",
                interval: 10,
                ttl: 25,
                timeout: "30s"
            },

The design is as follows: the container starts with port 443 mapped; inside the container, NGINX listens on 443 and forwards requests to the Spring Boot Java process.

During a database outage, a badly written Spring Boot health check endpoint stopped returning responses or responded with very high latency, so Container Pilot logged "timeout after 30s" for the Spring Boot health check.

The puzzling part is that if this situation continues (i.e. Spring Boot has not recovered) for around 1 hour 7 minutes (this is consistent behaviour with Container Pilot), Container Pilot starts logging "timeout after 30s" for the NGINX process as well. The NGINX process has nothing to do with the database, and its health check endpoint doesn't talk to any other process.

At this point, if you log in to the container and curl both endpoints, the NGINX health check returns fine and the Spring Boot health check also returns (in our case after about 30 seconds, due to the underlying database issue).
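For anyone who wants to repeat that manual check and record how long each endpoint takes, here is a minimal sketch in Python. It is not part of Container Pilot. The `probe` helper is hypothetical, and it assumes the health checks are plain HTTP(S) URLs passed on the command line. It only roughly mirrors `curl --fail`: a 4xx/5xx status, a timeout, or a connection error counts as unhealthy.

```python
import sys
import time
import urllib.error
import urllib.request


def probe(url, timeout=30.0):
    """Fetch url once and return (healthy, elapsed_seconds, detail).

    Hypothetical helper: a 2xx response counts as healthy; HTTP errors,
    timeouts and connection failures count as unhealthy.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            healthy = 200 <= resp.status < 300
            detail = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx responses.
        healthy, detail = False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, socket timeouts and refused connections.
        healthy, detail = False, f"error: {exc}"
    return healthy, time.monotonic() - start, detail


if __name__ == "__main__":
    # Usage: python3 probe.py <nginx-health-url> <springboot-health-url>
    for url in sys.argv[1:]:
        ok, elapsed, detail = probe(url)
        status = "ok" if ok else "FAILED"
        print(f"{url}: {status} ({detail}) in {elapsed:.1f}s")
```

Running this from inside the container while Container Pilot is logging "timeout after 30s" for both jobs shows whether the endpoints themselves still answer within the configured timeout.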

From this point on, even after the database is back to normal and Spring Boot is healthy, Container Pilot stays in this hung state and cannot recover without a restart, which means the container is never registered with Consul even though it is healthy.

Steps to reproduce -