TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0

Stability issues with signal events under SmartOS/LX #562

Closed jwreagor closed 6 years ago

jwreagor commented 6 years ago

A Joyent Triton customer reported stability issues when running the latest ContainerPilot under LX-branded zones in Triton's Docker environment. The symptom is that containers running specific versions of ContainerPilot eventually abort with SIGABRT and drop a core file. The timing of this event and the state of the process tree inside each container vary widely.

By slowly peeling away at the problem, we were able to narrow the issue down to the signal events feature (#513) of ContainerPilot 3.6.0.

We have numerous test cases that exercise this feature under normal Docker, so we can say with near certainty that the feature itself isn't the problem there.

Also, private core dumps exist that might help debug the signal processing of PID 1 inside these containers.

I'll want to come up with an example container to pass along to the Docker and OS teams so they can attempt to reproduce the problem. We'll want to do this before pulling the feature or providing some sort of OS/feature toggle (under LX), neither of which would be an optimal fix.

Internal ticket is PRODSUP-23.

jwreagor commented 6 years ago

Hoping to include this fix in #559.

jwreagor commented 6 years ago

The customer mentioned that he's still able to crash ContainerPilot even with #513 removed. Going to close this until we understand the problem more clearly or can reproduce it.