vlast3k commented 7 months ago

Issue

The containerMetadataFileCheckTimeout is hard-coded to 600, and there is no way to modify it in the drain script where the shutdown is called here. In cases where the "healtcheck-timeout", "rep.evacuation_timeout_in_seconds" and "graceful-shutdown-time" are large enough, on the 600th second, silk shuts down the network, and containers are not able to properly complete their tasks.

Context

In our environments we set the "healtcheck-timeout" to 600 secs, and the "rep.evacuation_timeout_in_seconds" to 720 seconds.

Steps to Reproduce

Create a an app with 10 minute healtcheck timeout, do it in a way that it uses all the 10 minutes. Set the "rep.evacuation_timeout_in_seconds" to e.g. 720 mins And then on SIGTERM, try to access network resources from the app. It will fail

Expected result

The containers are able to use the network until they are properly terminated

Current result

Containers may not have network when they need to do the gracefull shutdown

Possible Fix

One option would be to do some calculations, based on links between bosh releases, and set the containerMetadataFileCheckTimeout to be equal to the "rep.evacuation_timeout_in_seconds" + "graceful_shutdown_time" + some buffer. We've already tried it for another property - the number of containers in rep, and in garden and it did not workout quite well, so we ended up with two properties.

So the proposal here would be to add one additional property in the silk-release called" containerMetadataFileCheckTimeout, which would be added to the command here , only if it is set in the descriptor

Additional Context

klapkov commented 7 months ago

Added a PR to make the timeout configurable: https://github.com/cloudfoundry/silk-release/pull/111

MarcPaquette commented 6 months ago

111 resolves this issue. Closing it out.

cloudfoundry / silk-release

Make "containerMetadataFileCheckTimeout" property of "silk-daemon-shutdown" configurable #110