Issue
The containerMetadataFileCheckTimeout is hard-coded to 600 seconds, and there is no way to change it in the drain script where the shutdown is called. When the "healthcheck-timeout", "rep.evacuation_timeout_in_seconds" and "graceful-shutdown-time" are large enough, silk shuts down the network at the 600th second, and containers that are still running cannot complete their tasks properly.
Context
In our environments we set the "healthcheck-timeout" to 600 seconds and the "rep.evacuation_timeout_in_seconds" to 720 seconds.
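To make the gap concrete, here is a rough arithmetic sketch using the values above. It assumes, purely for illustration, that the silk timeout and the rep evacuation start counting at the same moment when the drain begins; the function name is hypothetical and not part of silk-release.

```python
# Hard-coded value described in this issue, in seconds.
CONTAINER_METADATA_FILE_CHECK_TIMEOUT = 600


def seconds_without_network(evacuation_timeout,
                            silk_timeout=CONTAINER_METADATA_FILE_CHECK_TIMEOUT):
    """Worst-case time a container may keep running after the network is gone.

    Assumption: both timers start together when the drain begins.
    Any graceful shutdown time would extend the gap further.
    """
    return max(0, evacuation_timeout - silk_timeout)


if __name__ == "__main__":
    # Evacuation timeout from our environments: 720 seconds.
    print(seconds_without_network(720))
```

Under that assumption, any evacuation timeout at or below 600 seconds leaves no gap, which matches the observation that the problem only appears with large timeouts.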
Steps to Reproduce
Create an app with a 10-minute healthcheck timeout, and make sure it uses the full 10 minutes.
Set the "rep.evacuation_timeout_in_seconds" to e.g. 720 seconds.
Then, on SIGTERM, try to access network resources from the app. The access will fail.
Expected result
The containers are able to use the network until they are properly terminated.
Current result
Containers may not have network access when they need to perform the graceful shutdown.
Possible Fix
One option would be to do some calculations, based on links between BOSH releases, and set the containerMetadataFileCheckTimeout equal to the "rep.evacuation_timeout_in_seconds" + "graceful_shutdown_time" + some buffer.
We have already tried this for another property, the number of containers in rep and in garden, and it did not work out well, so we ended up with two properties.
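The calculation in this first option could look roughly like the sketch below. The function name and the 30-second default buffer are assumptions chosen for illustration; the issue does not specify a buffer size, and in a real release the inputs would come from BOSH links rather than function arguments.

```python
def derived_check_timeout(evacuation_timeout, graceful_shutdown_time, buffer=30):
    """Return a silk timeout (seconds) that outlasts the container shutdown.

    evacuation_timeout     -- value of rep.evacuation_timeout_in_seconds
    graceful_shutdown_time -- graceful shutdown time granted to containers
    buffer                 -- extra safety margin (arbitrary example default)
    """
    return evacuation_timeout + graceful_shutdown_time + buffer


if __name__ == "__main__":
    # 720 s evacuation timeout from our environments, plus an example
    # 10 s graceful shutdown time (an assumed value, not from this issue).
    print(derived_check_timeout(720, 10))
```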
So the proposal is to add one additional property to silk-release, called containerMetadataFileCheckTimeout, which would be added to the command only if it is set in the descriptor.
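The "only if it is set" behaviour can be sketched as follows. This is an illustration of the idea, not the actual drain script: the command name, its arguments, and the flag spelling are hypothetical placeholders, and the real change would live in the release's job template.

```python
def teardown_command(base_command, check_timeout=None):
    """Build the shutdown command, adding the timeout flag only when configured.

    base_command  -- list of arguments for the existing command (placeholder)
    check_timeout -- value of the proposed property, or None when unset
    """
    command = list(base_command)  # copy, so the caller's list is not modified
    if check_timeout is not None:
        # Hypothetical flag name, derived from the property name in this issue.
        command.append(f"--containerMetadataFileCheckTimeout={check_timeout}")
    return command


if __name__ == "__main__":
    base = ["teardown-binary", "--config", "config.json"]  # placeholder command
    print(teardown_command(base))        # property unset: command unchanged
    print(teardown_command(base, 760))   # property set: flag appended
```

Leaving the flag out when the property is unset keeps the current hard-coded 600 seconds as the default, so existing deployments would not change behaviour.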