Summary
We have recently observed that Diego Cells emit error logs (see below) once the management API is updated during a Cloud Foundry upgrade. These error logs repeat in a loop until the Diego Cell is updated or recreated later in the deployment lifecycle. Not every Diego Cell emits these logs initially, but once the drain process starts, the same errors appear on the VM being drained.
In both cases, rep's drain is prolonged during the lifecycle: the drain waits until the configured timeout expires before the process is killed and the update proceeds.
Restarting the rep process appears to resolve the issue, and the error logs stop. Furthermore, restarting rep before the drain starts also avoids the prolonged update, as rep is then able to exit cleanly.
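Assuming a standard BOSH-deployed foundation, the restart workaround can be applied across cells before the deploy. The deployment name `cf` and instance group `diego-cell` below are assumptions; adjust them for your environment:

```shell
# Restart rep on every Diego Cell via monit before starting the upgrade,
# so the subsequent drain can complete within the timeout.
# Deployment and instance-group names are placeholders for this foundation.
bosh -d cf ssh diego-cell -c 'sudo /var/vcap/bosh/bin/monit restart rep'
```

This is only a mitigation for the prolonged drain, not a fix for the underlying cleanup failures.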
{"timestamp":"2023-10-25T09:46:46.989872950Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.failed-to-delete-garden-container","data":{"error":"failed to cleanup bindmount artifacts","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4"}}
{"timestamp":"2023-10-25T09:46:46.989892741Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.failed-to-delete-container","data":{"container-guid":"99008969-8540-4dd7-7249-0c72","error":"failed to cleanup bindmount artifacts","session":"13"}}
{"timestamp":"2023-10-25T09:46:46.989758325Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.node-destroy.failed-releasing-cache-key","data":{"Guid":"99008969-8540-4dd7-7249-0c72","cache-key":"buildpack-cflinuxfs3-lifecycle","dir":"/var/vcap/data/rep/shared/garden/download_cache/38b2a7ccd052cc6ca87458d02a7c6c7a-1695808881784314774-12.d","error":"Entry Not Found","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1.1"}}
{"timestamp":"2023-10-25T09:46:46.989831937Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.failed-to-destroy-container","data":{"Guid":"99008969-8540-4dd7-7249-0c72","error":"failed to cleanup bindmount artifacts","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1"}}
{"timestamp":"2023-10-25T09:46:46.989770441Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.node-destroy.failed-to-release-cached-deps","data":{"Guid":"99008969-8540-4dd7-7249-0c72","error":"Entry Not Found","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1.1"}}
{"timestamp":"2023-10-25T09:46:38.264401894Z","level":"info","source":"guardian","message":"guardian.destroy.start","data":{"handle":"99008969-8540-4dd7-7249-0c72","session":"28655456"}}
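To spot cells that are already stuck in this error loop before a drain begins, one option is to count the repeating cleanup failures in the rep log. The sketch below demonstrates the filter against a sample line taken from the report; on a real cell, point `LOG` at the standard BOSH job log location (typically `/var/vcap/sys/log/rep/rep.stdout.log`):

```shell
# Write one sample rep log line (taken from the report above) to a file,
# then count occurrences of the looping bindmount-cleanup error.
# On a real Diego Cell, set LOG to the actual rep log path instead.
LOG="${LOG:-/tmp/rep-sample.log}"
cat > "$LOG" <<'EOF'
{"level":"error","source":"rep","message":"rep.evacuation-cleanup.failed-to-delete-container","data":{"error":"failed to cleanup bindmount artifacts"}}
EOF
grep -c 'failed to cleanup bindmount artifacts' "$LOG"
```

A steadily growing count on a cell indicates it will likely hit the prolonged drain during the next update.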
Steps to Reproduce
A stemcell update on the management plane is sufficient to observe this behavior. We are still investigating which specific process triggers it.
Environment Details
The issue has been observed since upgrading from cf-deployment 29.0.0 to 30.5.0, which included the following release version changes:
name: capi
- version: 1.152.0
+ version: 1.153.0
name: diego
- version: 2.76.0
+ version: 2.78.0
name: garden-runc
- version: 1.29.0
+ version: 1.33.0
Additional information
Further information: https://cloudfoundry.slack.com/archives/C2U7KA7M4/p1693997791135449