cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0

Rep unable to remove cached items after management API restart #852

Closed brunograz closed 3 months ago

brunograz commented 10 months ago

Summary

We've recently observed that Diego Cells emit error logs (see below) once the management API is updated during a CloudFoundry upgrade. These errors repeat in a loop until the Diego Cell is updated or recreated during the lifecycle. Not every Diego Cell emits them initially, but once the drain process starts, the same errors appear on the VM where the drain is running.

In both cases the rep drain is prolonged during the lifecycle: it waits until the configured timeout, after which the process is killed and the update proceeds.

Restarting the rep process seems to fix the issue, and the error logs are no longer emitted. Restarting rep before the drain starts also avoids the prolonged update, since rep is then able to exit properly.

{"timestamp":"2023-10-25T09:46:46.989872950Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.failed-to-delete-garden-container","data":{"error":"failed to cleanup bindmount artifacts","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4"}}
{"timestamp":"2023-10-25T09:46:46.989892741Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.failed-to-delete-container","data":{"container-guid":"99008969-8540-4dd7-7249-0c72","error":"failed to cleanup bindmount artifacts","session":"13"}}
{"timestamp":"2023-10-25T09:46:46.989758325Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.node-destroy.failed-releasing-cache-key","data":{"Guid":"99008969-8540-4dd7-7249-0c72","cache-key":"buildpack-cflinuxfs3-lifecycle","dir":"/var/vcap/data/rep/shared/garden/download_cache/38b2a7ccd052cc6ca87458d02a7c6c7a-1695808881784314774-12.d","error":"Entry Not Found","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1.1"}}
{"timestamp":"2023-10-25T09:46:46.989831937Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.failed-to-destroy-container","data":{"Guid":"99008969-8540-4dd7-7249-0c72","error":"failed to cleanup bindmount artifacts","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1"}}
{"timestamp":"2023-10-25T09:46:46.989770441Z","level":"error","source":"rep","message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.node-destroy.failed-to-release-cached-deps","data":{"Guid":"99008969-8540-4dd7-7249-0c72","error":"Entry Not Found","guid":"99008969-8540-4dd7-7249-0c72","session":"13.4.1.1"}}
{"timestamp":"2023-10-25T09:46:38.264401894Z","level":"info","source":"guardian","message":"guardian.destroy.start","data":{"handle":"99008969-8540-4dd7-7249-0c72","session":"28655456"}}
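For anyone triaging the same symptom across many cells, a small script can pull out which cache keys rep failed to release, grouped by container guid. This is only a rough sketch: the function names `iter_log_entries` and `missing_cache_keys` are made up for illustration, and it assumes the rep log is a stream of JSON objects with the same fields as the entries above. The sample entry is taken from those logs, trimmed to the fields the script reads.

```python
import json
from collections import defaultdict


def iter_log_entries(text):
    """Yield JSON objects from text holding one or more objects,
    whether they are newline-separated or run together on one line."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between consecutive objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def missing_cache_keys(text):
    """Map container guid -> set of cache keys whose release failed
    with "Entry Not Found"."""
    result = defaultdict(set)
    for entry in iter_log_entries(text):
        data = entry.get("data", {})
        if (
            entry.get("level") == "error"
            and entry.get("message", "").endswith("failed-releasing-cache-key")
            and data.get("error") == "Entry Not Found"
        ):
            result[data.get("guid")].add(data.get("cache-key"))
    return dict(result)


# Sample built from the log entries above (fields trimmed for brevity).
sample = (
    '{"timestamp":"2023-10-25T09:46:46.989758325Z","level":"error","source":"rep",'
    '"message":"rep.evacuation-cleanup.delete-container.containerstore.destroy.'
    'node-destroy.failed-releasing-cache-key",'
    '"data":{"cache-key":"buildpack-cflinuxfs3-lifecycle","error":"Entry Not Found",'
    '"guid":"99008969-8540-4dd7-7249-0c72"}} '
    '{"timestamp":"2023-10-25T09:46:38.264401894Z","level":"info","source":"guardian",'
    '"message":"guardian.destroy.start","data":{"handle":"99008969-8540-4dd7-7249-0c72"}}'
)

if __name__ == "__main__":
    for guid, keys in missing_cache_keys(sample).items():
        print(guid, sorted(keys))
```

Feeding it the contents of a cell's rep log instead of `sample` should show whether the same cache key (here `buildpack-cflinuxfs3-lifecycle`) is the one failing on every affected cell.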

Steps to Reproduce

A stemcell update on the management plane is sufficient to observe this behavior. We are currently looking at the specific process that triggers this.

Environment Details

The issue has been observed since upgrading from cf-deployment 29.0.0 to 30.5.0.

  name: capi
- version: 1.152.0
+ version: 1.153.0
  name: diego
- version: 2.76.0
+ version: 2.78.0
  name: garden-runc
- version: 1.29.0
+ version: 1.33.0

Additional information

Further information: https://cloudfoundry.slack.com/archives/C2U7KA7M4/p1693997791135449

MarcPaquette commented 4 months ago

Hi @brunograz, are you still experiencing this issue with the latest versions of Diego, CAPI and Garden-runc?

MarcPaquette commented 3 months ago

Closing this issue due to lack of response. Feel free to reopen if issue persists.