`"cleanUp": false` not always respected

mbmccoy commented 2 years ago

Type: Bug

I've got a deployment with several containers, including persistent volume claims that I'd like to be, well, persistent. I've set the launch.json as follows:

        {
            "name": "Kubernetes: Run/Debug - cloudbuild",
            "type": "cloudcode.kubernetes",
            "request": "launch",
            "skaffoldConfig": "${workspaceFolder}/path/to/skaffold.yaml",
            "profile": "cloudbuild",
            "watch": false,
            "cleanUp": false,
            "portForward": true,
            "internalConsoleOptions": "neverOpen",
            "imageRegistry": "gcr.io/xxxxxxx-xxxxxx"
        }

Nevertheless, the system regularly "cleans up" for me, including deleting the persistent volumes. Here's the redacted output:

    Cleaning up...
     - storageclass.storage.k8s.io "regionalpd-storageclass" deleted
     - persistentvolumeclaim "data-pv" deleted
     - deployment.apps "postgres" deleted
     - persistentvolumeclaim "postgresql-pv" deleted
     - service "postgres" deleted
     - backendconfig.cloud.google.com "config-default" deleted
     - deployment.apps "webserver-frontend" deleted
     - ingress.networking.k8s.io "webserver-ingress" deleted
     - service "webserver-frontend" deleted
    Cleanup completed in 800.663838ms
    1/2 deployment(s) failed
    Skaffold exited with code 1.

I seem to be able to trigger this several ways, but this is pretty reliable:

1. Deploy containers successfully with Cloud Code.
2. Stop the Cloud Code "run" using the stop button. (This does not clean up!)
3. Use kubectl to delete a deployment (in this case, `kubectl delete deployment webserver-frontend`).
4. Attempt to redeploy with Cloud Code.

Context

Managed Dependencies: on
Cloud SDK Version: 405.0.0
Skaffold Version: v1.39.2
Minikube Version: 1.27.0

Redacted system info:

| Item | Value |
|---|---|
| CPUs | Apple M1 (8 x 24) |
| GPU Status | 2d_canvas: enabled canvas_oop_rasterization: disabled_off direct_rendering_display_compositor: disabled_off_ok gpu_compositing: enabled metal: disabled_off multiple_raster_threads: enabled_on opengl: enabled_on rasterization: enabled raw_draw: disabled_off_ok skia_renderer: enabled_on video_decode: enabled video_encode: enabled vulkan: disabled_off webgl: enabled webgl2: enabled webgpu: disabled_off |
| Load (avg) | 2, 2, 2 |
| Memory (System) | 8.00GB (0.03GB free) |
| Process Argv | --crash-reporter-id 47721791-6124-4b7d-9d51-e3948fcb2ce9 |
| Screen Reader | no |
| VM | 0% |

| Item | Value |
|---|---|
| Remote | SSH: <REDACTED> |
| OS | Linux x64 4.19.0-21-cloud-amd64 |
| CPUs | Intel(R) Xeon(R) CPU @ 2.20GHz (4 x 2199) |
| Memory (System) | 25.51GB (21.22GB free) |
| VM | 0% |

Extension version: 1.20.3
VS Code version: Code 1.71.2 (74b1f979648cc44d385a2286793c226e611f59e7, 2022-09-14T21:05:37.721Z)
OS version: Darwin x64 21.3.0
Modes:
Sandboxed: No
Remote OS version: Linux x64 4.19.0-21-cloud-amd64

A/B Experiments

```
vsliv368:30146709
vsreu685:30147344
python383cf:30185419
vspor879:30202332
vspor708:30202333
vspor363:30204092
vstes516:30244333
vslsvsres303:30308271
pythonvspyl392:30443607
vserr242cf:30382550
pythontb:30283811
vsjup518:30340749
pythonptprofiler:30281270
vshan820:30294714
vstes263:30335439
pythondataviewer:30285071
vscod805:30301674
binariesv615:30325510
bridge0708:30335490
bridge0723:30353136
cmake_vspar411:30581797
vsaa593:30376534
pythonvs932:30410667
cppdebug:30492333
vsclangdc:30486549
c4g48928:30535728
dsvsc012cf:30540253
azure-dev_surveyone:30548225
i497e931:30553904
pyindex848cf:30577861
40g7c324:30573242
```

ChaseMor commented 2 years ago

Interesting. So the `Cleaning up...` log runs when you are deploying the second time?

Also, you say that is the most reliable way to reproduce this. Are there times when it doesn't repro, or does it always behave this way?

mbmccoy commented 2 years ago

> So the `Cleaning up...` log runs when you are deploying the second time?

Yes. To be clear, this seems to happen when the deployment fails: after cleaning up, the deployment attempt has stopped. Perhaps there is something in the post-deployment logic that cleans up after a deployment failure and doesn't respect the `cleanUp: false` option?

> You say that is the most reliable way to produce this, are there times in which this doesn't repro or does it always behave this way?

I've always seen this repro after deleting the webserver-frontend deployment (as I described), which seems to trigger a re-deployment failure and subsequent cleanUp. But I've also seen this sporadically when a container fails to start upon deployment, where it "cleans up" after a failure.

Obviously, this is mostly an issue when it deletes my persistent volume claims; I'm using these specifically because they have lifecycles that are longer than the deployments.

ChaseMor commented 2 years ago

So, the logs for cleaning up come from Skaffold, which Cloud Code calls internally. Here's a bug tracking Skaffold deleting PVCs when it shouldn't, which might have some useful ideas for what to do here: https://github.com/GoogleContainerTools/skaffold/issues/4366

I think for your specific case, there might be a way to configure your Persistent Volume to be claimed manually, or to keep a claim outside of the deployment, because to my understanding the Persistent Volume will be deleted by default if there are no claims for it.

But it does look like Cloud Code is not invoking skaffold with the cleanUp flag correctly.
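
For the Persistent Volume point above, one option that may help is setting `reclaimPolicy: Retain` on the StorageClass, so that dynamically provisioned volumes are kept when their claim is deleted. This is only a sketch: the name `regionalpd-storageclass` comes from the log in this issue, but the provisioner and parameters are assumptions (a GKE regional Persistent Disk via the CSI driver), not values taken from the actual manifests.

```yaml
# Sketch only: provisioner and parameters are assumed, not from the reporter's config.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regionalpd-storageclass
provisioner: pd.csi.storage.gke.io    # assumption: GKE Persistent Disk CSI driver
parameters:
  type: pd-balanced                   # assumption
  replication-type: regional-pd       # assumption, inferred from the class name
# Keep the PersistentVolume (and its disk) when the bound claim is deleted.
# The Kubernetes default for dynamically provisioned volumes is Delete.
reclaimPolicy: Retain
```

Note that this would not stop Skaffold from deleting the PVC objects themselves. With `Retain`, the underlying volume moves to the `Released` state instead of being removed, and it has to be rebound manually. The policy also only applies to volumes provisioned after the change.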

GoogleCloudPlatform / cloud-code-vscode

`"cleanUp": false` not always respected #658