DataBiosphere / leonardo

Notebook service
BSD 3-Clause "New" or "Revised" License
43 stars 21 forks source link

RW-12956 Fixing a bug where PD is not cleaned up properly #4683

Closed Qi77Qi closed 2 months ago

Qi77Qi commented 2 months ago

Last week, we had some bug in SAS chart that was preventing app creation to succeed. This revealed a different bug that if an app creation fails, the newly created PD may not be cleaned up, and this causes all future app creation in the same workspace to fail because the following error

[ERROR] [06/19/2024 20:37:56.262] [leonardo-akka.actor.default-dispatcher-9] [akka.actor.ActorSystemImpl(leonardo)] HttpMethod(POST) http://notebooks.firecloud.org/api/google/v1/apps/terra-vpc-sc-9614c68a/all-of-us-4275-cromwell-6348: 500 Internal Server Error entity: {"source":"leonardo","message":"Existing disk found, but no restore info found in DB","statusCode":500,"exceptionClass":"class org.broadinstitute.dsde.workbench.leonardo.model.LeoException","traceId":"df17b4103e731dfbac5abc2ab2595fb8/16266843891970221338"}

What this PR will does is pass the diskId to the handleKubernetesError function, where it'll transition the disk status to Failed. This way, UI won't attempt to reuse this disk and run into the above exception. We might want to consider adding a step to delete PD in GCP in the error handler for future improvements.

Jira ticket: https://precisionmedicineinitiative.atlassian.net/browse/RW-12956

Summary of changes

What

-

Why

-

Testing these changes

What to test

Who tested and where

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.92%. Comparing base (a6893f4) to head (e87240d).

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683/graphs/tree.svg?width=650&height=150&src=pr&token=NZGL9e1XHh&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere)](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere) ```diff @@ Coverage Diff @@ ## develop #4683 +/- ## ======================================== Coverage 73.92% 73.92% ======================================== Files 160 160 Lines 14999 15000 +1 Branches 1154 1191 +37 ======================================== + Hits 11088 11089 +1 Misses 3911 3911 ``` | [Files](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere) | Coverage Δ | | |---|---|---| | [...ch/leonardo/http/service/LeoAppServiceInterp.scala](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?src=pr&el=tree&filepath=http%2Fsrc%2Fmain%2Fscala%2Forg%2Fbroadinstitute%2Fdsde%2Fworkbench%2Fleonardo%2Fhttp%2Fservice%2FLeoAppServiceInterp.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere#diff-aHR0cC9zcmMvbWFpbi9zY2FsYS9vcmcvYnJvYWRpbnN0aXR1dGUvZHNkZS93b3JrYmVuY2gvbGVvbmFyZG8vaHR0cC9zZXJ2aWNlL0xlb0FwcFNlcnZpY2VJbnRlcnAuc2NhbGE=) | `87.08% <100.00%> (+0.01%)` | :arrow_up: | | [.../leonardo/monitor/LeoPubsubMessageSubscriber.scala](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?src=pr&el=tree&filepath=http%2Fsrc%2Fmain%2Fscala%2Forg%2Fbroadinstitute%2Fdsde%2Fworkbench%2Fleonardo%2Fmonitor%2FLeoPubsubMessageSubscriber.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere#diff-aHR0cC9zcmMvbWFpbi9zY2FsYS9vcmcvYnJvYWRpbnN0aXR1dGUvZHNkZS93b3JrYmVuY2gvbGVvbmFyZG8vbW9uaXRvci9MZW9QdWJzdWJNZXNzYWdlU3Vic2NyaWJlci5zY2FsYQ==) | `73.98% <50.00%> (ø)` | | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere). Last update [a6893f4...e87240d](https://app.codecov.io/gh/DataBiosphere/leonardo/pull/4683?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataBiosphere).
Qi77Qi commented 2 months ago

@lucymcnatt @cindy-broadinstitute I updated the PR a bit..the previous location where I added the diskId isn't correct given current implementation of the handleKubernetesError and cleanUpAfterCreateAppError. I added a comment to clarify.

I added the diskId when disk fails to detach instead because this is what I see in BEE after I launch an RStudio with bad helm chart

{"@timestamp":"2024-07-02T15:48:52.819519274Z","@version":"1","logger_name":"org.broadinstitute.dsde.workbench.leonardo.http.Boot","traceId":"90134d7c-631e-4dcc-ac8b-9d6044b5b2a6","version":"aee05698f","thread_name":"io-compute-blocker-100","stack_trace":"org.broadinstitute.dsde.workbench.leonardo.monitor.PubsubHandleMessageError$PubsubKubernetesError: An error occurred with a kubernetes operation from source disk during action deleteApp. \nOriginal message: The data disk failed to detach within the time limit, cannot proceed with delete disk\n","severity":"ERROR","serviceContext":{"service":"leonardo","version":"aee05698f"},"message":"An error occurred during resource clean up for app AppName(qi-rstudio-7-2-0) in project terra-quality-d7a5f18a"}

Basically, the disk wasn't cleaned up becuz it failed to detach. This PR will make is such that when disk fail to detach, the disk status will be marked as FAILED in handleKubernetesError instead of remaining in Ready status. There might be some issue in how we determine disk is detached today worth looking into as well.