cloudfoundry / cloud_controller_ng

Cloud Foundry Cloud Controller
Apache License 2.0

CC Workers can get blocked when jobs in the queue block for a long time #1250

Open · petergtz opened this issue 5 years ago

petergtz commented 5 years ago

Issue

CC Workers can get blocked when jobs in the queue block for a long time.

Context

We've seen CATs failures because staging took longer than 240s. It took longer than 240s because the CC Workers could not complete the staging jobs fast enough. They could not complete them fast enough because they were busy completing other jobs, specifically DeleteJobs deleting droplets in the blobstore. Deleting droplets from the blobstore is normally very fast; however, in GCP (RelInt's hagrid env) we saw GCP response times of 360s. That means those CC Worker jobs were blocked for 6 minutes. Even with multiple CC Workers, this can eventually clog the whole job queue, delaying other jobs to the point where staging is heavily impacted.

We saw the same job-queue clogging at IBM some months ago, when IBM Cloud Object Storage had degraded performance.

We noticed this in particular when using the Bits-Service as a blobstore.

Steps to Reproduce

It is in fact quite difficult to reproduce. The RelInt team has an environment (hagrid) that can provoke the situation when running CATs multiple times; usually, the queue gets clogged on every other CATs run.

Expected result

Staging takes about a minute.

Current result

Staging takes more than 4 minutes.

Possible Fix

While we can introduce timeouts both in the Bits-Service and in the Bits-Service client gem, the CC Workers are the right layer to decide when to cancel a job and re-schedule it. More concretely, the BlobstoreDelete job should use a timeout to defend itself against slow deletes. A delete should usually take no more than 10s (conservatively); in 99% of cases it takes less than 1s.
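
To illustrate the idea, here is a minimal sketch of such a job-level timeout. The class name, constant, and blobstore client interface are illustrative assumptions, not the actual Cloud Controller code:

require 'timeout'

# Illustrative sketch only: BlobstoreDeleteWithTimeout is a hypothetical name,
# and blobstore_client stands for whatever client the job uses.
class BlobstoreDeleteWithTimeout
  DELETE_TIMEOUT = 10 # seconds; conservative upper bound for a single delete

  def initialize(blobstore_client, blob_key)
    @blobstore_client = blobstore_client
    @blob_key = blob_key
  end

  def perform
    Timeout.timeout(DELETE_TIMEOUT) do
      @blobstore_client.delete(@blob_key)
    end
  rescue Timeout::Error
    # Give up on this attempt instead of blocking the worker; a retry or a
    # later cleanup job can remove the blob.
    raise
  end
end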

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161752774

The labels on this github issue will be updated when the story is started.

charleshansen commented 5 years ago

Interesting problem. I don't have any cheap ideas for a good "Cancel and Reschedule Jobs" process in the workers, but timeouts on BlobstoreDelete sound safe to me. We have nightly jobs to clean up orphaned blobs, so it is ok if a delete fails here or there. Can you think of any instance where a blob would take more than 10 seconds to delete many nights in a row?

petergtz commented 5 years ago

@charleshansen

Can you think of any instance where a blob would take more than 10 seconds to delete many nights in a row?

No, I think this is indeed very unlikely.

charleshansen commented 5 years ago

I just discovered this is already configurable: https://github.com/cloudfoundry/capi-release/blob/8d84509d049ca4b9605579f0a2e8295f5f5cecc0/jobs/cloud_controller_worker/spec#L96

@petergtz can you try setting the blobstore delete timeout and see if it fixes your problem? You may want to set it in all three places (cloud_controller_ng, cloud_controller_worker, cloud_controller_clock) just to be sure.
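
For illustration, setting it in all three places could look roughly like the following BOSH manifest excerpt. The instance group names and the 60-second value are assumptions (based on a typical cf-deployment layout); the property path is the one from the spec linked above:

# Hypothetical manifest excerpt; 60 is an example value in seconds.
instance_groups:
- name: api            # assumed instance group running cloud_controller_ng
  jobs:
  - name: cloud_controller_ng
    properties:
      cc:
        jobs:
          blobstore_delete:
            timeout_in_seconds: 60
- name: cc-worker      # assumed instance group running cloud_controller_worker
  jobs:
  - name: cloud_controller_worker
    properties:
      cc:
        jobs:
          blobstore_delete:
            timeout_in_seconds: 60
- name: scheduler      # assumed instance group running cloud_controller_clock
  jobs:
  - name: cloud_controller_clock
    properties:
      cc:
        jobs:
          blobstore_delete:
            timeout_in_seconds: 60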

petergtz commented 5 years ago

@charleshansen Yep, this looks pretty much like what I was looking for. Thanks very much for finding this. Closing this for now as it seems to be solving our issue.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162549000

The labels on this github issue will be updated when the story is started.

petergtz commented 5 years ago

@charleshansen Looking at this once more: would it make sense to use a better default for deletes? It currently falls back to the global timeout of 4h, which is hardly ever a good default for deletes. 60s seems more appropriate and still conservative, no?

Unless, of course, this job is also used to delete "folders" in the blobstore, which could take a while depending on how many files are in them. Wondering if there is a way to still come up with better defaults for such operations.

charleshansen commented 5 years ago

Agreed there might be a much better default. I don't think we ever use the job to delete folders, so it shouldn't be too bad.

My main worry is someone has uploaded a very large file on a very slow blobstore, and it just takes 5 minutes to delete it or something crazy. I have no idea how slow this is in the worst case. In that situation, if we set the default too low, the users now have a very large orphaned blob that will never be cleaned, and they may accumulate a lot of these if it is from a particular app that is pushed often.

To do this right, you would need some kind of timeout backoff behavior, but that would not be worth the cost to implement. Some potentially cheaper ideas:

I'm rolling off the team soon, so @zrob , @ssisil , or @tcdowney may have to make the final call on this one.
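
As an aside, here is a minimal sketch of what such timeout backoff behavior could look like, purely for illustration; the function, base timeout, and cap are assumptions, not anything in CC:

# Illustrative only: grow the per-attempt timeout for retried deletes so a
# genuinely slow delete eventually gets enough time, while the common case
# stays tightly bounded.
BASE_DELETE_TIMEOUT = 10   # seconds
MAX_DELETE_TIMEOUT  = 300  # cap at 5 minutes

def delete_timeout_for(attempt)
  [BASE_DELETE_TIMEOUT * (2**attempt), MAX_DELETE_TIMEOUT].min
end

# delete_timeout_for(0) => 10, delete_timeout_for(1) => 20, delete_timeout_for(5) => 300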

charleshansen commented 5 years ago

Forgot to add, there is another option: Blobstore delete jobs are marked as low priority, so they don't clog up the workers as much.
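
A rough sketch of what that option could look like with the Delayed::Job-based queueing the CC workers use; the job constructor, queue name, and priority value below are assumptions for illustration only:

# Illustrative only: in Delayed::Job a higher priority number means *lower*
# priority, so blobstore deletes queued like this are picked up only after
# more urgent work (e.g. staging-related jobs) is done.
BLOBSTORE_DELETE_PRIORITY = 100 # assumed value; default priority is 0

blob_key = 'example-droplet-key' # placeholder

Delayed::Job.enqueue(
  BlobstoreDelete.new(blob_key, :droplet_blobstore), # assumed constructor
  queue:    'cc-generic',                            # assumed queue name
  priority: BLOBSTORE_DELETE_PRIORITY
)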

petergtz commented 5 years ago

My main worry is someone has uploaded a very large file on a very slow blobstore, and it just takes 5 minutes to delete it or something crazy. I have no idea how slow this is in the worst case. In that situation, if we set the default too low, the users now have a very large orphaned blob that will never be cleaned, and they may accumulate a lot of these if it is from a particular app that is pushed often.

I think this can't happen: if a storage provider gets a DELETE request, starts working on it, and the client then cuts off the connection, the deletion is already in progress. There is no way the storage provider could recover the file halfway through. Even if the file is very big and sharded across multiple machines, I'd assume the provider first marks the file as deleted and then removes the individual shards; it doesn't matter whether that happens synchronously or asynchronously.

So now, when the CC job tries to delete this file again, the storage provider will simply report that it's already gone.

Forgot to add, there is another option: Blobstore delete jobs are marked as low priority, so they don't clog up the workers as much.

Yes, sounds like a viable option too.

BTW, https://github.com/cloudfoundry/cf-deployment/issues/678 is related and could be used to verify that whatever change is made in CC actually works.

petergtz commented 5 years ago

Hey @zrob , @ssisil , or @tcdowney,

Could you please have a look at the conversation above, and specifically at the proposals @charleshansen made? This would defend the CC workers much better against intermittently slow requests to GCP or Azure backends. See also https://cloudfoundry.slack.com/archives/C0BNGJY0G/p1547220685031100.

While we can do some mitigation in the Bits-Service, overall the CC jobs are the better place to make decisions about timeouts, retries, or priority.

Thanks, Peter

Gerg commented 4 years ago

I'm a bit confused about the original issue described here. I understand that the BlobstoreDelete job can fill up the worker queue if the blobstore is being slow, but I don't think it was the cause of the staging timeouts you saw. Staging is done on Diego cells, which are not affected by the CC worker queue. Similarly, file uploads are handled by CC local workers, which pull from different job queues.

As a demonstration, I stopped the cc workers and was able to successfully push an app:

bosh vms | grep cc-worker
cc-worker/c5410de2-b248-4db6-a9bf-3ed36b5b2a7f              stopped ...
cf6 push workerless
Pushing app workerless to org org / space space as admin...
Getting app info...
Creating app with these attributes...
+ name:       workerless
  path:       ...github.com/cloudfoundry/cf-acceptance-tests/assets/dora
  routes:
+   workerless.butter-koala.lite.cli.fun

Creating app workerless...
Mapping routes...
Comparing local files to remote cache...
Packaging files to upload...
Uploading files...
 233.30 KiB / 233.30 KiB [============================================================================================================================================] 100.00% 1s

Waiting for API to complete processing files...

Staging app and tracing logs...
   Downloading binary_buildpack...
   ...

Waiting for app to start...

name:              workerless
requested state:   started
routes:            ...
last uploaded:     Mon 17 Aug 11:46:28 PDT 2020
stack:             cflinuxfs3
buildpacks:        ruby

type:            web
instances:       1/1
memory usage:    256M
start command:   bundle exec rackup config.ru -p $PORT
     state     since                  cpu    memory      disk      details
#0   running   2020-08-17T18:46:40Z   0.0%   0 of 256M   0 of 1G

I'm cautious about messing with the baseline blobstore delete timeout since we've seen blobstores:

  1. Behave very differently, even when claiming "s3 compatibility"
  2. Behave slowly, often intermittently

Operators will have a better sense of their blobstore's characteristics and can configure cc.jobs.blobstore_delete.timeout_in_seconds as is appropriate for their environment.

Gerg commented 4 years ago

That said, the new v7 CLI push implementation depends on the POST /v3/spaces/:guid/actions/apply_manifest endpoint, which in turn depends on the cc-workers. So a backed-up job queue will block pushes made with the v7 CLI.

To get past this (regardless of what job is clogging the queue), we may want to increase the priority of ApplyManifestActionJob.

petergtz commented 4 years ago

@Gerg Sorry, I'm not working on Cloud Foundry anymore. So unfortunately, I cannot provide more details.

weymanf commented 3 years ago

The Tracker story has moved here: https://www.pivotaltracker.com/n/projects/966314/stories/174482113