cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

CF CLI frequently gets 'stuck' during push commands while doing concurrent pushes #588

Closed braunsonm closed 3 years ago

braunsonm commented 3 years ago

Describe the bug

We have a large number of projects that get pushed to our cf-for-k8s cluster from CI tools. We often see that when two deployments happen at the same time, one of them gets stuck during the cf push phase. This doesn't affect the actual deployment: both apps deploy successfully and are routable, but the CF CLI seems to just freeze and eventually times out.

   Build successful
Starting deployment for app gateway...
Waiting for app to deploy...
Start app timeout
TIP: Application must be listening on the right port. Instead of hard coding the port, use the $PORT environment variable.
Use 'cf logs gateway --recent' for more information
FAILED

However, if you open another terminal with cf logs gateway and watch the deployment, everything happens normally: the app starts up and is internet routable.

When you rerun the CI tool it will work fine.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a few apps at the same time
  2. Notice that sometimes cf push commands get stuck and time out (see the sketch below).
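For example, a minimal reproduction sketch (the app names and the use of background jobs are illustrative; the key point is that the two pushes overlap in time):

cf push app-one &
cf push app-two &
wait

# While one push appears stuck, confirm in another terminal that the app actually started:
cf logs app-one --recent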

Expected behavior

Multiple cf push commands should be able to run concurrently.

Additional context

cf-for-k8s SHA

tag v1.0.0

Cluster information

AKS

CLI versions

cf version: 7.1.0

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176001755

The labels on this github issue will be updated when the story is started.

braunsonm commented 3 years ago

Could this be because we use the rolling deployment strategy?
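For reference, our CI pushes use the v7 CLI's rolling strategy, roughly like this (the app name is illustrative):

cf push gateway --strategy rolling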

jamespollard8 commented 3 years ago

Oh fascinating - thanks @braunsonm for the report!

We'll prioritize this for our "Community pair" in the coming days and will first work towards trying to reproduce this ourselves. I'll also start a thread on #cf-for-k8s in case others have ideas (thread lives here)

braunsonm commented 3 years ago

Additional logs when the timeouts seem to occur: https://gist.github.com/braunsonm/4d838a1fd5d9ef4e220d3d7ee0bad15e

More details in the Slack thread

paulcwarren commented 3 years ago

Thanks for the logs @braunsonm. The community pair will look into these as soon as we are able to.

jamespollard8 commented 3 years ago

@braunsonm Have you still been seeing this frequently?

From the slack thread it looks like cf-api/CAKE folks would need to look at registry-buddy logs to see what's going on here. Think you could either send us some registry-buddy logs OR take a look at those and share what you see?
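For reference, something like the following should pull them, assuming the default cf-for-k8s layout where registry-buddy runs as a sidecar container of the cf-api-server deployment in the cf-system namespace (adjust the names if your install differs):

kubectl logs -n cf-system deployment/cf-api-server -c registry-buddy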

braunsonm commented 3 years ago

@jamespollard8 we believe this is still happening. It's difficult to catch the timeout, since you need two apps doing rolling deployments at the same time. However, we are seeing some fairly long deployment times during peak business hours that we think could still be related to this issue: pushes appear to wait for other rolling deployments to finish instead of running in parallel.

Unfortunately it's hard to catch this in a busy cluster, and I haven't been able to see any errors in registry-buddy (its logs are primarily filled with the registry deletion logic).

matt-royal commented 3 years ago

Hi @braunsonm. It looks like this issue has been quiet for a while. Are you still seeing this problem?

I spent a few minutes digging into the logs you provided on the Slack thread and I noticed something interesting. It seems that the build in question has no package assigned to it, which is why the patch request from the controller is failing. As far as I know this shouldn't be possible, so I'm very curious how it's happening. If you can get the build's attributes (via a GET to /v3/builds/:guid), that may help us better understand what's going on.
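For example (substituting the GUID of the build from the failing push):

cf curl /v3/builds/<build-guid>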

braunsonm commented 3 years ago

Hey @matt-royal

We haven't experienced this in a bit but as I said it's pretty hard to reproduce since you have to time it just right.

Here is the API call you requested:

{
   "guid": "3099fee1-220a-4f4e-a1ef-70cd1345d46c",
   "created_at": "2020-12-01T22:25:57Z",
   "updated_at": "2020-12-01T22:26:36Z",
   "state": "STAGED",
   "error": null,
   "lifecycle": {
      "type": "kpack",
      "data": {
         "buildpacks": []
      }
   },
   "package": {
      "guid": "f0147a7e-fa8a-43b5-8e18-fcd4cc365e2d"
   },
   "droplet": {
      "guid": "b9f679ed-91f6-4a39-9587-f3cc0b1acdac"
   },
   "created_by": {
      "guid": "3c95f1f8-b521-419e-bd06-d68aca156e59",
      "name": "pipelines",
      "email": "pipelines@internal"
   },
   "relationships": {
      "app": {
         "data": {
            "guid": "897e8443-102e-4332-921d-1ff98961b280"
         }
      }
   },
   "metadata": {
      "labels": {},
      "annotations": {}
   },
   "links": {
      SNIP
   }
}

matt-royal commented 3 years ago

Thanks, @braunsonm. It looks as though the build has a package_guid. Are you able to fetch that package with cf curl /v3/packages/f0147a7e-fa8a-43b5-8e18-fcd4cc365e2d?

braunsonm commented 3 years ago

cf curl /v3/packages/f0147a7e-fa8a-43b5-8e18-fcd4cc365e2d
{
   "errors": [
      {
         "detail": "Package not found",
         "title": "CF-ResourceNotFound",
         "code": 10010
      }
   ]
}

matt-royal commented 3 years ago

@braunsonm Thank you for the additional information. This explains why we're seeing those errors in the api server and controller. What I don't understand is how the package was deleted. It shouldn't be possible to create a build without a package, so likely the package was deleted afterwards. Are there any clues in your api server logs about what happened to the package? Any other ideas about how it might have been deleted?
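If the logs are still available, grepping the API server output for the package GUID might show whether and when it was deleted. Assuming the default cf-for-k8s names (cf-system namespace, cf-api-server deployment), something like:

kubectl logs -n cf-system deployment/cf-api-server --all-containers | grep f0147a7e-fa8a-43b5-8e18-fcd4cc365e2d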

braunsonm commented 3 years ago

Since this issue was created so long ago, I doubt I'd be able to find anything in the API server logs about the package, and I have no idea what happened to it. We do not call the CF API server directly at all; we use the CLI for simple app pushes, so my only guess is a bug somewhere along the chain.

Birdrock commented 3 years ago

@braunsonm Have you been able to reproduce this recently? If not, we'll likely close the issue; it would be reopened if this still occurs.

cc @jspawar

braunsonm commented 3 years ago

No @Birdrock not recently. Will close.

@matt-royal if you have any suggested steps to clean up that build without a package, let me know.