cloudfoundry / korifi

Cloud Foundry on Kubernetes
Apache License 2.0
317 stars 65 forks source link

[Chore] Fix list spaces E2E flake #1291

Closed danail-branekov closed 1 year ago

danail-branekov commented 2 years ago

We have been seeing frequent flakes in the list spaces E2E test, both locally and on CI

https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/365 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/370 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/345 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/343 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/329 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/224 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/167 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/161 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/176 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/181 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/167 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-periodic/builds/144 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/139 https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-main/builds/21.1

• [FAILED] [2.642 seconds]
Spaces
/tmp/build/3e139e9d/korifi/tests/e2e/spaces_test.go:19
  list [BeforeEach]
  /tmp/build/3e139e9d/korifi/tests/e2e/spaces_test.go:124
    when filtering by name
    /tmp/build/3e139e9d/korifi/tests/e2e/spaces_test.go:230
      only lists those matching and available
      /tmp/build/3e139e9d/korifi/tests/e2e/spaces_test.go:237

  Begin Captured StdOut/StdErr Output >>
    W0630 14:58:53.890389     531 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
    To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
  << End Captured StdOut/StdErr Output

  Begin Captured GinkgoWriter Output >>
    Fail Handler: failure correlation ID: "11dc503b-01ae-4ce5-b4cb-49a4cdebd722"

    ===== Logs for pod "korifi-api-deployment-b47494658-99qzk" with correlation ID "11dc503b-01ae-4ce5-b4cb-49a4cdebd722" (last 50 lines) =====
    No relevant logs found
    ==============================================

    ===== Logs for pod "korifi-controllers-controller-manager-5fd7984d58-kqpvl" (last 50 lines) =====
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
    1.6566011338793697e+09  INFO    controllers.CFSpace Secret cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/image-registry-credentials created
    1.6566011338940556e+09  DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-korifi-cloudfoundry-org-v1alpha1-cfapp", "UID": "9f67eb72-4a8d-4572-9766-5a364279cc25", "kind": "korifi.cloudfoundry.org/v1alpha1, Kind=CFApp", "resource": {"group":"korifi.cloudfoundry.org","version":"v1alpha1","resource":"cfapps"}}
    1.6566011338950639e+09  INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/cf-73f5dbc77ac589e429648f8540cf03ef2a440d2b3a0aba94b3dfdf9e3506c61d created
    1.6566011339118817e+09  INFO    cfapp-validate.duplicateValidator.ValidateDelete    failed to deregister name during delete {"error": "deleting a lease failed: leases.coordination.k8s.io \"n-3b137e9af0c7d9df87a71055d2f93fb68a1ed547\" not found", "namespace": "cf-space-265c32e1-1e22-4f78-8eee-44fc30a6bc85", "name": "app-2fa2ed9a-151f"}
    1.6566011339119818e+09  DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-korifi-cloudfoundry-org-v1alpha1-cfapp", "code": 200, "reason": "", "UID": "9f67eb72-4a8d-4572-9766-5a364279cc25", "allowed": true}
    1.6566011339297414e+09  INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/default-admin-binding created
    1.656601133930024e+09   INFO    controllers.CFApp   Reconciling deletion of CFApp/2fc24318-b3c7-455a-a1dc-2263f78d5327
    1.6566011339419713e+09  DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-korifi-cloudfoundry-org-v1alpha1-cfapp", "UID": "d67a6eab-0ac4-4a30-91dc-aa4f172f5f1b", "kind": "korifi.cloudfoundry.org/v1alpha1, Kind=CFApp", "resource": {"group":"korifi.cloudfoundry.org","version":"v1alpha1","resource":"cfapps"}}
    1.6566011339422295e+09  INFO    cfapp-resource  Mutating CFApp webhook handler  {"name": "2fc24318-b3c7-455a-a1dc-2263f78d5327"}
    1.6566011339425893e+09  DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-korifi-cloudfoundry-org-v1alpha1-cfapp", "code": 200, "reason": "", "UID": "d67a6eab-0ac4-4a30-91dc-aa4f172f5f1b", "allowed": true}
    1.656601133944406e+09   INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/eirini-controller-workload-rolebinding created
    1.656601133955734e+09   DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-korifi-cloudfoundry-org-v1alpha1-cfapp", "UID": "a628c55e-c53c-403c-b3ff-4cff7f7231bf", "kind": "korifi.cloudfoundry.org/v1alpha1, Kind=CFApp", "resource": {"group":"korifi.cloudfoundry.org","version":"v1alpha1","resource":"cfapps"}}
    1.656601133956766e+09   DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-korifi-cloudfoundry-org-v1alpha1-cfapp", "code": 200, "reason": "", "UID": "a628c55e-c53c-403c-b3ff-4cff7f7231bf", "allowed": true}
    1.6566011339679718e+09  ERROR   controllers.CFApp   unable to fetch CFApp   {"error": "CFApp.korifi.cloudfoundry.org \"2fc24318-b3c7-455a-a1dc-2263f78d5327\" not found"}
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
    1.6566011339720857e+09  INFO    controllers.CFSpace Namespace/cf-space-8fe13471-5095-4272-8bea-5bb2fb13917f unchanged
    1.6566011339726207e+09  INFO    controllers.CFSpace Secret cf-space-8fe13471-5095-4272-8bea-5bb2fb13917f/image-registry-credentials unchanged
    1.6566011339729743e+09  INFO    controllers.CFSpace Role Binding cf-space-8fe13471-5095-4272-8bea-5bb2fb13917f/default-admin-binding unchanged
    1.6566011339732072e+09  INFO    controllers.CFSpace Role Binding cf-space-8fe13471-5095-4272-8bea-5bb2fb13917f/eirini-controller-workload-rolebinding unchanged
    1.656601133987207e+09   INFO    controllers.CFSpace service account already exists, skipping    {"error": "serviceaccounts \"kpack-service-account\" already exists"}
    1.6566011339934998e+09  INFO    controllers.CFSpace service account already exists, skipping    {"error": "serviceaccounts \"eirini\" already exists"}
    1.6566011339987369e+09  INFO    controllers.CFSpace Namespace/cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262 unchanged
    1.6566011339991918e+09  INFO    controllers.CFSpace Secret cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/image-registry-credentials unchanged
    1.6566011339995353e+09  INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/cf-73f5dbc77ac589e429648f8540cf03ef2a440d2b3a0aba94b3dfdf9e3506c61d unchanged
    1.656601133999732e+09   INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/default-admin-binding unchanged
    1.6566011339999304e+09  INFO    controllers.CFSpace Role Binding cf-space-5f3702f7-86f6-483c-bb7b-ec2d1df04262/eirini-controller-workload-rolebinding unchanged
    1.656601134008658e+09   INFO    controllers.CFSpace service account already exists, skipping    {"error": "serviceaccounts \"kpack-service-account\" already exists"}
    1.6566011340151427e+09  INFO    controllers.CFSpace service account already exists, skipping    {"error": "serviceaccounts \"eirini\" already exists"}
    1.656601134026048e+09   ERROR   controllers.CFProcess   Error when trying to fetch CFProcess cf-space-265c32e1-1e22-4f78-8eee-44fc30a6bc85/cf-proc-2fc24318-b3c7-455a-a1dc-2263f78d5327web  {"error": "CFProcess.korifi.cloudfoundry.org \"cf-proc-2fc24318-b3c7-455a-a1dc-2263f78d5327web\" not found"}
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234

    ==============================================

    Fail Handler: completed
  << End Captured GinkgoWriter Output

  unexpected error occurred while creating spaces: expected status code 201, got 503
  Expected
      <chan error | len:3, cap:9>: 0xc000a1d9e0
  not to receive anything.
  In [BeforeEach] at: /go/pkg/mod/github.com/onsi/gomega@v1.19.0/internal/assertion.go:63

We should fix them

georgethebeatle commented 2 years ago

This flake is slightly similar to https://github.com/cloudfoundry/korifi/issues/1310 since the api shim returns a 503 on org createion. The main difference here is that this happens 100% on async org creation.

Like #1310 there is no evidence of very high load and no information about potential api shim restarts

julian-hj commented 2 years ago

I think this might be the same root cause as #1310

georgethebeatle commented 2 years ago

I don't think that's what's really going on, since we looked at gcloud console and found no evidence of the shim being restarted or crashed (crash counts were zero). I think the DPanidOnBugs setting is somehow not switched on in our shim. As a matter of facts I was able to observe the poorly formatted log message right after the dpanic. And we observed this flake after the dpanic fix, so we are after something else here...

georgethebeatle commented 2 years ago

We found these interesting logs in the envoy proxy logs:

" "-"
[2022-07-08T09:33:15.932Z] "GET /v3/droplets/670c162a-2617-45af-8841-8daa8105cca9 HTTP/2" 503 UF 0 87 2 - 34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "0814f7ea-bb18-4386-8f44-b3f2a8c4fa8c" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.927Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 8 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "be35f2dc-6260-4a6b-81ae-1ec67eb044a4" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.926Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 12 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "aae629aa-ac7f-4df7-8bbd-4b1a7e765c02" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"
[2022-07-08T09:33:15.924Z] "POST /v3/spaces HTTP/2" 503 UF 129 87 15 - "34.140.146.235" "go-resty/2.7.0 (https://github.com/go-resty/resty)" "c7104b65-585c-4123-bba2-129d066bb190" "cf.pr-e2e.korifi.cf-app.com" "10.100.0.60:9000"

Thie timestamps on these correlate nicely to the time of one of the last flakes: https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/209.2#L62c50f4e:165:169

There are three EOF stderr logs in this build at the same time https://ci.korifi.cf-app.com/teams/main/pipelines/main/jobs/run-e2es-pr/builds/209.2#L62c50f4e:74:76 which are probably related to the envoy 503s.

This means that the envoy proxy is closing those connections for some reason. It's worth noting that these requests happen in parallel so we might be hitting some limits of the server. Unfortunately we haven't been able to correlate one of these failures with particularly high load of the node.

We have added a task catting the envoy logs on test failure in order to find out if we will always get these logs.

georgethebeatle commented 1 year ago

Closing due to inactivity. Let's reopen this issue or create a new one if we encounter the flake again