cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

kubecf CI failures #558

Closed bikramnehra closed 4 years ago

bikramnehra commented 4 years ago

Describe the bug The kubecf pipeline is failing intermittently with the following error message:

secret/susecf-scf.var-cf-admin-password created
Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "BOSHDeployment" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksJob" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksSecret" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksStatefulSet" in version "quarks.cloudfoundry.org/v1alpha1"]
make[1]: *** [Makefile:17: install] Error 1
make: *** [Makefile:180: scf] Error 2

https://concourse.suse.dev/teams/main/pipelines/kubecf/jobs/deploy-diego/builds/268
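One way to check whether the quarks CRDs are registered and served at the moment the failure happens (this is just a diagnostic sketch, not something the pipeline currently runs):

kubectl get crd | grep quarks.cloudfoundry.org
kubectl api-resources --api-group=quarks.cloudfoundry.org

If the CRDs appear in the kubectl get crd output but the api-resources listing for the group is empty, the apiserver has stored the CRD objects but is not yet serving the quarks.cloudfoundry.org/v1alpha1 group, which would match the "no matches for kind" errors above.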

To Reproduce Trigger a kubecf build and this error will appear intermittently.

Expected behavior CI to pass without such failures.

Environment

Additional context NA

bikramnehra commented 4 years ago

@mudler @jimmykarily can we do something about this issue? It's becoming a bottleneck. https://concourse.suse.dev/teams/main/pipelines/kubecf/jobs/deploy-diego/builds/302

gaktive commented 4 years ago

@viovanov @f0rmiga Heads up.

gaktive commented 4 years ago

@bikramnehra how often is this happening? Can you pinpoint it to a particular part of the hardware this CI runs on?

bikramnehra commented 4 years ago

@mudler and I took an initial look and suspect that the operator was not yet ready when the pipeline started deploying kubecf, but this needs further investigation (which is becoming harder since the cluster goes away afterwards).

I tried reproducing this locally but no luck there :(

I have seen this happening relatively frequently, e.g. builds 302, 301, and many earlier builds had this problem.

I am not entirely sure the problem can be attributed to hardware; we have seen the following issues, which were clearly hardware related:

[./up.sh] [backend:ekcp] [cluster:kubecf-eirini-1585111658-1f94e87] Loading
{"Output":"","AvailableClusters":null,"Clusters":null,"ActiveEndpoints":null,"ClusterIPs":null,"LocalClusters":null,"Error":"No available resources"}/tmp/build/b0d51b9f/catapult/buildkubecf-eirini-1585111658-1f94e87 /tmp/build/b0d51b9f/catapult/backend/ekcp
[./kubeconfig.sh] [backend:ekcp] [cluster:kubecf-eirini-1585111658-1f94e87] Loading
jimmykarily commented 4 years ago

The last error, No available resources, should not happen now that #561 has been merged, at least as long as we don't try to squeeze too many clusters onto the available hardware. That says nothing about the reported issue though; it still needs investigation.

mudler commented 4 years ago

Hard to reproduce locally (I didn't manage to), but I think https://github.com/SUSE/catapult/pull/148 might help; it's definitely worth a shot IMHO.

mudler commented 4 years ago

After seeing the comment from @jandubois (https://github.com/cloudfoundry-incubator/kubecf/pull/572#issuecomment-606161513) I'm suspecting this is no longer just a CI issue. It's a stretched hunch that needs investigation, though.

I'm saying that because Catapult now checks that the CRDs are in place, and as you can see from the logs they are indeed present:

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Wait for cf-operator to be ready

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get endpoints -n cf-operator cf-operator-webhook -o name

endpoints/cf-operator-webhook

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarksstatefulsets.quarks.cloudfoundry.org -o name

customresourcedefinition.apiextensions.k8s.io/quarksstatefulsets.quarks.cloudfoundry.org

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarkssecrets.quarks.cloudfoundry.org -o name

customresourcedefinition.apiextensions.k8s.io/quarkssecrets.quarks.cloudfoundry.org

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarksjobs.quarks.cloudfoundry.org -o name

customresourcedefinition.apiextensions.k8s.io/quarksjobs.quarks.cloudfoundry.org

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd boshdeployments.quarks.cloudfoundry.org -o name

customresourcedefinition.apiextensions.k8s.io/boshdeployments.quarks.cloudfoundry.org

[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] cf-operator ready

secret/susecf-scf.var-cf-admin-password created

Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "BOSHDeployment" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksJob" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksSecret" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksStatefulSet" in version "quarks.cloudfoundry.org/v1alpha1"]

It might be that we need to wait for other resources to be ready. @viovanov, any thoughts?

On the other hand, it seems to happen less often now, so it might also be a (slow) flaky VM causing this. I could recreate the stack to check whether it happens again, but I don't think it should happen at all: if we hit a race condition like this in CI, users might hit it as well.
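One possibility (a sketch only, not what install.sh does today): kubectl get crd ... -o name only proves the CRD objects exist, while waiting for the Established condition would also confirm the apiserver has accepted their names and is serving the new API group, e.g.:

kubectl wait --for condition=established --timeout=120s \
  crd/boshdeployments.quarks.cloudfoundry.org \
  crd/quarksjobs.quarks.cloudfoundry.org \
  crd/quarkssecrets.quarks.cloudfoundry.org \
  crd/quarksstatefulsets.quarks.cloudfoundry.org

Even then, helm resolves kinds through client-side discovery, so a stale discovery cache could in principle still produce the "no matches for kind" error after the CRDs are established.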

mudler commented 4 years ago

https://github.com/kubernetes-sigs/kind/issues/762#issuecomment-521017113 and https://github.com/kubernetes/kubernetes/issues/62725 might be related

mudler commented 4 years ago

Spoke with @jimmykarily and @viovanov; now trying to deploy other quarks resources (QuarksStatefulSet, and so on) to make sure the cf-operator picks them up before deploying KubeCF. WIP in https://github.com/SUSE/catapult/tree/wait_crd_to_be_ready
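Roughly, the idea is a probe like the following (the resource name, namespace, and timings here are illustrative, not the actual Catapult change): keep trying to create a throwaway quarks resource until the apiserver accepts the kind, which proves the group is really usable, then clean it up before deploying kubecf.

probe=$(mktemp)
cat > "$probe" <<'EOF'
apiVersion: quarks.cloudfoundry.org/v1alpha1
kind: QuarksSecret
metadata:
  name: crd-readiness-probe
spec:
  type: password
  secretName: crd-readiness-probe
EOF

for i in $(seq 1 30); do
  # The apply succeeds only once the apiserver recognizes the quarks.cloudfoundry.org/v1alpha1 kinds.
  if kubectl apply -n cf-operator -f "$probe" > /dev/null 2>&1; then
    kubectl delete -n cf-operator -f "$probe" > /dev/null 2>&1
    echo "quarks CRDs are usable"
    break
  fi
  echo "quarks CRDs not usable yet, retrying ($i/30)"
  sleep 5
done
rm -f "$probe"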

mudler commented 4 years ago

GitHub decided to close the issue automatically since it saw the "fix" keyword in the other repo :man_facepalming: Reopening; sorry for the noise.

mudler commented 4 years ago

The workaround on the Catapult side seems to work; I haven't noticed another occurrence of this issue, but maybe it's too soon to tell. Shall we close this and re-open if it happens again?

f0rmiga commented 4 years ago

Sounds good, @mudler!