@mudler @jimmykarily can we do something about this issue? It's becoming a bottleneck. https://concourse.suse.dev/teams/main/pipelines/kubecf/jobs/deploy-diego/builds/302
@viovanov @f0rmiga Heads up.
@bikramnehra how often is this happening? Can you pinpoint it to a specific part of the hardware this CI runs on?
@mudler and I had an initial look and suspect that the operator was not ready yet when the pipeline started deploying kubecf, but this needs further investigation (which is becoming hard since the cluster goes away).
I tried reproducing this locally but no luck there :(
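For context, a minimal sketch of the readiness gate being discussed here: block the kubecf deployment until the cf-operator webhook endpoint and the quarks CRDs show up. The resource names match the install.sh logs further down; the helper function and timeout values are just an illustration, not what Catapult actually runs:

```bash
#!/usr/bin/env bash
# Poll a kubectl query until it succeeds, or give up after a timeout.
wait_for() {
  local timeout=300 interval=5 elapsed=0
  until "$@" > /dev/null 2>&1; do
    sleep "$interval"
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
  done
}

# Gate the kubecf deployment on the webhook having endpoints
# and on the quarks CRDs being registered.
wait_for kubectl get endpoints -n cf-operator cf-operator-webhook -o name
wait_for kubectl get crd boshdeployments.quarks.cloudfoundry.org -o name
wait_for kubectl get crd quarksjobs.quarks.cloudfoundry.org -o name
```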
I have seen this happening relatively frequently, e.g. builds 302, 301, and many previous builds have this problem.
I am not entirely sure the problem can be attributed to hardware, but we have seen the following issues which were clearly hardware-related:
[./up.sh] [backend:ekcp] [cluster:kubecf-eirini-1585111658-1f94e87] Loading
{"Output":"","AvailableClusters":null,"Clusters":null,"ActiveEndpoints":null,"ClusterIPs":null,"LocalClusters":null,"Error":"No available resources"}/tmp/build/b0d51b9f/catapult/buildkubecf-eirini-1585111658-1f94e87 /tmp/build/b0d51b9f/catapult/backend/ekcp
[./kubeconfig.sh] [backend:ekcp] [cluster:kubecf-eirini-1585111658-1f94e87] Loading
The last error, No available resources, should not happen after #561 was merged, at least if we don't try to squeeze too many clusters into the available hardware. Nothing to say about the reported issue though; it needs investigation.
Hard to reproduce locally (I didn't manage to), but I think https://github.com/SUSE/catapult/pull/148 might help; it's definitely worth a shot imho.
After seeing the comment from @jandubois (https://github.com/cloudfoundry-incubator/kubecf/pull/572#issuecomment-606161513) I'm starting to suspect this is not a CI issue anymore; just a hunch that needs investigation, though.
I'm saying that because Catapult does now check that the CRDs are in place, and as you can see from the logs they are indeed present:
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Wait for cf-operator to be ready
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get endpoints -n cf-operator cf-operator-webhook -o name
endpoints/cf-operator-webhook
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarksstatefulsets.quarks.cloudfoundry.org -o name
customresourcedefinition.apiextensions.k8s.io/quarksstatefulsets.quarks.cloudfoundry.org
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarkssecrets.quarks.cloudfoundry.org -o name
customresourcedefinition.apiextensions.k8s.io/quarkssecrets.quarks.cloudfoundry.org
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd quarksjobs.quarks.cloudfoundry.org -o name
customresourcedefinition.apiextensions.k8s.io/quarksjobs.quarks.cloudfoundry.org
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] Waiting for kubectl get crd boshdeployments.quarks.cloudfoundry.org -o name
customresourcedefinition.apiextensions.k8s.io/boshdeployments.quarks.cloudfoundry.org
[./install.sh] [backend:imported] [cluster:kind-MGJkY2I0Yzg3YmJjMWRmZWM0Nzk1NDdh] cf-operator ready
secret/susecf-scf.var-cf-admin-password created
Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "BOSHDeployment" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksJob" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksSecret" in version "quarks.cloudfoundry.org/v1alpha1", unable to recognize "": no matches for kind "QuarksStatefulSet" in version "quarks.cloudfoundry.org/v1alpha1"]
It might be that we need to wait for other resources to be ready as well. @viovanov, any thoughts?
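One guess on my side (not confirmed): helm resolves kinds via API discovery, so the CRD objects can already exist while the quarks.cloudfoundry.org/v1alpha1 group is not yet established and served. A rough sketch of a stricter check than `kubectl get crd ... -o name`, using the four CRD names from the logs above:

```bash
# Wait until each quarks CRD reports the Established condition,
# not merely until the CRD object can be fetched.
for crd in boshdeployments quarksjobs quarkssecrets quarksstatefulsets; do
  kubectl wait --for condition=established --timeout=120s \
    "crd/${crd}.quarks.cloudfoundry.org"
done

# Also make sure API discovery already serves the group, which is what
# helm relies on when it resolves kinds like BOSHDeployment.
until kubectl api-resources --api-group=quarks.cloudfoundry.org 2>/dev/null \
      | grep -q BOSHDeployment; do
  sleep 2
done
```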
On the other hand it seems to happen less often now, so it might also be a (slow) flaky VM causing this. I could recreate the stack to check whether it happens again, but I think it shouldn't happen anyway: if there is a race condition like this, users might hit it as well.
Spoke with @jimmykarily and @viovanov; now trying to deploy other quarks resources (quarks statefulset, and so on) to make sure the cf-operator picks them up before deploying KubeCF. WIP in https://github.com/SUSE/catapult/tree/wait_crd_to_be_ready
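For anyone curious, the rough idea behind that branch is a canary check: create a throwaway quarks resource and wait until the operator actually reconciles it before deploying KubeCF. A minimal sketch, not the actual Catapult code; the QuarksSecret spec fields and the cf-operator namespace are my assumptions and may differ between cf-operator versions:

```bash
# Apply a throwaway QuarksSecret and wait until the operator generates the
# backing Kubernetes secret, proving the controller is really processing resources.
# (Field names below are assumed from the quarks-secret examples, not verified here.)
kubectl apply -n cf-operator -f - <<'EOF'
apiVersion: quarks.cloudfoundry.org/v1alpha1
kind: QuarksSecret
metadata:
  name: readiness-canary
spec:
  type: password
  secretName: readiness-canary-generated
EOF

until kubectl get secret -n cf-operator readiness-canary-generated > /dev/null 2>&1; do
  sleep 2
done

# Clean up the canary before the real deployment.
kubectl delete quarkssecret -n cf-operator readiness-canary
```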
GitHub decided to close the issue automatically since it saw the "fix" keyword in the other repo :man_facepalming: Reopening, sorry for the noise.
The workaround on the Catapult side of things seems to work; I haven't noticed another occurrence of this issue, but maybe it's too soon to say. Shall we close and re-open if it happens again?
Sounds good, @mudler!
Describe the bug: The kubecf pipeline is failing intermittently with the following error message:
https://concourse.suse.dev/teams/main/pipelines/kubecf/jobs/deploy-diego/builds/268
To Reproduce: Trigger a kubecf build and this error will appear intermittently.
Expected behavior: CI should pass without such failures.