GoogleCloudPlatform / spinnaker-for-gcp

Production-ready Spinnaker on GKE

Infrastructure tab in spinnaker pipeline is not populating data instantly #119

Closed chiragthaker closed 5 years ago

chiragthaker commented 5 years ago

On deploying a new replica set via the Kubernetes V2 provider using the highlander strategy (or any other strategy), the infrastructure tab doesn't pick up the deployment status instantly once the deploy is done. It takes around 10-15 minutes to show up, or sometimes it just disappears entirely.

We are stuck on this and are trying to resolve it.

chiragthaker commented 5 years ago

```
[ool-9-thread-13] c.n.s.c.o.DefaultOrchestrationProcessor : java.lang.NullPointerException
	at com.netflix.spinnaker.clouddriver.kubernetes.v2.description.manifest.KubernetesManifestAnnotater.getTraffic(KubernetesManifestAnnotater.java:231)
	at com.netflix.spinnaker.clouddriver.kubernetes.v2.op.manifest.AbstractKubernetesEnableDisableManifestOperation.determineLoadBalancers(AbstractKubernetesEnableDisableManifestOperation.java:73)
	at com.netflix.spinnaker.clouddriver.kubernetes.v2.op.manifest.AbstractKubernetesEnableDisableManifestOperation.operate(AbstractKubernetesEnableDisableManifestOperation.java:131)
	at com.netflix.spinnaker.clouddriver.kubernetes.v2.op.manifest.AbstractKubernetesEnableDisableManifestOperation.operate(AbstractKubernetesEnableDisableManifestOperation.java:39)
	at com.netflix.spinnaker.clouddriver.orchestration.AtomicOperation$operate.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at com.netflix.spinnaker.clouddriver.orchestration.AtomicOperation$operate.call(Unknown Source)
	at com.netflix.spinnaker.clouddriver.orchestration.DefaultOrchestrationProcessor$_process_closure1$_closure2.doCall(DefaultOrchestrationProcessor.groovy:89)
	at com.netflix.spinnaker.clouddriver.orchestration.DefaultOrchestrationProcessor$_process_closure1$_closure2.doCall(DefaultOrchestrationProcessor.groovy)
	at sun.reflect.GeneratedMethodAccessor559.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041)
	at groovy.lang.Closure.call(Closure.java:405)
	at groovy.lang.Closure.call(Closure.java:399)
	at com.netflix.spinnaker.clouddriver.metrics.TimedCallable$ClosureWrapper.call(TimedCallable.groovy:55)
	at com.netflix.spinnaker.clouddriver.metrics.TimedCallable.call(TimedCallable.groovy:82)
	at java_util_concurrent_Callable$call.call(Unknown Source)
	at com.netflix.spinnaker.clouddriver.orchestration.DefaultOrchestrationProcessor$_process_closure1.doCall(DefaultOrchestrationProcessor.groovy:88)
	at com.netflix.spinnaker.clouddriver.orchestration.DefaultOrchestrationProcessor$_process_closure1.doCall(DefaultOrchestrationProcessor.groovy)
	at sun.reflect.GeneratedMethodAccessor556.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041)
	at groovy.lang.Closure.call(Closure.java:405)
	at groovy.lang.Closure.call(Closure.java:399)
	at com.netflix.spinnaker.security.AuthenticatedRequest.lambda$propagate$0(AuthenticatedRequest.java:129)
	at com.netflix.spinnaker.clouddriver.metrics.TimedCallable.call(TimedCallable.groovy:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

duftler commented 5 years ago

When you are not using one of the deployment strategies, does the deploy manifest stage complete and the clusters tab become populated with the new replica set?

Also, what account & namespace are you deploying to? Spinnaker for GCP is configured not to index the spinnaker namespace of the spinnaker-install-account. So if you are deploying with that account, you have to explicitly specify namespaces for your resources.
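For example, a minimal sketch (with hypothetical resource names) of what "explicitly specify namespaces" means here: set `metadata.namespace` on each resource in the manifest rather than relying on the account's default namespace.

```yaml
# Hypothetical names for illustration. Setting metadata.namespace explicitly
# ensures the resource lands in an indexed namespace instead of the
# unindexed spinnaker namespace of the spinnaker-install-account.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app-namespace
```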

chiragthaker commented 5 years ago

Hi @duftler: When you are not using one of the deployment strategies, does the deploy manifest stage complete and the clusters tab become populated with the new replica set? -- No, it takes around 12 minutes or sometimes even more to populate the replica set, so I believe it's some caching issue?

Also, what account & namespace are you deploying to? Spinnaker for GCP is configured not to index the spinnaker namespace of the spinnaker-install-account. So if you are deploying to that account, you have to explicitly specify namespaces for your resources. -- We are not using the default namespace and account; we have a different namespace and account configured for deployment.

Also, for the Kubernetes service account we have enabled live manifest calls (--live-manifest-calls), which helped significantly by removing the force cache refresh step for every job, and hence decreased deployment time significantly.
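For reference, a sketch of how this setting is typically enabled with Halyard (assuming a Halyard-managed install; the account name below is taken from the stage JSON later in this thread and may differ in your setup):

```shell
# Enable live manifest calls so Clouddriver queries the cluster directly
# instead of waiting on its cache. Account name is from this thread.
hal config provider kubernetes account edit spinnaker-app-test-deploy-account \
  --live-manifest-calls true

# Apply the updated configuration.
hal deploy apply
```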

duftler commented 5 years ago

@maggieneterval fyi

maggieneterval commented 5 years ago

Hi @chiragthaker -- The NPE you posted above suggests that Clouddriver is unable to read annotations from your manifest. A couple of questions to help us get to the bottom of this:

chiragthaker commented 5 years ago

Hi @maggieneterval : Spinnaker version 1.15.3

Deploy manifest stage JSON:

```json
{
  "account": "spinnaker-app-test-deploy-account",
  "cloudProvider": "kubernetes",
  "manifestArtifactAccount": "github-artifact-acc",
  "manifestArtifactId": "b04ea0c9-43a7-4ae4-8465-6cc83479f9f8",
  "moniker": { "app": "sample" },
  "name": "Deploy Application",
  "relationships": { "loadBalancers": [], "securityGroups": [] },
  "requiredArtifactIds": [],
  "skipExpressionEvaluation": false,
  "source": "artifact",
  "trafficManagement": {
    "enabled": true,
    "options": {
      "enableTraffic": true,
      "namespace": "demo-app",
      "services": [ "service demo-app" ],
      "strategy": "highlander"
    }
  },
  "type": "deployManifest"
}
```

maggieneterval commented 5 years ago

Thanks @chiragthaker, would you also mind posting your full manifest YAML? Thanks!

chiragthaker commented 5 years ago

Hi @maggieneterval: here is the deployment YAML for our config. We have separate YAML files for the ingress, services, and namespace, but this is the deployment YAML:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: demo-app-replicaset
  namespace: demo-app
spec:
  selector:
    matchLabels:
      name: demo-app-replicaset
  replicas: 1 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        name: demo-app-replicaset
    spec:
      containers:
```

maggieneterval commented 5 years ago

Thanks! Your config all looks good; I'm not sure at first glance why your deploy is failing. I can dig a little deeper into this later this week. In the meantime, could you let me know how you added your Kubernetes account to Spinnaker for GCP, and how you upgraded your Spinnaker version to 1.15?

chiragthaker commented 5 years ago

Hi @maggieneterval: we used the managed scripts for both. For instance:

Here is the one for adding the Spinnaker Kubernetes account: https://github.com/GoogleCloudPlatform/spinnaker-for-gcp/blob/master/scripts/manage/add_gke_account.sh

For upgrading : https://github.com/GoogleCloudPlatform/spinnaker-for-gcp/blob/master/scripts/manage/update_spinnaker_version.sh

chiragthaker commented 5 years ago

@maggieneterval: The deploy fails on the 2nd run when I perform 2 deployments back to back.

So the flow goes like this:

  1. The 1st deploy works fine and the highlander strategy does its job (creates new RS v1, disables the older RS v0, and finally deletes the older RS v0); the job is green and all good.
  2. After a couple of minutes I have another commit and a new deploy. Here the job fails: it does create a new replica set v2, but it fails to disable the older replica set v1 created at step 1. I assume that because of some caching, Spinnaker is not able to fetch the data back from the cluster.
  3. Wait for 10-15 minutes until you see the status on the infrastructure clusters tab, then do another deployment, which will work.

So the main issue here is that successive deployments within a span of about 5 minutes don't work with rollout strategies, which leads me to suspect some weird caching issue.

Hope this explains the exact scenario.
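For anyone debugging a similar flow: a hedged way to observe this from the cluster side between the two back-to-back deploys (namespace is from this thread; `<old-rs-name>` is a placeholder, and the annotation check is an assumption based on the `getTraffic` NPE earlier in the thread):

```shell
# List the ReplicaSets in the namespace used in this thread,
# to see whether both the old and the new RS are present.
kubectl get rs -n demo-app

# Dump a single ReplicaSet's annotations. The NPE above is thrown while
# Clouddriver reads Spinnaker's traffic annotations, so checking whether
# traffic.spinnaker.io/load-balancers is present on the old RS may help.
kubectl get rs <old-rs-name> -n demo-app -o jsonpath='{.metadata.annotations}'
```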

chiragthaker commented 5 years ago

Hi @maggieneterval: Have you had a chance to look at this?

maggieneterval commented 5 years ago

So the main issue here is that successive deployments within a span of about 5 minutes don't work with rollout strategies, which leads me to suspect some weird caching issue.

Thanks for clarifying the issue you're facing -- to confirm, do you have liveManifestCalls enabled? Spinnaker-managed rollout strategies rely on caching and so are not compatible with live manifest mode being enabled.

chiragthaker commented 5 years ago

Hi @maggieneterval: Yes, liveManifestCalls is enabled, and we actually had to enable it; otherwise the force cache refresh stage would take around 12 minutes per stage, and I don't think that was a good solution for us.

maggieneterval commented 5 years ago

Thanks for letting me know, I'm sorry that the force cache refresh task is taking so long. Do you notice any errors in your Orca or Clouddriver logs during the force cache refresh? Unfortunately for the time being you will need to choose between either enabling liveManifestCalls or using Spinnaker-managed rollout strategies, but hopefully we can address the root cause of the long force cache refresh so you are able to disable liveManifestCalls and use the strategies.

duftler commented 5 years ago

Looks from my read of this thread like this can be closed (since we are not intending to add support for rollout strategies with liveManifestCalls enabled).

Please re-open if I've misread this somehow.