druid-io / druid-operator

Druid Kubernetes Operator
Other
205 stars 93 forks source link

Druid pods taking more time to come up when security enabled #301

Open bharathkuppala opened 2 years ago

bharathkuppala commented 2 years ago

Hello All, druid-operator: v0.0.7 druid: v0.22.1

We are trying to enable basic security for druid and we found out that it was taking Approx. 16 min for entire cluster to come up. Upon investigation on logs we found out that there were multiple retries against coordinator trying to fetch some users data.

Logs from middleManager: {"timeMillis":1653992838211,"thread":"main","level":"WARN","loggerName":"org.apache.druid.java.util.common.RetryUtils","message":"Retrying (1 of 9) in 727ms.","thrown":{"commonElementCount":0,"localizedMessage":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","message":"No content to map due to end-of-input\n at [Source: (byte[])\"\"; line: -1, column: 0]","name":"com.fasterxml.jackson.databind.exc.MismatchedInputException","extendedStackTrace":[{"class":"com.fasterxml.jackson.databind.exc.MismatchedInputException","method":"from","file":"MismatchedInputException.java","line":59,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_initForReading","file":"ObjectMapper.java","line":4360,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"_readMapAndClose","file":"ObjectMapper.java","line":4205,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"com.fasterxml.jackson.databind.ObjectMapper","method":"readValue","file":"ObjectMapper.java","line":3292,"exact":false,"location":"jackson-databind-2.10.5.1.jar","version":"2.10.5.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"tryFetchUserMapsFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":400,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"lambda$fetchUserAndRoleMapFromCoordinator$4","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":330,"exact":false,"location":"?","version":"?"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":129,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":81,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":163,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.RetryUtils","method":"retry","file":"RetryUtils.java","line":153,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"fetchUserAndRoleMapFromCoordinator","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":328,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"initUserMaps","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":457,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager","method":"start","file":"CoordinatorPollingBasicAuthorizerCacheManager.java","line":116,"exact":true,"location":"druid-basic-security-0.22.1.jar","version":"0.22.1"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke0","file":"NativeMethodAccessorImpl.java","line":-2,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.NativeMethodAccessorImpl","method":"invoke","file":"NativeMethodAccessorImpl.java","line":62,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"sun.reflect.DelegatingMethodAccessorImpl","method":"invoke","file":"DelegatingMethodAccessorImpl.java","line":43,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"java.lang.reflect.Method","method":"invoke","file":"Method.java","line":498,"exact":false,"location":"?","version":"1.8.0_275"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler","method":"start","file":"Lifecycle.java","line":446,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.java.util.common.lifecycle.Lifecycle","method":"start","file":"Lifecycle.java","line":341,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.guice.LifecycleModule$2","method":"start","file":"LifecycleModule.java","line":143,"exact":true,"location":"druid-core-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.GuiceRunnable","method":"initLifecycle","file":"GuiceRunnable.java","line":115,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.ServerRunnable","method":"run","file":"ServerRunnable.java","line":63,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"},{"class":"org.apache.druid.cli.Main","method":"main","file":"Main.java","line":113,"exact":true,"location":"druid-services-0.22.1.jar","version":"0.22.1"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}

In the above log its looking for some method which would fetch some information from coordinator and would retry for 9 times.

Wanted to know does druid-operator impose any restrictions upon druid installation like order its broker, historicals and middlemanager which would comeup first during scratch installation then comes coordinator and router?

If above is true then pods will fail looking for coordinator? Any help on this?

bharathkuppala commented 2 years ago

@AdheipSingh : Can you please provide any info on this ? if you know already

AdheipSingh commented 2 years ago

sorry missed this out, the druid operator has nothing do with security enabled on druid or not. Druid operator supports securityContexts on the kubernetes pods it deploys and manages. When rolling deploy is enabled to true the druid operator upgrades the nodes in the order as described here https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L1765 When the operator installs the cluster all pods are deployed parallel.

bharathkuppala commented 2 years ago

Thanks for the rly @AdheipSingh and sry for not responding actually we were testing this and found out that when we give all node types as a statefulset like you said all pods are deployed in parallel, But let's say in my case we are having Deployments- Broker, Coordinator and Router Statefulsets - Historicals and MiddleManagers

Now when we install we see along with 2 statefulsets only one deployment is getting up hence stopping other deployments from coming up.

Can you please check this behavior if not already done? Or any suggestions would be of a great help!!

Stack trace 2022-09-08T05:40:35.675Z ERROR controller-runtime.manager.controller.druid Reconciler error {"reconciler group": "druid.apache.org", "reconciler kind": "Druid", "name": "tsdb-apache-druid", "namespace": "nom-backend", "error": "StatefulSet.apps \"druid-tsdb-apache-druid-brokers\" not found"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2 /home/jenkins/workspace/NM/Neo/operators/druid-operator/druid-operator-druid-operator-0.0.8/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 After above stack trace it would stop near broker preventing other deployments to come up Thanks!!

bharathkuppala commented 2 years ago

Intermediately i see this issue after doing scratch install multiple times. But i always see above stack trace no sure why :(

AdheipSingh commented 2 years ago

@bharathkuppala https://github.com/druid-io/druid-operator/blob/master/controllers/druid/handler.go#L182

Be deployment or statefulset in any combination, all pods should come up in parallel for the 1st installation.

Can you tell me the generation number in the CR metadata, if its > 1 and rollingDeploy is kept to true in your CR, upgrades and changes will happen incremental.

Use Kind: Deployemnt to switch to deployment by default its statefulset

bharathkuppala commented 2 years ago

@AdheipSingh : Yes like you said there is no problem while doing the upgrade with generation number > 1 it would follow the prescribed order. But somehow i see with podMangamentPolicy as parallel both Historical and MM are deployed next broker would comeup stopping other deployments

bharathkuppala commented 2 years ago

We are trying our druid cluster in open shift. Just wanted to know do we have to check something on open shift side to troubleshoot this behavior? Because in operator we saw it often hitting this line and giving else error. https://github.com/druid-io/druid-operator/blob/5abef01594ff953ba8a092aa7d392c06d682b167/controllers/druid/handler.go#L680

AdheipSingh commented 2 years ago

@bharathkuppala IMHO it should not arise, i will try to re-create this issue. Just to double check, did you switch between sts and deployments in Kind ?

bharathkuppala commented 2 years ago

@AdheipSingh : Yes i did switch but still i see this issue

bharathkuppala commented 2 years ago

druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 0s druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 0s druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 1s druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 1s druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Pending 0 1s druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s druid-tsdb-apache-druid-middlemanagers-0 0/1 Pending 0 2s druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 2s druid-tsdb-apache-druid-middlemanagers-1 0/1 Pending 0 2s druid-tsdb-apache-druid-historicals-0 0/1 Pending 0 2s druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-historicals-1 0/1 Pending 0 3s druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-historicals-0 0/1 ContainerCreating 0 6s druid-tsdb-apache-druid-middlemanagers-1 0/1 ContainerCreating 0 8s druid-tsdb-apache-druid-middlemanagers-0 0/1 ContainerCreating 0 8s druid-tsdb-apache-druid-historicals-1 0/1 ContainerCreating 0 9s druid-tsdb-apache-druid-historicals-1 0/1 Running 0 1m druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 0/1 Running 0 1m druid-tsdb-apache-druid-brokers-86c4849b47-4598x 0/1 Running 0 1m druid-tsdb-apache-druid-middlemanagers-1 0/1 Running 0 1m druid-tsdb-apache-druid-historicals-0 0/1 Running 0 1m druid-tsdb-apache-druid-middlemanagers-0 0/1 Running 0 2m druid-tsdb-apache-druid-historicals-0 1/1 Running 0 13m druid-tsdb-apache-druid-middlemanagers-1 1/1 Running 0 13m druid-tsdb-apache-druid-brokers-86c4849b47-pkp5d 1/1 Running 0 14m druid-tsdb-apache-druid-brokers-86c4849b47-4598x 1/1 Running 0 14m druid-tsdb-apache-druid-middlemanagers-0 1/1 Running 0 14m druid-tsdb-apache-druid-historicals-1 1/1 Running 0 14m druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 0s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 1s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 1s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Pending 0 2s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Pending 0 2s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 0/1 Running 0 3s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 0/1 Running 0 4s druid-tsdb-apache-druid-coordinators-66bf9f7545-gwmcv 1/1 Running 0 30s druid-tsdb-apache-druid-coordinators-66bf9f7545-xtkt5 1/1 Running 0 30s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 0s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Pending 0 0s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Pending 0 1s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 1s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 1s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 ContainerCreating 0 2s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 ContainerCreating 0 3s druid-tsdb-apache-druid-routers-56d5bd696-9t9d8 0/1 Running 0 3s druid-tsdb-apache-druid-routers-56d5bd696-ddm2d 0/1 Running 0 4s

This is what is happening in my current setup once historical, MM and Broker pods are up then only coordinator and routers are triggered

AdheipSingh commented 2 years ago

are you adding any node selectors ? are you coordinator/routers get scheduled on different nodes if you are rollingDeploy is set to false, all pods should come in parallel, there is no conditional check to wait for druid nodes on the first installation.

bharathkuppala commented 2 years ago

@AdheipSingh : No we are not using any node selectors for scheduling the pods. And we are using rollingDeploy as true because we want our pods to have an upgrade based on prescribed order.

Yea i see on druid-operator code we are not checking generation so pods should come in parallel. I can share logs of druid-operator so we could have a complete trace.

AdheipSingh commented 2 years ago

@bharathkuppala whats your podmanagement policy ? try setting to Parallel

bharathkuppala commented 1 year ago

@AdheipSingh : Yes it is by default parallel right for sts?