Allow user to completely remove/disable scheduler in 1.14+

dapr / dapr

Dapr is a portable, event-driven, runtime for building distributed applications across cloud and edge.

https://dapr.io

Apache License 2.0

24.16k stars 1.91k forks source link

Allow user to completely remove/disable scheduler in 1.14+ #8100

Open artursouza opened 2 months ago

artursouza commented 2 months ago

In what area(s)?

/area runtime

/area operator

/area placement

/area docs

/area test-and-release

What version of Dapr?

1.1.x 1.0.x edge: output of git describe --dirty 1.14.x

Expected Behavior

If user does not want to use scheduler, via the Helm chart setting global.scheduler.enabled, should allow scheduler's k8s resources not to be provisioned at all (no statefulset, service, nothing). The sidecar should also be smart enough (via env var in the sidecar injector, maybe) to not try to connect to scheduler on start up. Job's API should error proactively saying that scheduler is not available instead of trying to perform any operation.

This should work equally if user does not install scheduler's container on Docker for standalone mode.

Actual Behavior

failed to watch scheduler jobs, retrying: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: lookup dapr-scheduler-server-1.dapr-scheduler-server.dapr-system.svc.cluster.local on 172.20.0.10:53: no such host\""

Steps to Reproduce the Problem

Disable scheduler (via global.scheduler.enabled or setting replica to 0 for scheduler's statefulset) and try to run any app on K8s.

Release Note

RELEASE NOTE: UPDATED control plane and sidecar to work without Scheduler service

artursouza commented 2 months ago

As a workaround, set the following:

For K8s mode: set annotation on app's Deployment: dapr.io/scheduler-host-address: ""
For standalone mode, set env: export DAPR_SCHEDULER_HOST_ADDRESS=""

cicoyle commented 2 months ago

To fix part of this issue, we should have dynamic host resolution via the injector, rn that code isnt dynamic. We can also be more graceful with our errors in the jobs api, bc it assumes scheduler is there if the jobs api is being used.

rgitpriv commented 4 weeks ago

As a workaround, set the following:

For K8s mode: set annotation on app's Deployment: dapr.io/scheduler-host-address: ""

For standalone mode, set env: export DAPR_SCHEDULER_HOST_ADDRESS=""

Hi, that workaround is not working for us on 1.14.4, can you please confirm? I've tried with our deployments on EKS but here's a repro using DAPR resources, on a clean minikube install with k8s version 1.29.7:

Install 1.14.4 via dapr init -k --runtime-version=1.14.4 --set dapr_placement.cluster.forceInMemoryLog=true --set dapr_scheduler.cluster.inMemoryStorage=true
Scale the scheduler statefulset to zero with k scale statefulset -n dapr-system dapr-scheduler-server --replicas=0
Pull down https://github.com/dapr/quickstarts
Edit quickstarts/tutorials/hello-kubernetes/deploy/node.yaml
Under the deployment, change spec -> template -> annotations to include dapr.io/scheduler-host-address: "" under the existing dapr.io/enabled: "true" annotation.
Apply the tutorial manifests via k apply -f tutorials/hello-kubernetes/deploy/
Verify the annotation is applied: k get pod -l app=node -o yaml | grep scheduler-host-address
View the logs of the node app via k logs -f -l app=node -c daprd

The logs demonstrate the application is trying to communicate with the scheduler despite the annotation.

time="2024-10-24T00:18:56.859283132Z" level=error msg="failed to watch scheduler jobs, retrying: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: lookup dapr-scheduler-server-0.dapr-scheduler-server.dapr-system.svc.cluster.local on 10.96.0.10:53: no such host\"" app_id=nodeapp instance=nodeapp-f49d5496d-cg4rl scope=dapr.runtime.scheduler type=log ver=1.14.4

I've tried several k8s versions to no avail, this is an issue for us because the scheduler statefulset occasionally crashes which then triggers another bug where the failed scheduler instance can't rejoin the cluster with error message like "error running scheduler: member has already been bootstrapped", which then causes a crash loop. The pods that are talking to that statefulset by the hard coded id start spamming errors in the daprd sidecar. I end up scaling the set to 0 then recreate to resolve the issue.

In dev+prod we use HA mode so we have the 3 member statefulset. We also use the in-memory option, wanting to keep our clusters stateless with no PV/PVCs if at all possible.

I've seen a lot of discussion around improving the scheduler, we have a new product to release and have been using 1.14.x to dev/test on so we'd rather not go back to 1.13 if possible.

yaron2 commented 4 weeks ago

As a workaround, set the following:

For K8s mode: set annotation on app's Deployment: dapr.io/scheduler-host-address: ""

For standalone mode, set env: export DAPR_SCHEDULER_HOST_ADDRESS=""

Hi, that workaround is not working for us on 1.14.4, can you please confirm? I've tried with our deployments on EKS but here's a repro using DAPR resources, on a clean minikube install with k8s version 1.29.7:

Install 1.14.4 via dapr init -k --runtime-version=1.14.4 --set dapr_placement.cluster.forceInMemoryLog=true --set dapr_scheduler.cluster.inMemoryStorage=true

Scale the scheduler statefulset to zero with k scale statefulset -n dapr-system dapr-scheduler-server --replicas=0

Pull down https://github.com/dapr/quickstarts

Edit quickstarts/tutorials/hello-kubernetes/deploy/node.yaml

Under the deployment, change spec -> template -> annotations to include dapr.io/scheduler-host-address: "" under the existing dapr.io/enabled: "true" annotation.

Apply the tutorial manifests via k apply -f tutorials/hello-kubernetes/deploy/

Verify the annotation is applied: k get pod -l app=node -o yaml | grep scheduler-host-address

View the logs of the node app via k logs -f -l app=node -c daprd

The logs demonstrate the application is trying to communicate with the scheduler despite the annotation.

time="2024-10-24T00:18:56.859283132Z" level=error msg="failed to watch scheduler jobs, retrying: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: lookup dapr-scheduler-server-0.dapr-scheduler-server.dapr-system.svc.cluster.local on 10.96.0.10:53: no such host\"" app_id=nodeapp instance=nodeapp-f49d5496d-cg4rl scope=dapr.runtime.scheduler type=log ver=1.14.4

I've tried several k8s versions to no avail, this is an issue for us because the scheduler statefulset occasionally crashes which then triggers another bug where the failed scheduler instance can't rejoin the cluster with error message like "error running scheduler: member has already been bootstrapped", which then causes a crash loop. The pods that are talking to that statefulset by the hard coded id start spamming errors in the daprd sidecar. I end up scaling the set to 0 then recreate to resolve the issue.

In dev+prod we use HA mode so we have the 3 member statefulset. We also use the in-memory option, wanting to keep our clusters stateless with no PV/PVCs if at all possible.

I've seen a lot of discussion around improving the scheduler, we have a new product to release and have been using 1.14.x to dev/test on so we'd rather not go back to 1.13 if possible.

Can you paste the logs of the crashing scheduler instance(s)? We need to understand why it's crashing

rgitpriv commented 4 weeks ago

Can you paste the logs of the crashing scheduler instance(s)? We need to understand why it's crashing

Sure, I can try to dig up a repro. It wasn't completely consistent behavior. Really hoping to figure out how to just disable the dapr sidecar from trying to connect to scheduler for now, since we're not using it and it's being problematic.

I do have some logs. So what we see is the CrashLoopBackOff

k get pods -n dapr-system -l app=dapr-scheduler-server NAME READY STATUS RESTARTS AGE dapr-scheduler-server-0 0/1 CrashLoopBackOff 23 (4m5s ago) 98m dapr-scheduler-server-1 1/1 Running 4 (11d ago) 11d dapr-scheduler-server-2 1/1 Running 4 (11d ago) 11d

Then it can't rejoin (there's already a bug on it floating around)

level=fatal msg="error running scheduler: member e765f6aceb4d1e29 has already been bootstrapped" instance=dapr-scheduler-server-0 scope=dapr.scheduler type=log ver=1.14.4

FWIW this also happened on previous versions of 1.14.x and is the reason we upgraded to latest 1.14.4

I'll attach two logfiles, scrubbed of our application data. They're the same log, just one text and one json (from loki). What I see is around 2024-10-04T03:55:31 we get a flurry of "Error receiving from connection: rpc error: code = Canceled desc = context canceled", then a termination signal received, but no details as to what triggered that termination signal. Then you can see it restart and the issues with "already been bootstrapped".

My current working theory is that there was node memory pressure, the statefulset pod got evicted and then crashLoop started because of the bug in rejoining the etcd cluster. We force in-memory scheduler and placement when initializing, because we want to keep our clusters stateless. It's just a gut feeling, but we were also scaling up nodes via autoscaler around the same time due to memory pressure.

I'll have to catch it live again though to be sure.

scheduler.json scheduler.txt

edit; formatting

cicoyle commented 4 weeks ago

Hi @rgitpriv - The Scheduler does not currently auto-scale. We do have a proposal open to implement the auto-scaling of Scheduler post v1.15.

By running the k scale command, that is dynamically scaling the Scheduler, which is currently not supported. Please use this command to set the Scheduler to be disabled to resolve your issue:

dapr init -k --runtime-version=1.14.4 --set dapr_placement.cluster.forceInMemoryLog=true --set dapr_scheduler.cluster.inMemoryStorage=true --set global.scheduler.enabled=false

Note the --set global.scheduler.enabled=false flag.

The error you mentioned, Error receiving from connection: rpc error: code = Canceled desc = context canceled comes from a go routine that is watching the dapr sidecars. It looks like the sidecar is going down and context cancelling and closing the stream connection. I haven't seen this causing the Scheduler to crash.

For example, if I run the Scheduler locally with a sidecar, then kill my app + sidecar, then I see some Scheduler logs to the tune of:

WARN[0057] Error receiving from connection: rpc error: code = Canceled desc = context canceled  instance=Cassandras-MacBook-Pro.local scope=dapr.runtime.scheduler type=log ver=unknown
INFO[0057] Removing a Sidecar connection from Scheduler for ...

However, the Scheduler does not crash.

If you are seeing the Scheduler crash without scaling it, since that is currently not supported, please provide the logs.

rgitpriv commented 4 weeks ago

Hi @cicoyle - your note to completely disable the scheduler at install time resolved my issue. The bug happens if the scheduler was enabled at install and you want to later disable. So we're all good now, thanks. Some additional notes:

By running the k scale command, that is dynamically scaling the Scheduler, which is currently not supported.

My apologies for not being clear, we also use --enable-ha=true flag so we get 3 scheduler pods. I was using the scale command to recreate the statefulset pods entirely once a pod gets in CrashLoop. By scaling to 0 then back to the HA number of 3, the error goes away as the etcd cluster is rebuilt.

If you are seeing the Scheduler crash without scaling it, since that is currently not supported, please provide the logs.

I am able to consistently repro the CrashLoop by simply terminating one of the nodes a statefulset pod runs on. When the pod is recreated, it goes into a CrashLoop, the logs are:

time="2024-10-24T18:37:16.146481877Z" level=fatal msg="error running scheduler: member e765f6aceb4d1e29 has already been bootstrapped" instance=dapr-scheduler-server-0 scope=dapr.scheduler type=log ver=1.14.4

xendren commented 1 week ago

Any ideas when the sidecar portion of this issue will be fixed? We have disabled the job scheduler during install using the global.scheduler.enabled=false, but our sidecar logs are swamped with the "failed to watch scheduler jobs" error messages. We have too many applications using Dapr in k8s to update every one of them with the workaround annotation, and the frequency is killing our log aggregation from k8s.

yaron2 commented 1 week ago

Any ideas when the sidecar portion of this issue will be fixed? We have disabled the job scheduler during install using the global.scheduler.enabled=false, but our sidecar logs are swamped with the "failed to watch scheduler jobs" error messages. We have too many applications using Dapr in k8s to update every one of them with the workaround annotation, and the frequency is killing our log aggregation from k8s.

1.15 for sure, but it looks like we may need to backport this specific fix to 1.14 sooner. @cicoyle what do we need for that?

@xendren please ping me in a DM on Dapr Discord

cicoyle commented 1 week ago

This is actively being worked on. We need the cron library refactor to restart the leadership and cron library in general upon leadership changes then pipe that thru to to scheduler then to dapr. I am actively working on this, but it is non trivial. It will go in for 1.15.

@xendren, Did you by chance try restarting your daprd sidecars after setting global.scheduler.enabled=false. If the sidecars are throwing that log then they likely need to be restarted bc they think scheduler is enabled.

xendren commented 1 week ago

Any ideas when the sidecar portion of this issue will be fixed? We have disabled the job scheduler during install using the global.scheduler.enabled=false, but our sidecar logs are swamped with the "failed to watch scheduler jobs" error messages. We have too many applications using Dapr in k8s to update every one of them with the workaround annotation, and the frequency is killing our log aggregation from k8s.

1.15 for sure, but it looks like we may need to backport this specific fix to 1.14 sooner. @cicoyle what do we need for that?

@xendren please ping me in a DM on Dapr Discord

Apologies. We have been too busy this week. We decided to revert to Dapr v1.13.6 and wait until a release without the issues. Before I downgraded, I noticed there were 3 scheduler pods, I believe, that were showing in error due to no persistent volume being available and no persistent volume class available to provision the PVs. I'm guessing that was the cause of our initial error where the daprd sidecar was in a crashback loop with failure to connect.

xendren commented 1 week ago

This is actively being worked on. We need the cron library refactor to restart the leadership and cron library in general upon leadership changes then pipe that thru to to scheduler then to dapr. I am actively working on this, but it is non trivial. It will go in for 1.15.

@xendren, Did you by chance try restarting your daprd sidecars after setting global.scheduler.enabled=false. If the sidecars are throwing that log then they likely need to be restarted bc they think scheduler is enabled.

I did a redeploy on our deployment, which started up a new pod after globally disabling the job scheduler. I will see if I can try to upgrade one of our clusters back to v1.14.4 with the scheduler initially disabled to see if we still see the constant log messages.

emmanuel-messegue commented 1 week ago

Hi, I am using dapr in self-hosted mode without docker on linux in slim mode I got the scheduler error too. I found out that if I start manually $HOME/.dapr/bin/scheduler the error disapear. Is it a good idea though ? And I could not find the scheduler process on windows in the $env:HOMEPATH.dapr\bin directory. Where is it please ?

xendren commented 3 days ago

Any ideas when the sidecar portion of this issue will be fixed? We have disabled the job scheduler during install using the global.scheduler.enabled=false, but our sidecar logs are swamped with the "failed to watch scheduler jobs" error messages. We have too many applications using Dapr in k8s to update every one of them with the workaround annotation, and the frequency is killing our log aggregation from k8s.

1.15 for sure, but it looks like we may need to backport this specific fix to 1.14 sooner. @cicoyle what do we need for that? @xendren please ping me in a DM on Dapr Discord

Apologies. We have been too busy this week. We decided to revert to Dapr v1.13.6 and wait until a release without the issues. Before I downgraded, I noticed there were 3 scheduler pods, I believe, that were showing in error due to no persistent volume being available and no persistent volume class available to provision the PVs. I'm guessing that was the cause of our initial error where the daprd sidecar was in a crashback loop with failure to connect.

One follow up. We did notice the scheduler-data-dir PVC's were getting this error after Dapr 1.14.4 install.
no persistent volumes available for this claim and no storage class is set