
Distributed scheduler building block and service proposal #44

Closed cicoyle closed 4 months ago

cicoyle commented 11 months ago

This design proposes 2 additions:

See the images for a high-level overview and understanding of the architecture involved.

Please review and lmk what you think 🎉

cicoyle commented 11 months ago

linking to release 1.13

olitomlinson commented 11 months ago

I'm assuming the job name is essentially a unique ID within a given namespace?

Does the namespace map directly to the existing dapr namespace concept? Or is it something else?

For example, many building blocks that support multi-tenancy can have a scope of {namespace}.{appId} - State Store, PubSub and Dapr Workflows follow this pattern to ensure complete isolation.

So let's assume I have a single namespace (ns), and the namespace contains two Apps (App X and App Y)

Would the following command performed on App X collide with App Y? http://localhost:<daprPort>/v1.0/job/schedule/prd-db-backup - Can both Apps have their own isolated job called prd-db-backup, or will both Apps be operating on the same job instance?

This may sound somewhat arbitrary, but it's worth having the discussion, because as it stands, Dapr Workflow IDs are scoped to the {namespace}.{appId} - so IMO it would make sense to be consistent by default, unless there is a reason not to.

This would also have the implication that the /job API operations available on the sidecar can only operate on jobs owned by that given App ID, and consequently are not allowed to operate on Jobs that may be owned by other Apps.

Therefore the proposal should try to be more specific around scope. I can see 3 potential non-mutually exclusive options to be considered:

  • {global} scope - can be operated on by any App in the cluster.
  • {namespace} scope - can be operated on by any App contained in a specific namespace
  • {namespace}.{appId} - can only be operated on by a specific app contained in a specific namespace.

olitomlinson commented 11 months ago

I assume this proposal is more biased towards durability and horizontal scaling, over precision.

I.e. we can guarantee that your job will never be invoked before the schedule is due, however, we can't guarantee a ceiling on when the job is invoked after the due time is reached.

If this is true, it's important that any literature is very clear about this, as people will naturally expect perfect precision if it's not explicitly stated.

cicoyle commented 11 months ago

I'm assuming the job name is essentially a unique ID within a given namespace?

Does the namespace map directly to the existing dapr namespace concept? Or is it something else?

For example, many building blocks that support multi-tenancy can have a scope of {namespace}.{appId} - State Store, PubSub and Dapr Workflows follow this pattern to ensure complete isolation.

So let's assume I have a single namespace (ns), and the namespace contains two Apps (App X and App Y)

Would the following command performed on App X collide with App Y? http://localhost:<daprPort>/v1.0/job/schedule/prd-db-backup - Can both Apps have their own isolated job called prd-db-backup, or will both Apps be operating on the same job instance?

This may sound somewhat arbitrary, but it's worth having the discussion, because as it stands, Dapr Workflow IDs are scoped to the {namespace}.{appId} - so IMO it would make sense to be consistent by default, unless there is a reason not to.

This would also have the implication that the /job API operations available on the sidecar can only operate on jobs owned by that given App ID, and consequently are not allowed to operate on Jobs that may be owned by other Apps.

Therefore the proposal should try to be more specific around scope. I can see 3 potential non-mutually exclusive options to be considered:

  • {global} scope - can be operated on by any App in the cluster.
  • {namespace} scope - can be operated on by any App contained in a specific namespace
  • {namespace}.{appId} - can only be operated on by a specific app contained in a specific namespace.

All jobs will be namespaced to the app/sidecar namespace. We will not support global jobs unless all the jobs explicitly fall under the same namespace. There is no broadcast callback to all apps in a namespace with this proposal; that might lean towards using the PubSub building block (as a future implementation detail).
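
For illustration only, here is a minimal Go sketch of what scheduling a job through the sidecar could look like. It assumes the /v1.0/job/schedule/<name> path quoted above, the payload shape from the proposal's example, and a daprPort of 3500; none of these specifics are confirmed by this thread.

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Payload shape borrowed from the proposal's @daily db-backup example.
	payload := []byte(`{
	  "schedule": "@daily",
	  "data": {
	    "task": "db-backup",
	    "metadata": {
	      "db_name": "my-prod-db",
	      "backup_location": "/backup-dir"
	    }
	  }
	}`)

	// Per the answer above, the job is scoped to the calling app/sidecar's
	// namespace, so same-named jobs in different namespaces do not collide.
	resp, err := http.Post(
		"http://localhost:3500/v1.0/job/schedule/prd-db-backup", // 3500 is an assumed daprPort
		"application/json",
		bytes.NewReader(payload),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("schedule response:", resp.Status)
}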

cicoyle commented 11 months ago

I assume this proposal is more biased towards durability and horizontal scaling, over precision.

I.e. we can guarantee that your job will never be invoked before the schedule is due, however, we can't guarantee a ceiling on when the job is invoked after the due time is reached.

If this is true, it's important that any literature is very clear about this, as people will naturally expect perfect precision if it's not explicitly stated.

You should see this explicitly in the proposal now - specifically in the Goals section. Thanks 👍🏻

olitomlinson commented 9 months ago

Does the implementation of this proposal depend upon this proposal https://github.com/dapr/proposals/pull/50 at all, or are they separate endeavours?

ItalyPaleAle commented 9 months ago

Does the implementation of this proposal depend upon this proposal #50 at all, or are they separate endeavours?

The implementation does have a dependency on #50

olitomlinson commented 9 months ago

I've been re-reading this proposal and I have a few comments

  1. There don't appear to be any examples of how one may author a job task, i.e. the user's code that actually gets invoked at the time the job executes. See below, taken from the proposal: a schedule is created which invokes a db-backup task, but how is this task expressed? (An illustrative sketch of one possible shape follows after this list.)
{
  "schedule": "@daily",
  "data": {
    "task": "db-backup",
    "metadata": {
      "db_name": "my-prod-db",
      "backup_location": "/backup-dir"
    }
  }
}
  2. It's my understanding that a motivation of this proposal is to be able to also perform scheduled Service Invocation, scheduled PubSub, and scheduled State Store operations, yet I don't see any examples of how this is achieved via an SDK?

    _(side bar: how would scheduled service invocation work? Who would receive the result? This is exactly where you need Dapr Workflows as an orchestration, which already has a perfect method of doing scheduled and time-implicated workloads)_

    Which brings me onto my 3rd point...

  3. Overlap with Workflows

    I see a significant overlap in point 2, with things that are already achievable via Dapr Workflows. (albeit, not natively possible in the SDK (yet!), it's just a few lines of code to achieve today)

    Going a step further... Dapr Workflows is the perfect language for defining schedules. This proposal uses the example of a daily backup routine, which is already achievable using a super simple Monitor/Eternal Orchestration pattern - Temporal.io and many others are leading the way here, particularly from a Developer Experience perspective.

    Anecdotally, lots of people use Azure Durable Functions for orchestrating backups & cloud infrastructure, so Dapr Workflows is in good company here for this kind of Ops stuff too.

    My 2c, I feel we should be advocating that Dapr adopters learn how to express time-implicated workloads via Dapr Workflows, which can grow with complexity, rather than pointing adopters to use a declarative job model which will struggle to grow beyond the basic repetition of a static schedule (like cron!)

  4. Prioritisation of effort

    At the time of writing, we have the challenge of a Dapr Workflow implementation that can't GA until the Actor Reminders problem is solved (... and a heck of a lot has already been sunk into Dapr Workflows across the runtime, docs and several SDKs over the past year!)

    Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem first, rather than prioritising the delivery of the Distributed Scheduler Building Block and all the concerns that bringing a new Building Block online actually brings (SDK, Docs, Testing etc)

    I think it's fair to say that the community deserves investment in existing Building Blocks that are trying to reach GA/Stable, such as Workflows, Crypto, Distributed Lock, before we bring another Building Block to life.

    Just to be clear, I'm not against the Distributed Scheduler Building Block API surface; however, the overlap with what is easily achievable with Dapr Workflows leaves me wondering if it's needed right now...
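
To make point 1 concrete, below is a purely hypothetical Go sketch of what a db-backup job task (the user's code) might look like if the sidecar delivered the job's data to an HTTP endpoint on the app, similar to pubsub subscriptions. The /job/prd-db-backup route, the payload contract, and the app port are all assumptions; the proposal does not define them.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// jobData mirrors the "data" object from the proposal's schedule example.
type jobData struct {
	Task     string            `json:"task"`
	Metadata map[string]string `json:"metadata"`
}

func main() {
	// Hypothetical callback route the sidecar could invoke when the job fires.
	http.HandleFunc("/job/prd-db-backup", func(w http.ResponseWriter, r *http.Request) {
		var data jobData
		if err := json.NewDecoder(r.Body).Decode(&data); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// The task body itself is ordinary application code.
		log.Printf("running task %q against %s", data.Task, data.Metadata["db_name"])
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":6002", nil))
}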

yaron2 commented 9 months ago

It's my understanding that a motivation of this proposal is to be able to also perform scheduled Service Invocation, scheduled PubSub, and scheduled State Store operations, yet I don't see any examples of how this is achieved via an SDK?

This is out of scope for this proposal. The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation, and others from within their code. In the future we might be able to integrate scheduling into other building blocks, but for now this is completely out of scope.
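
As a minimal sketch of that flow, assuming the app has already been notified that its schedule fired (the delivery mechanism is left abstract here), the user's code can then use any building block it likes. This example publishes an event with the Dapr Go SDK; the pubsub component name "pubsub" and the topic "backups" are placeholders.

package main

import (
	"context"
	"log"

	dapr "github.com/dapr/go-sdk/client"
)

// onJobTriggered represents user code that runs once the schedule fires;
// how the notification arrives is out of scope here.
func onJobTriggered(ctx context.Context) error {
	c, err := dapr.NewClient()
	if err != nil {
		return err
	}
	defer c.Close()

	// Plain building-block usage from user code: publish, invoke, save state, ...
	return c.PublishEvent(ctx, "pubsub", "backups", []byte(`{"status":"started"}`))
}

func main() {
	if err := onJobTriggered(context.Background()); err != nil {
		log.Fatal(err)
	}
}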

I see a significant overlap in point 2, with things that are already achievable via Dapr Workflows. (albeit, not natively possible in the SDK (yet!), it's just a few lines of code to achieve today)

The purpose of this API is to both satisfy the requirements of actor reminders and Dapr workflows as well as provide users with a simple cron-like API that doesn't require them to fully embrace a completely new programming model. Since cron jobs and scheduled reminders are a very common primitive, it would be useful for users to be able to schedule these easily without the additional overhead of a full-fledged programming model. Temporal, for example, introduced a distributed scheduler primitive in addition to their orchestration patterns.

Given this challenge, I would like to recommend that the effort related to this proposal is concentrated on doing the absolute bare minimum implementation to solve the Dapr Workflow / Actor Reminder problem

Proposals do not deal with timelines, milestones, or priorities. Specifically, it's a yes/no decision on whether a design can at any point be implemented in Dapr in the future. The actual priorities for each milestone, in terms of which parts of a proposal to implement, come down to the feature definition and triage stage that maintainers undergo at the beginning of each milestone.

olitomlinson commented 9 months ago

The meaning behind the mention of other building blocks is that users can be notified on a given schedule and then perform things like pub/sub, service invocation and others from within their code

Fair. Although I still don't see any mention of how user code is activated. I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription)

Temporal for example introduced a distributed scheduler primitive in addition to their orchestration patterns

They did!

And interestingly it would appear that they chose to dogfood their own workflows programming model to implement the higher level 'schedule' concept.

With this proposal, it appears we are doing the opposite?

https://temporal.io/blog/how-we-build-it-building-scheduled-workflows-in-temporal

olitomlinson commented 9 months ago

Fair. Although I still don't see any mention of how user code is activated. I assume the schedule needs to be given a path (similar to how pubsub calls a specific HTTP endpoint for a subscription)

I still would like some clarity on how this all manifests

cicoyle commented 4 months ago

Note: I left the List jobs endpoint in the proposal; however, for 1.14 we will not implement this API due to the additional work required for pagination.

cicoyle commented 4 months ago

Following are the performance numbers and associated improvements observed in the PoC tracked in PR:

With HA mode (3 Scheduler instances), we are able to schedule 50,000 actor reminders with an average trigger qps of 4,582. This shows at least a 10x improvement while keeping roughly the same qps. Invoking the Scheduler Job API directly, we observed a qps of ~35,000 for triggering. When creating actor reminders with Dapr 1.13, we saw a qps of 50; with Scheduler, we see a qps of 4,000. This is a drastic (80x) improvement in the scheduling of actor reminders with Scheduler. Overall, Scheduler is ultimately limited by the etcd storage of registered reminders, not by reminder throughput.

olitomlinson commented 4 months ago

Great work Cassie and team!

This all seems very positive and I can't wait to try it out within the context of Workflows, particularly running workflow Apps with more than just 1 or 2 instances, as this is a critical milestone/achievement on the way to GA for actors.

I'm actually on Vacay for the next 2 weeks, but I have my Mac with me, so if someone can publish a container image and some instructions how to integrate it, I will happily run it within my docker compose test harness that I've been using for load testing Workflows - DM me on Discord to discuss further :)

ItalyPaleAle commented 4 months ago

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interesting would be measuring how they are executed.

artursouza commented 4 months ago

With HA mode (3 Scheduler instances), we are able to schedule 50,000 actor reminders with an average trigger qps of 4,582. This shows at least a 10x improvement while keeping roughly the same qps. Invoking the Scheduler Job API directly, we observed a qps of ~35,000 for triggering. When creating actor reminders with Dapr 1.13, we saw a qps of 50; with Scheduler, we see a qps of 4,000. This is a drastic (80x) improvement in the scheduling of actor reminders with Scheduler. Overall, Scheduler is ultimately limited by the etcd storage of registered reminders, not by reminder throughput.

@cicoyle what are the perf numbers for executing reminders? The scheduling was always an issue in the current implementation, but more interesting would be measuring how they are executed.

That is a great question. If by "executing reminders" you mean the reminders being invoked by the sidecar, then those numbers are present and referred to as "triggers" in the comment from @cicoyle. If you are referring to something else by "executing reminders", please clarify.

Thanks.

ItalyPaleAle commented 4 months ago

I didn't understand those were the "triggers", thanks. In the past we discussed the workflow perf test too, since that's heavily reliant on reminders.

cicoyle commented 4 months ago

When using the new Scheduler service, we can confirm that we achieved the ability to increase throughput as scale grows, in contrast to the existing reminder system, which shows performance degradation as scale increases. As expected, a decrease in some performance metrics is observed at the smallest parallelism and scale (details below), and performance then improves as parallelism and concurrency grow.

While testing parallel workflows with a max concurrent workflow count of 110 and 440 total workflow runs, we saw performance improvements to the tune of 275% for the rate of iterations. Furthermore, Scheduler achieves a 267% improvement in both data sent and received, proving that Scheduler can handle a higher rate of data while significantly decreasing the maximum total time taken to complete the request by 62%. These numbers demonstrate the horizontal scalability of workflows with Scheduler used as the underlying reminder system.

While testing parallel workflows with a max concurrent count between 60 and 90, we saw performance improvements of about 71%, whereas the existing reminder system drops by 44% in comparison and is unable to recover from the performance drop as the scale of workflows continues to grow.

At 350 max concurrent workflows and 1400 iterations, we see performance 50% higher than the existing reminder system. At the smaller scale of 30 max concurrent workflows and a 300 total workflow count, we observed a 27% decrease in the rate of iterations and a 25% decrease in data sent/received, with a 56% improvement in latency (the maximum total time taken to complete the request).

As part of the PoC, we ran several performance tests with 100k and 10k workflows to examine other bottlenecks in the Workflow system, unrelated to the scheduling/reminder sub-systems. Several areas were identified as potential bottlenecks for Workflows as a whole and these should be an area of focus in coming iterations.

yaron2 commented 4 months ago

+1 binding

daixiang0 commented 4 months ago

I think we can continue to optimize it for performance.

+1 binding

mikeee commented 4 months ago

+1 non-binding

JoshVanL commented 4 months ago

Recusing myself from voting.

artursouza commented 4 months ago

Recusing myself from voting.