Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Roadmap for fast/high-throughput execution workflow #1958

Closed imaffe closed 3 years ago

imaffe commented 4 years ago

Our team is searching for a workflow execution engine that can orchestrate chains of microservice calls with less than 100ms overhead and can support 1k executions per second (on a single machine). This engine is very likely to become a sort of integration layer over many hidden microservices and is aimed at processing a very large volume of requests/events (perhaps 100k per second).

I looked at AWS Step Functions, in particular Express Workflows. I found that the Standard workflows are very similar to what Conductor is like now, but not many open-source engines can provide the same throughput as Express Workflows (100,000+ requests per second).

According to issue #1183, Conductor faces the same challenge and plans to provide performance guarantees similar to AWS Express Workflows. So I'm wondering if there is a roadmap: what is the design, when will it come out, and what can we community developers do to help with this big improvement?

BTW, I looked at a lot of other workflow engines to find products similar to Express Workflows, including Jenkins, Airflow, Azkaban, Camunda, Fission Workflows, and GoCD. The architecture of all of these systems is in the "scheduler + polling" style. This is because in use cases such as CI/CD and media processing, the workload requires exactly-once execution, so persisting each step/task state is necessary, and different tasks may be executed on different machines. I believe this is why "scheduler + polling" is applied in those scenarios.

However, in my use case we want to use the workflow engine to reduce the repetitive work of writing code that pulls data from APIs, assembles it, and calls the next API. We want this process done through configuration, so that the operations team (who might not know how to code) or developers can avoid writing repetitive code.

In this case, each execution is very similar to a normal RESTful/gRPC request, and it only needs at-least-once execution (idempotency is guaranteed by the downstream APIs), so we don't want to schedule simple API calls on different machines. Actually, we expect the request/event to go through a chain of Tasks/Deciders/Parallels (like the filter/chain-of-responsibility pattern), and different executions of a workflow can reuse in-memory data structures (in Java, objects), just like the Java Servlet filter implementation. An execution runs on a single machine only, and parallelism is based on branches: each branch is executed by one thread, and all tasks on the same branch are executed by that thread. Obviously this is a bit different from the architecture Conductor has right now (I would say it's more like a serverless computing framework such as AWS Lambda, but driven by a workflow DSL).
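To make the branch-per-thread model concrete, here is a minimal sketch (not Conductor code; `runBranch` and the task lists are hypothetical) where each branch is a chain of in-process tasks executed start-to-finish on one executor thread, like a servlet filter chain:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

public class BranchSketch {
    // Run one branch: every task in the chain executes on the caller's thread.
    static String runBranch(String input, List<UnaryOperator<String>> tasks) {
        String data = input;
        for (UnaryOperator<String> task : tasks) {
            data = task.apply(data); // same thread for the whole branch
        }
        return data;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // one thread per branch
        List<UnaryOperator<String>> enrich = List.of(s -> s + "-fetched", s -> s + "-assembled");
        List<UnaryOperator<String>> audit  = List.of(s -> s + "-logged");

        // Two parallel branches of the same "workflow execution"
        CompletableFuture<String> b1 = CompletableFuture.supplyAsync(() -> runBranch("req", enrich), pool);
        CompletableFuture<String> b2 = CompletableFuture.supplyAsync(() -> runBranch("req", audit), pool);

        // join() acts as the workflow's final join task
        System.out.println(b1.join() + " | " + b2.join()); // prints "req-fetched-assembled | req-logged"
        pool.shutdown();
    }
}
```

Because no task state is persisted between steps, this trades Conductor's durability for latency, which matches the at-least-once requirement described above.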

Any comments, guidance, or discussion are welcome, and it would be so nice if we could have access to some of your plans ~

aravindanr commented 4 years ago

Our team is searching for a workflow execution engine that can orchestrate chains of microservice calls with less than 100ms overhead and can support 1k execution per second (on single machine).

Can you elaborate on the "single machine" part? Why does it have to be on a single machine?

In this case, each execution is very similar to a normal RESTful/gRPC request, and it only needs at-least-once execution (idempotency is guaranteed by the downstream APIs), so we don't want to schedule simple API calls on different machines.

This sounds very similar to GraphQL.

Conductor can be scaled to the level that you are looking for, provided you pick the right dependencies. For this volume, Redis could be chosen, and if persistence/auditing is not a concern, indexing (Elasticsearch) can be disabled. If it's a single machine, a LocalOnlyLock could be used instead of Zookeeper. We are happy to help in any way possible if you want to explore further.
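As a rough sketch, those choices map to server properties along these lines (names follow the Conductor 2.x configuration; the locking property in particular is an assumption, so verify them against your version's documentation):

```properties
# Redis-backed persistence instead of a remote Dynomite cluster
db=redis

# Skip Elasticsearch indexing entirely (no search/auditing)
workflow.indexing.enabled=false

# Single-node in-process lock instead of Zookeeper (assumed property name)
workflow.decider.locking.server=LOCAL_ONLY
```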

imaffe commented 4 years ago

Thanks, your answer is helping us a lot.

Can you elaborate on the "single machine" part? Why does it have to be on a single machine?

No, what I mean is that on average each machine should handle 1k requests per second (and there will perhaps be many, many machines~)

Yeah, I disabled all possible databases and Elasticsearch. I saw some improvements and am still exploring other parameters~

imaffe commented 4 years ago

indexing (Elasticsearch) can be disabled

Do you know how to disable ES? I set async.index.enable=true so that it does not write task state updates to ES synchronously, but I'm not sure if this is the right way to do it~

imaffe commented 4 years ago

Also, I'm wondering if we can set the Jetty thread pool size. I read in the Jetty documentation that min threads = 8 and max threads = 200; we might want something higher than that.

apanicker-nflx commented 4 years ago

Do you know how to disable es ?

ES can be disabled by setting the property workflow.indexing.enabled to false.

Configuring the size of the Jetty thread pool is not currently supported; however, this can be added. If this is something that is of value to you, please feel free to open a PR to make it configurable in the Jetty server.

github-actions[bot] commented 3 years ago

This issue is stale, because it has been open for 45 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

github-actions[bot] commented 3 years ago

This issue was closed, because it has been stalled for 7 days with no activity.

giacomomagini commented 2 years ago

@imaffe Did you find a way to reach 100ms overhead at 1k executions per second? Do you also have numbers when using more than one machine?

@aravindanr Can you share any numbers about overhead when running Conductor at different scales, like 1k, 10k, or 100k executions per second? The workflow I'm talking about would be 25 tasks split across 5 levels (e.g. 5 in parallel -> 5 in parallel -> 5 in parallel -> 5 in parallel -> 5 in parallel).