chaps-io / gush

Fast and distributed workflow runner using ActiveJob and Redis

Gush falls down under large-scale workflows #35

Closed · carlthuringer closed this 6 years ago

carlthuringer commented 7 years ago

Hi, I was super excited to integrate Gush into my latest project, but right at the end I hit a wall and can't proceed any further.

My issue is twofold: `Gush#configure` must be deterministic because it runs on every job setup, and when a workflow has thousands of jobs, that setup becomes so slow that it defeats the benefit of parallelism.
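
To illustrate the determinism constraint, here is a minimal hypothetical `configure` (job and class names invented): since Gush re-evaluates it on every job setup, the same inputs must always yield the same graph.

```ruby
class TenantSeedWorkflow < Gush::Workflow
  def configure(tenant_ids)
    # Deterministic: same input, same jobs, same order on every evaluation.
    tenant_ids.sort.each do |id|
      run SeedTenantJob, params: { tenant_id: id }
    end
    # Anything random or time-dependent here (SecureRandom, Time.now, ...)
    # would produce a different graph on each rebuild and break dependencies.
  end
end
```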

My Gush workflow has ~4,000 jobs, and the dependency graph is quite deep. Regenerating the graph every time a job starts causes each job to take several seconds before it actually executes. I experimented with caching the dependency graph in Redis, but that actually made the problem worse, not better, because of the serialization/deserialization overhead.
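
For illustration, a hypothetical sketch of that caching experiment (the key, `redis` handle, and `build_graph` helper are invented, not Gush internals): even with the graph cached, every job start still pays a JSON round-trip proportional to the number of nodes, so the cost moves rather than disappears.

```ruby
require "json"

def cached_graph(redis, workflow_id)
  key  = "workflow:#{workflow_id}:graph"
  json = redis.get(key) || begin
    graph_json = build_graph(workflow_id).to_json # O(nodes) serialize, once
    redis.set(key, graph_json)
    graph_json
  end
  JSON.parse(json) # O(nodes) parse on *every* job start
end
```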

In short, Gush cannot handle workflows several orders of magnitude larger than the examples. Tens or hundreds of nodes seem doable, but the overhead of loading a workflow with thousands of nodes is too much.

Do you have any thoughts or experience with workflows this large? Any advice? I'm back to the drawing board; maybe I'll cook up something with plain Sidekiq that's better suited to my needs.

pokonski commented 7 years ago

Hi @carlthuringer! That is indeed an extreme case. Do you have 4k unique job classes, or are you generating them dynamically? This sounds more like you need simpler batching and/or to split the work into smaller workflows.

carlthuringer commented 7 years ago

The workflow I'm attempting to orchestrate is a "seeding" workflow. Just imagine a typical Rails monolith riddled with vast and complex `belongs_to` and `has_many` relationships, and you'll have some idea of what I'm trying to produce.

Generating all the data in one shot is unreliable and time-consuming. There's no way I can do it in a transaction, so the logical next step is to break it down into idempotent stages. Gush seemed like a great solution for elegantly declaring what data needed to be created, and which data depended on which, all without juggling specific references or using confusing flow control.
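
For illustration, one way such an idempotent stage might look as a Gush job (model and job names are hypothetical): re-running it converges on the same state instead of duplicating rows.

```ruby
class SeedAccountsJob < Gush::Job
  def perform
    # find_or_create_by! makes the stage safe to re-run after a failure
    params[:account_names].each do |name|
      Account.find_or_create_by!(name: name)
    end
  end
end
```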

I've got about 18 job classes, and I use a lot of looping to run jobs with various `after: [batch, of, jobs]`-style configurations. At the leaves I end up with M * N * O combinations that I want to generate data for and orchestrate with Gush.
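
A minimal sketch of that looping pattern, with hypothetical job classes and only two of the dimensions for brevity: `run` returns a job id, and `after:` accepts arrays of those ids.

```ruby
class SeedWorkflow < Gush::Workflow
  REGIONS  = %w[us eu apac].freeze
  PRODUCTS = %w[basic pro].freeze

  def configure
    region_jobs  = {}
    product_jobs = {}

    REGIONS.each  { |r| region_jobs[r]  = run(SeedRegionJob,  params: { region: r }) }
    PRODUCTS.each { |p| product_jobs[p] = run(SeedProductJob, params: { product: p }) }

    # Leaves: one job per (region, product) pair, each waiting only on
    # the two upstream jobs it actually depends on.
    REGIONS.product(PRODUCTS).each do |r, p|
      run SeedInventoryJob,
          params: { region: r, product: p },
          after: [region_jobs[r], product_jobs[p]]
    end
  end
end
```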

The biggest advantage of using Gush is that I can specify complex convergent workflows, which is impossible with Sidekiq Batches. Without convergence, I need several breakpoints where data is prepared, the system arrives at a known state, and only then does the next stage start. The programming becomes more procedural and literal, instead of the elegant, implicit way Gush lets me set up all the initial data and have downstream jobs simply wait for the jobs they depend on to complete.
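
To illustrate the convergence point, here is a small hypothetical workflow (job names invented) where independent branches fan back into shared downstream jobs, with no manual checkpoints between stages:

```ruby
class ConvergentSeedWorkflow < Gush::Workflow
  def configure
    users    = run SeedUsersJob
    catalogs = run SeedCatalogsJob
    prices   = run SeedPricesJob

    # Orders converge on users + catalogs; reports converge on everything.
    orders = run SeedOrdersJob, after: [users, catalogs]
    run BuildReportsJob, after: [orders, prices]
  end
end
```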

pokonski commented 7 years ago

Thanks for the detailed explanation. There are ways to improve how we store the state, which I will explore. My colleague here at Chaps suggested some good ideas that we'll try to implement, as the current approach is rather naive (we never expected such large workflows :))

Stay tuned!

pokonski commented 6 years ago

@carlthuringer a year later, but I just pushed a change to the activejob branch (it will be in 1.0) that greatly reduces the time needed to spawn hundreds or thousands of jobs. Overall it should speed up execution by a lot.

pokonski commented 6 years ago

Closing this, as 1.0.0 was released with major performance improvements. If the issue still occurs, please open a new ticket.