chaps-io / gush

Fast and distributed workflow runner using ActiveJob and Redis
MIT License

Scaling to thousands of jobs in workflow #50

Closed: dmitrypol closed this issue 6 years ago

dmitrypol commented 6 years ago

I really like the gush library and want to use it for tasks like importing records. The workflow will not be very complex, but there will be LOTS (many thousands) of small, identical jobs, and I am not sure how this library will scale. After studying the code I see that it stores ALL jobs inside the workflow's Redis record. That record will get huge, and serializing it to/from JSON will slow things down.

I realize that this is probably not a small change, but is it possible to use gush.jobs.WORKFLOW_ID.ImportCsvRowJob-JOB_ID keys instead? Why does the job data need to be duplicated inside the workflow key AND the job keys? Looking at the workflow.rb class I see how it looks up jobs, but could it look up the separate job keys instead (they already have the workflow_id in the key)?
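
Roughly what I have in mind, as a hypothetical sketch rather than Gush's actual code (the helper and the exact key layout are just assumptions on my part):

require 'redis'
require 'json'

# Hypothetical helper: read each job from its own Redis key
# instead of from the serialized workflow record.
def job_hashes_for(workflow_id, redis: Redis.new)
  redis.scan_each(match: "gush.jobs.#{workflow_id}.*").map do |key|
    JSON.parse(redis.get(key))
  end
end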

Has anyone tried something like that? Thank you very much.

Here is my basic workflow:

require 'csv'

class ImportWorkflow < Gush::Workflow
  def configure
    run DownloadCsvJob

    # One ImportCsvRowJob per CSV row, each depending on DownloadCsvJob.
    csv_jobs = CSV.foreach("path/to/data.csv").map do |row|
      run ImportCsvRowJob, params: row, after: DownloadCsvJob
    end

    # Generate the report only after every row import has finished.
    run GenReportJob, after: csv_jobs
  end
end
pokonski commented 6 years ago

You are correct in thinking that the job data does not need to be stored in full inside the workflow. It's mostly there for easier deserialization, a leftover from the old days.

The problem is exactly what you describe: Gush was never intended to scale huge, dynamically generated workflows. It was meant for fixed-size but long-running workflows for data normalization.

Regarding your code example, for less overhead I would recommend scaling the CSV processing inside the job, using something like https://github.com/tilo/smarter_csv for batch processing and https://github.com/grosser/parallel for processing those batches in parallel (rough sketch below).

Gush's overhead (and Sidekiq's, for example) will always be bigger than pure threading, since it was not meant for this kind of fine-grained work.

GenReportJob relies on all of the row jobs finishing, so the processing might as well be a single job.
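
Something along these lines, just as a rough sketch of the smarter_csv + parallel approach (import_row is a placeholder for your per-row logic, and the chunk size and process count are arbitrary):

require 'smarter_csv'
require 'parallel'

class ImportCsvJob < Gush::Job
  def perform
    # Read the CSV in chunks instead of loading the whole file at once.
    SmarterCSV.process('path/to/data.csv', chunk_size: 500) do |chunk|
      # Each chunk is an array of row hashes; import them across 4 processes.
      Parallel.each(chunk, in_processes: 4) { |row| import_row(row) }
    end
  end

  private

  # Placeholder for the actual per-row import logic.
  def import_row(row)
  end
end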

Anyway, I will mark this as an enhancement, because there is definitely a way to optimize the storage, but it will require changing a significant chunk of the code.

dmitrypol commented 6 years ago

Thanks for responding so quickly. I understand that this would be a big change and that my use case is different from what you envisioned. I have used smarter_csv and parallel, but the problem is that they still leave all those individual record imports inside ONE job. What if it crashes, or I need to deploy code (which restarts the process)? I personally prefer scaling out with lots of smaller jobs running across multiple servers / processes.

pokonski commented 6 years ago

True, I agree it would be ideal to scale at the job level. More granularity here would help, especially when retrying the workflow (retrying a single job instead of the whole process).

pokonski commented 6 years ago

Just an update: version 1.1 no longer serializes whole jobs inside the workflow JSON, so they are now fully separated.

dmitrypol commented 6 years ago

Thank you very much, I will check it out shortly.