Closed by chasers 5 years ago
A monitor process that restarts the pipeline based on current throughput?
Yeah, that would be the only way for now. Just keep in mind that restarting the whole thing might take some time, depending on the number of stages, the size of the buffers, and the number of in-flight messages in the pipeline.
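A minimal sketch of such a monitor process, assuming a hypothetical `MyApp.Pipeline` Broadway module that accepts a `:processor_concurrency` option. The thresholds, the check interval, and the metrics source are all placeholders; `Broadway.stop/1` is the real Broadway API for a graceful shutdown.

```elixir
defmodule MyApp.PipelineMonitor do
  use GenServer

  @check_interval :timer.seconds(30)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    schedule_check()
    {:ok, %{concurrency: Keyword.get(opts, :concurrency, 1)}}
  end

  @impl true
  def handle_info(:check, state) do
    target = desired_concurrency(current_throughput())

    new_state =
      if target == state.concurrency do
        state
      else
        # Restarting drains buffers and in-flight messages first, so this
        # can take a while for deep pipelines, as noted above.
        :ok = Broadway.stop(MyApp.Pipeline)
        {:ok, _pid} = MyApp.Pipeline.start_link(processor_concurrency: target)
        %{state | concurrency: target}
      end

    schedule_check()
    {:noreply, new_state}
  end

  # Hypothetical policy: map messages/second to a processor stage count.
  def desired_concurrency(mps) when mps > 1_000, do: 8
  def desired_concurrency(mps) when mps > 100, do: 4
  def desired_concurrency(_mps), do: 1

  # Wire this to real metrics (telemetry, counters, etc).
  defp current_throughput, do: 0

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end
```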
Got it ... should be easy enough!
Chase, can you please explain the need to increase the number of stages? In theory, everything will be fine if you “overspec” your pipeline. The VM is really good at managing idle processes and GenStage really good at handling demand. Have you seen something telling you otherwise? --
José Valim www.plataformatec.com.br Skype: jv.ptec Founder and Director of R&D
Thanks José!
Basically I'm just taking in events and storing them in BigQuery. We're doing some other stuff like schema management, type checking, etc but basically it's just a BigQuery pipeline.
Each user has sources. Each source has its own Broadway pipeline. This is creating a lot of processes, and some sources have significantly more throughput than others: 2_000 events per second vs. one or zero, mostly. To handle 2k per second I can't increase all pipelines, or I would run into process limit issues very soon.
And I think this is why we're running into the BEAM busy-wait behavior, which isn't an issue per se, unless we want to autoscale on CPU. Ultimately we'll be autoscaling on run queue length (I think), but that is a bit more work.
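For reference, sampling the run queues as an autoscaling signal needs only standard `:erlang` APIs (no Broadway involved); busy waiting inflates reported CPU but not these counters. The module name and the idea of normalizing by scheduler count are just one way to do it:

```elixir
defmodule RunQueueSampler do
  # Sum of all normal and dirty-CPU scheduler run queue lengths.
  def total_run_queue, do: :erlang.statistics(:total_run_queue_lengths)

  # Normalize by online schedulers so the signal is comparable across
  # differently sized machines; values above 1.0 roughly mean work is queueing.
  def saturation do
    total_run_queue() / :erlang.system_info(:schedulers_online)
  end
end
```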
Anyways, the main reason I like this model is because one source can't interfere with another. If some odd payload or BigQuery response crashes a pipeline, only one part of one customer is affected and not all pipelines for all customers.
Now ultimately we will still have issues when there are, say, 200_000 sources in the system, but I think we can pin source processes to 3 of N nodes. And by then this will be a good problem to have!
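The per-source topology described above can be sketched as one Broadway pipeline per source under a DynamicSupervisor, so a crash in one source's pipeline can't take down another's. `MyApp.SourcePipeline` is a placeholder for a per-source Broadway module:

```elixir
defmodule MyApp.SourceSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)

  # Start (or scale out) a pipeline for a single source on demand.
  # MyApp.SourcePipeline is hypothetical; it would `use Broadway`.
  def start_pipeline(source_id) do
    DynamicSupervisor.start_child(__MODULE__, {MyApp.SourcePipeline, source_id: source_id})
  end
end
```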
May I ask why not a single pipeline? Machine resources are limited, and creating the extra topologies may not help beyond drastically increasing the amount of IO (or introducing an intermediate process that will most likely slow things down).
Also, the busy wait should not impact this, unless the machines are not busy at all and you are using something like Kubernetes pods. But it can be easily disabled. --
> The main reason I like this model is because one source can't interfere with another. If some odd payload or BigQuery response crashes a pipeline, only one part of one customer is affected and not all pipelines for all customers.
Maybe at this point I'd be comfortable moving to one pipeline now that we're doing this reliably at a decent scale. Having independent pipelines has really helped edge cases not affect everyone.
Yeah, we're moving to k8s pods. Turning that off altogether is probably a good idea.
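For anyone finding this later, busy waiting is disabled with emulator flags (e.g. in a release's `vm.args`, or via `--erl` when booting); these are the standard erl flags, shown here as one plausible combination:

```
# Disable scheduler busy waiting so idle schedulers sleep instead of
# spinning, keeping reported CPU usage honest on Kubernetes pods.
+sbwt none
+sbwtdcpu none
+sbwtdio none
```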
Feel free to just close this if you want. Creating a monitor proc is easy enough, and dynamic counts would probably just confuse people.
Yeah, I will do that. We will keep paying attention to use cases and we can always revisit in the future.
I have a need to dynamically increase or decrease the processor and/or batcher stage count depending on throughput.
Or is there a suggested best practice? A monitor process that restarts the pipeline based on current throughput?
Thanks! Broadway is rad :)