benwilson512 closed this issue 8 years ago.
ugh, hit enter too early, please hold....
ok we're good now.
Honestly, I would say it is best to break expectations upfront. If people are expecting it to flow one item at a time, then I would like to break this expectation early on instead of having it apparently work well but with subpar performance and a subpar understanding of the tools. Any expectation of ordering, one-at-a-time processing, or sequentiality should be addressed.
For example, I have also seen folks using Flow to map over a list of 10 elements with very basic computations, expecting to see improvements. All of those "intuitions" are wrong and need to be broken. Better docs on those cases may help too.
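To make that concrete, here is a minimal sketch of the shape of code that triggers the wrong intuition (nothing here is from the docs; it just assumes a recent Flow version). For ten cheap elements the whole list fits in a single batch, so no real concurrency happens and Flow only adds setup overhead:

```elixir
small = Enum.to_list(1..10)

# Plain Enum: runs in the current process and preserves order.
Enum.map(small, &(&1 * 2))

# Flow: spawns stages and exchanges events in batches. With the
# default max_demand (1000 at the time of this thread), all ten
# elements land in a single batch on a single stage, so this is
# strictly slower than the Enum version above.
small
|> Flow.from_enumerable()
|> Flow.map(&(&1 * 2))
|> Enum.to_list()
```

Flow also makes no ordering guarantee across stages, which is exactly the kind of expectation worth breaking early.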
4c743529a01f543f224d5e9bca296bccc1db367e now talks about batch size right in the second paragraph.
I agree on the docs, and I'm glad to see the defaults mentioned further up.
The broader point is that the distribution of usefulness over the set of feasible numbers here has a distinct mode at 1, which isn't true of any other number. 500 is not meaningfully more useful than 499 or 501. There are plenty of cases, however, where 1 is more useful than 2, and wildly more useful than 500.
I'm happy to recognize that there are fundamental differences in how Flow and GenStage manage data, and one such difference is that they operate on batches. The problem is that the currently chosen batch sizes amount to a built-in optimization for only a certain set of problems, and I'm not sure why we should optimize those problems over others.
That last argument could be levied against choosing 1 and whatever problems it turns out to be optimal for, but 1 at least is the only definitive bound on the possible demand values.
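For completeness, one-at-a-time flow is already expressible per subscription today; a minimal consumer sketch (the producer is assumed to exist elsewhere):

```elixir
defmodule OneAtATime do
  use GenStage

  def start_link(producer) do
    GenStage.start_link(__MODULE__, producer)
  end

  def init(producer) do
    # Request at most one event per batch from the producer.
    {:consumer, :ok,
     subscribe_to: [{producer, max_demand: 1, min_demand: 0}]}
  end

  def handle_events(events, _from, state) do
    # With max_demand: 1, `events` is a single-element list.
    Enum.each(events, &IO.inspect/1)
    {:noreply, [], state}
  end
end
```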
> The problem is that the currently chosen batch sizes amount to a built-in optimization for only a certain set of problems, and I'm not sure why we should optimize those problems over others.
The main purpose of having the default of 1000 is not to be a silver bullet. I am fine with changing the value to 10 or 100. However, I agree 10 or 100 won't be much better or worse than 1000. But that's my point: the value being "more than 1" is there to show there is batching. Choosing a default of 1 would completely undermine it.
If developers are asking questions, that's a good thing, as long as they are being answered. The only other option I could think of is to have no default, but I am not sure that would solve anything.
> The main purpose of having the default of 1000 is not to be a silver bullet.
When I first started using GenStage a week ago, I had the problem that I did not know where the `demand=1000` came from, and somehow suspected it was set to `max_demand`, which (as I know now) is [partially] wrong. Before `demand` and `max_demand` settled into clarity for me, I was stumbling upon lots and lots of `Process.sleep(1000)` in the doc examples and in José's talk in London. Perhaps I was just tired or slow to digest the GenStage workings, but `Process.sleep(1042)` and a default `demand` different from the defaults for both `max_demand` and `min_demand` would have helped me.
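For anyone else who trips over the same thing: the two 1000s are unrelated. A hedged sketch of the doc-example shape (`Producer` is a hypothetical stage name), with the sleep changed to make that obvious:

```elixir
defmodule SleepyConsumer do
  use GenStage

  def init(:ok) do
    # max_demand is this subscription's batch ceiling; it has
    # nothing to do with the sleep duration below.
    {:consumer, :ok, subscribe_to: [{Producer, max_demand: 1000}]}
  end

  def handle_events(events, _from, state) do
    # Simulates a slow consumer: 1042 is milliseconds of fake
    # work. Any value works, which shows it is not demand-related.
    Process.sleep(1042)
    {:noreply, [], state}
  end
end
```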
> `Process.sleep(1000)` in the doc examples and in José's talk in London.
Oh, that's great feedback. I will take note! :D
@benwilson512 so should we stick with 1000 after all?
Sorry for waking up an old ticket. Feel free to close it if it's not the right venue or if I should start a new one.
We had the same issue with the demand (though not with Flow). Our current design is one chain of `P`, 3x `PC`, and `C`, each handling one demand at a time. If we want more throughput, we start new chains.
Why would one want a single `PC` to process an array of demands instead of one at a time? If a `PC` operates on an array of "jobs" and one of them crashes, the whole thing goes down when only that particular job should go down. From my point of view, one-at-a-time makes it easier to reason about errors and handle them when they happen.
In our case, we pop work from Redis, process it in several `PC`s, and eventually write to the DB in the `C`. Errors are handled based on which `PC` they occur in: `PC`s that fail on a network request put the original job back into the queue for later resumption, while other types of errors drop the job into an error bucket.
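For what it's worth, one-job-per-chain isn't the only way to get that isolation; a producer-consumer can also receive batches and rescue per job. A minimal sketch under that assumption (`process_job/1` and `requeue/1` are hypothetical placeholders for the Redis plumbing described above):

```elixir
defmodule IsolatingPC do
  use GenStage

  def init(producers) do
    {:producer_consumer, :ok, subscribe_to: producers}
  end

  def handle_events(jobs, _from, state) do
    # Rescue per job so one crash doesn't take down the batch.
    # A real system would match network errors specifically and
    # route other errors to the error bucket instead.
    done =
      Enum.flat_map(jobs, fn job ->
        try do
          [process_job(job)]
        rescue
          _error ->
            requeue(job)
            []
        end
      end)

    {:noreply, done, state}
  end

  # Hypothetical placeholders for the real work and Redis plumbing.
  defp process_job(job), do: job
  defp requeue(_job), do: :ok
end
```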
Now granted, I am pretty sure I am "thinking" wrong here or have missed something vital. Just want to make sure of that before we proceed with this.
I think the default demand should be one, or there should be no default at all.
Premises:
1) When people play around with Flow, they're expecting basically a configurable, concurrent Enum. Obviously there's a lot more there, but this comparison is evident in both Flow's API and the examples given in the Flow / GenStage docs, which include explicit comparisons to Enum-based pipelines.
Issues with current defaults:
Issues with the proposal of 1, and responses:
Issues with any default other than 1: