Closed Trevoke closed 5 years ago
I think this would clear a lot of things up for new adopters including myself
I think max_demand
and min_demand
are more suitable than these suggestions because these options configure the maximum demand in flight and minimum demand in flight. These are from the perspective of the consumer, which is the process that sends the demand. The consumer will try its best to keep demand in flight between these thresholds.
@fishcakez
The situation I run into all the time is that unless you know precisely what the correct perspective is, these two option names do not help you understand what the correct perspective is. I come to this from the perspective of someone who hasn't really done map/reduce work before, so there may be an industry standard in the naming that we shouldn't change lest we confuse those who are familiar with the industry standard, but failing that...
What if they were called :high_watermark
and :low_watermark
? This would be a shift in the meaning of :min_demand
though:
When implementing consumers, we often set the
:max_demand
and:min_demand
on subscription. The:max_demand
specifies the maximum amount of events that must be in flow while the:min_demand
specifies the minimum threshold to trigger for more demand. For example, if:max_demand
is 1000 and:min_demand
is 750, the consumer will ask for 1000 events initially and ask for more only after it receives at least 250.
would become
When implementing consumers, we often set the
:high_watermark
and:low_watermark
on subscription. The:high_watermark
specifies the amount of events that the stage is able to work with at once, while the:low_watermark
indicates when the stage will request more events. For example, if:high_watermark
is 1000 and:low_watermark
is 250, the consumer will ask for 1000 events initially and ask for more only after it receives at least 250.
I think the switch to high_watermark
and low_watermark
as in your description would be more confusing because the high watermark would be for demand and low watermark for events? We would want the values to measure the same things if they have the same naming. The low watermark should still be the min demand, not the difference.
The perspective I'm going for is "code should be written by humans for humans" (lofty!), and my corollary to this is "who cares how the computer understands it as long as both human and machine are able to agree on what happens" (lofty again!). One of the things I try to remain very aware of is that almost all of GenStage is based entirely on machine configuration because configuring GenStage is about configuring it for peak efficiency, and that is entirely based on the work and the hardware. This means that the perspective I'm going for might be completely invalid... But I'm not sure, at least in this example.
Ultimately, there's three values here that matter, and I think they can be described this way:
If these descriptions are incorrect, then please help me fix them. They are based on my leftover understanding from the conversation we had on the Elixir Slack when you cleared up my misconception on the meaning of :min_demand
and :max_demand
.
If I rewrite these to bring in "min demand" and "max demand", here's what I end up with:
max_demand - min_demand
events.I think the two are the same thing? But I find the second one significantly more circuitous.
I am not sure I understand either of descriptions. From my mental model of GenStage there are only two values that matter. They can be thought of in two different ways.
Or
Currently we use the later model. I think you are advocating for the first though. In that case we still have max_demand
but would be changing min_demand
to something like ask_every
, which would be max_demand - min_demand
.
I am currently advocating for the first model, yes. Something like ask_every
would be better. I'm trying to think of how I would explain this to new people, or how the documentation would read.
:max_demand
is the maximum number of events that must be in flow through this stage, while:ask_every
is the minimum number of events that the stage must process before it will ask its producer for more events.
I think that's simple enough.
So the problem is that a consumer will first ask for max_demand
and then after that it will only ever ask for max_demand - min_demand
. Therefore it is confusing because that value last could also be thought of as the min_demand
since the consumer will only ask for either of those 2 values. This is a solid argument for changing the min_demand
configuration to a different name or for configuration to move the other model.
We have two questions to answer here:
Did we choose the best configuration parameters, regardless of the name?
What are the best names for those parameters?
Today we have two parameters, the "maximum amount I will ever ask" and "the low watermark of events I need to have before I ask for more". I believe those are the best parameters. The former reveals the information the producer is receiving. It is very transient for the consumer, since statistically speaking you are rarely at max, but accurate to the consumer. And although we can express the second in different ways, I prefer the current mechanism because it tells about the consumer capacity and queue. Something like "ask_every: 20" may give us the impression that are only 20 events in flux but the events in flux are actually what is being requested plus what is currently in the consumer queue.
So I think the current parameters, regardless of name, help us understand both sides of the producer-consumer relationship. They are the most intention revealing. If we can't agree those are best parameters, then we definitely can't agree on the name.
Assuming we can agree on those parameters, then we can talk about names and I agree we need to change at least one of max_demand
or min_demand
. They are talking about different things. max_demand
is the maximum demand you will send. min_demand
is not the minimum demand you will send, the minimum demand is actually given by max_demand - min_demand
.
We should also not be tempted to necessarily change both names at the same time, or make they mirror each other, otherwise we will fall into the same trap we have fallen in the initial design.
My proposal is to stick with max_demand
, because it is quite literally the maximum demand we send, and rename min_demand
to something else. When coming up with function and option names, I like to start with the most verbose name and then figure out the smaller version for it. There are also two idioms we could use "ask for more events" or "send demand", which are both used by GenStage. Here are some ideas:
:ask_for_more_on_threshold
or :ask_for_more_at
:send_demand_on_threshold
or :send_demand_at
:min_demand_on_threshold
or :min_demand_at
More suggestions?
I'm a newbie to genstage, been really difficult to understand in detail how it works. From what I understand, which I could be totally wrong
You have your "initial demand amount" aka max_demand then you have your "processing chunk size" aka max_demand - min_demand then you have "ask for more chunks when current chunks sizes >= value aka min_demand
Couldn't "processing chunk size" be set to any number?
Also is max_demand only used because you want to send as much as you can the first emit to consumers?
I think what is most confusing is, in my mind the thing that you want to visualize the most is the chunk size, but the name for it is max_demand - min_demand
Hi @mikeni! Processing chunk size is not accurate because we may process smaller chunks.
Remember there is no guarantee the producer will send max_demand
when you ask for max_demand
. You may ask for 100 and it may send 7, 3, 15, etc. Those will all be your "chunk sizes".
This also means that max_demand
is not necessarily about the first batch. It is rather a mechanism to avoid requesting entries time after time, which would end-up generating many events (and therefore a bunch of overhead).
I see, thank you for explaining.
I have a question about the buffer mentioned in the docs. If there is a producer and consumer with min_demand 400 and max_demand 700.
It looks like the consumer will call handle_events with a size of 300 twice.
Does that entire 700 events from the producer get delivered to the consumer which has a queue (buffer) ? or is the queue (buffer) on the producer itself, so even though the handle_demand returns 700 events, it only sends out 300 at a time?
Does that entire 700 events from the producer get delivered to the consumer which has a queue (buffer) ?
If the producer has 700 and sends 700, then they are all delivered at once.
The consumer will then process all of them at once, without putting them in a buffer or anything.
But then you can say: "but @josevalim, my handle_events
callback is called only with 300 events". That's because the consumer knows that it can ask for more events after 300 are consumed, so it first consumes the 300 and then it consumes the 400 remaining. That is done without any buffering. We just break that large batch in two parts to make sure we can ask for more events as soon as possible.
gotcha thanks
Closing as we are heading towards 1.0 and we will make do with the vocabulary we have today.
There's some assumption that gets made when people read
:max_demand
and:min_demand
about what that means or what will happen or what could happen. I've seen a number of folks confused (myself included), and it's always a little thorny to explain it.How about a slight change to what we refer to? Something like this, maybe?
:max_events_handled
as a new name for:max_demand
:ask_for_events_at
which would be the number of events in the stage at which point it would ask for more.In the current scenario, we would have:
max_demand: 1000, min_demand: 700
and we would ask for 300 more events when we hit the low watermark of 700 events. We could have:max_events_handled: 1000, ask_for_events_at: 700
which would also mean we would ask for 300 more events when we hit the low watermark of 700 events.Another option would be to do something like:
max_events_in_stage
formax_demand
max_events_in_flight
for what is currently computed asmax_demand - min_demand
.What do you think?