elixir-lang / gen_stage

Producer and consumer actors with back-pressure for Elixir
http://hexdocs.pm/gen_stage
1.51k stars 192 forks source link

Suggestion: new language instead of max_demand and min_demand #201

Closed Trevoke closed 5 years ago

Trevoke commented 6 years ago

There's some assumption that gets made when people read :max_demand and :min_demand about what that means or what will happen or what could happen. I've seen a number of folks confused (myself included), and it's always a little thorny to explain it.

How about a slight change to what we refer to? Something like this, maybe?

  1. :max_events_handled as a new name for :max_demand
  2. :ask_for_events_at which would be the number of events in the stage at which point it would ask for more.

In the current scenario, we would have: max_demand: 1000, min_demand: 700 and we would ask for 300 more events when we hit the low watermark of 700 events. We could have: max_events_handled: 1000, ask_for_events_at: 700 which would also mean we would ask for 300 more events when we hit the low watermark of 700 events.

Another option would be to do something like:

  1. max_events_in_stage for max_demand
  2. max_events_in_flight for what is currently computed as max_demand - min_demand.

What do you think?

mikeni commented 6 years ago

I think this would clear a lot of things up for new adopters including myself

fishcakez commented 6 years ago

I think max_demand and min_demand are more suitable than these suggestions because these options configure the maximum demand in flight and minimum demand in flight. These are from the perspective of the consumer, which is the process that sends the demand. The consumer will try its best to keep demand in flight between these thresholds.

Trevoke commented 6 years ago

@fishcakez

The situation I run into all the time is that unless you know precisely what the correct perspective is, these two option names do not help you understand what the correct perspective is. I come to this from the perspective of someone who hasn't really done map/reduce work before, so there may be an industry standard in the naming that we shouldn't change lest we confuse those who are familiar with the industry standard, but failing that...

What if they were called :high_watermark and :low_watermark? This would be a shift in the meaning of :min_demand though:

When implementing consumers, we often set the :max_demand and :min_demand on subscription. The :max_demand specifies the maximum amount of events that must be in flow while the :min_demand specifies the minimum threshold to trigger for more demand. For example, if :max_demand is 1000 and :min_demand is 750, the consumer will ask for 1000 events initially and ask for more only after it receives at least 250.

would become

When implementing consumers, we often set the :high_watermark and :low_watermark on subscription. The :high_watermark specifies the amount of events that the stage is able to work with at once, while the :low_watermark indicates when the stage will request more events. For example, if :high_watermark is 1000 and :low_watermark is 250, the consumer will ask for 1000 events initially and ask for more only after it receives at least 250.

fishcakez commented 6 years ago

I think the switch to high_watermark and low_watermark as in your description would be more confusing because the high watermark would be for demand and low watermark for events? We would want the values to measure the same things if they have the same naming. The low watermark should still be the min demand, not the difference.

Trevoke commented 6 years ago

The perspective I'm going for is "code should be written by humans for humans" (lofty!), and my corollary to this is "who cares how the computer understands it as long as both human and machine are able to agree on what happens" (lofty again!). One of the things I try to remain very aware of is that almost all of GenStage is based entirely on machine configuration because configuring GenStage is about configuring it for peak efficiency, and that is entirely based on the work and the hardware. This means that the perspective I'm going for might be completely invalid... But I'm not sure, at least in this example.

Ultimately, there's three values here that matter, and I think they can be described this way:

If these descriptions are incorrect, then please help me fix them. They are based on my leftover understanding from the conversation we had on the Elixir Slack when you cleared up my misconception on the meaning of :min_demand and :max_demand.

If I rewrite these to bring in "min demand" and "max demand", here's what I end up with:

I think the two are the same thing? But I find the second one significantly more circuitous.

fishcakez commented 6 years ago

I am not sure I understand either of descriptions. From my mental model of GenStage there are only two values that matter. They can be thought of in two different ways.

Or

Currently we use the later model. I think you are advocating for the first though. In that case we still have max_demand but would be changing min_demand to something like ask_every, which would be max_demand - min_demand.

Trevoke commented 6 years ago

I am currently advocating for the first model, yes. Something like ask_every would be better. I'm trying to think of how I would explain this to new people, or how the documentation would read.

:max_demand is the maximum number of events that must be in flow through this stage, while :ask_every is the minimum number of events that the stage must process before it will ask its producer for more events.

I think that's simple enough.

fishcakez commented 6 years ago

So the problem is that a consumer will first ask for max_demand and then after that it will only ever ask for max_demand - min_demand. Therefore it is confusing because that value last could also be thought of as the min_demand since the consumer will only ask for either of those 2 values. This is a solid argument for changing the min_demand configuration to a different name or for configuration to move the other model.

josevalim commented 6 years ago

We have two questions to answer here:

  1. Did we choose the best configuration parameters, regardless of the name?

  2. What are the best names for those parameters?

Today we have two parameters, the "maximum amount I will ever ask" and "the low watermark of events I need to have before I ask for more". I believe those are the best parameters. The former reveals the information the producer is receiving. It is very transient for the consumer, since statistically speaking you are rarely at max, but accurate to the consumer. And although we can express the second in different ways, I prefer the current mechanism because it tells about the consumer capacity and queue. Something like "ask_every: 20" may give us the impression that are only 20 events in flux but the events in flux are actually what is being requested plus what is currently in the consumer queue.

So I think the current parameters, regardless of name, help us understand both sides of the producer-consumer relationship. They are the most intention revealing. If we can't agree those are best parameters, then we definitely can't agree on the name.

Assuming we can agree on those parameters, then we can talk about names and I agree we need to change at least one of max_demand or min_demand. They are talking about different things. max_demand is the maximum demand you will send. min_demand is not the minimum demand you will send, the minimum demand is actually given by max_demand - min_demand.

We should also not be tempted to necessarily change both names at the same time, or make they mirror each other, otherwise we will fall into the same trap we have fallen in the initial design.

My proposal is to stick with max_demand, because it is quite literally the maximum demand we send, and rename min_demand to something else. When coming up with function and option names, I like to start with the most verbose name and then figure out the smaller version for it. There are also two idioms we could use "ask for more events" or "send demand", which are both used by GenStage. Here are some ideas:

More suggestions?

mikeni commented 6 years ago

I'm a newbie to genstage, been really difficult to understand in detail how it works. From what I understand, which I could be totally wrong

You have your "initial demand amount" aka max_demand then you have your "processing chunk size" aka max_demand - min_demand then you have "ask for more chunks when current chunks sizes >= value aka min_demand

Couldn't "processing chunk size" be set to any number?

Also is max_demand only used because you want to send as much as you can the first emit to consumers?

I think what is most confusing is, in my mind the thing that you want to visualize the most is the chunk size, but the name for it is max_demand - min_demand

josevalim commented 6 years ago

Hi @mikeni! Processing chunk size is not accurate because we may process smaller chunks.

Remember there is no guarantee the producer will send max_demand when you ask for max_demand. You may ask for 100 and it may send 7, 3, 15, etc. Those will all be your "chunk sizes".

This also means that max_demand is not necessarily about the first batch. It is rather a mechanism to avoid requesting entries time after time, which would end-up generating many events (and therefore a bunch of overhead).

mikeni commented 6 years ago

I see, thank you for explaining.

I have a question about the buffer mentioned in the docs. If there is a producer and consumer with min_demand 400 and max_demand 700.

It looks like the consumer will call handle_events with a size of 300 twice.

Does that entire 700 events from the producer get delivered to the consumer which has a queue (buffer) ? or is the queue (buffer) on the producer itself, so even though the handle_demand returns 700 events, it only sends out 300 at a time?

josevalim commented 6 years ago

Does that entire 700 events from the producer get delivered to the consumer which has a queue (buffer) ?

If the producer has 700 and sends 700, then they are all delivered at once.

The consumer will then process all of them at once, without putting them in a buffer or anything.

But then you can say: "but @josevalim, my handle_events callback is called only with 300 events". That's because the consumer knows that it can ask for more events after 300 are consumed, so it first consumes the 300 and then it consumes the 400 remaining. That is done without any buffering. We just break that large batch in two parts to make sure we can ask for more events as soon as possible.

mikeni commented 6 years ago

gotcha thanks

josevalim commented 5 years ago

Closing as we are heading towards 1.0 and we will make do with the vocabulary we have today.