Open rkuhn opened 11 years ago
I think this is a good start. I'll add some further thoughts and more concrete examples in subsequent comments that can maybe help the conversation.
Thanks!
Behaviors (Timed Production)
An example of these are metrics such as counters or snapshots of rolling latency percentiles.
Generally the desire is for the consumers to get the most recent values as often as possible. The discrete events emitted are artificial and thus the timing can be adjusted to be as fast or slow as the consumer can handle.
A more concrete example that I'm dealing with is as follows:
Let's say we have 200 servers in the morning that are emitting streams of metrics and have a consumer that aggregates them for consumption by a data visualization. During off-peak hours this can be done with timed production of 500ms between "pushes" from each of the 500 servers. Each push is a burst of several hundred metrics.
Approaching peak evening hours for simple math the number of servers hits 1000 so the number of streams and events has multiplied by 5x and the single machine consuming those streams can no longer handle it all.
We have a variety of options:
1) [Manual Operations] We can configure the streams to always emit more slowly so they are configured to be handled by the single consumer even at peak traffic. This increases latency of the data even when the machines are well underutilized.
2) [Manual Operations] We can add multiple consumers in a tiered fashion to be capable of handling peak traffic in low latency, but during off-hours the boxes are underutilized.
3) [Global Flow Control] We can have the single consumer provide a feedback loop to the producers to speed up or slow down depending on it's capacity. Thus in off-peak hours it could have the latency at 500ms between events and peak hours move it up to 2000ms.
4) [Global Flow Control] We can have the consumer split itself (fork-join style) when it reaches capacity so the producers continue emitting at 500ms but we go from 1 consumer to multiple in a tiered setup.
5) [Local Flow Control] Drop events (head dropping or drop whole queue) on the client (assuming it is cpu and not network that is saturated) if the consumer falls behind.
Events
Different than metrics/behaviors, events are triggered by something such as incoming user requests, anomaly detection algorithms, etc.
How to handle event streams depends on the business case for the data.
I can make choices such as:
1) Drop Data
Perhaps in a large cluster it's acceptable to drop data (preferably with metrics about how much I've dropped) such as log events, exceptions thrown, alerts as sampling/throttling is sufficient to get a valid signal.
In this case it is often an optimization of cost to not scale up infrastructure sufficient to handle all events but instead just sample.
2) Queue Remotely
If dropping data is not acceptable (ignoring fault tolerance for now, just capacity) and usage spikes occur (or just normal peak traffic) that overwhelm consumers, and it's okay for consumption to be latent I may queue the events in a remote system such as Kafka or HDFS for later consumption when consumers catch up.
3) Queue Locally
If a stream is typically capable of being handled by consumers but have bursts (average 100rps, but can burst all within a 10ms window for example) then local queues on each consumer may be sufficient and not need more (again ignoring fault tolerance for now in case of events being mission critical which changes the decisions).
4) Autoscale
If events need to be processed in low-latency and infrastructure can scale (prioritize work on a finite cluster, or scale up in an environment such as public cloud) then the consumers can scale up as the numbers of events increase.
The producers can emit events via a load balancer (software, hardware, client or network based) in a complete push model, or it can be decoupled in a push/pull model such as by using Kafka.
5) Over Allocate
I can always just allocate resources high enough to handle whatever the highest peak traffic volumes are going to be (and shed load if that gets overwhelmed for unexpected reasons).
A non-trivial complication of all of these is that we want to allow developers to build complicated transformations and combinations of streams and their application of logic can easily break many of the back-pressure solutions if implicit or even explicit queueing happens.
These are beyond the client/server network points and thus the in-memory (single process) reactive stream needs to retain the feedback loop even if custom operators or async observers are used. This is still a problem area in Rx, even if the portions at the "edge" of the client/server relationship are solved.
I can post some sample code of these types of examples if that is useful.
Those are great use cases to consider, thanks for elaborating! I think the aspect of wanting to sample a practically continuous source as quickly as processing allows (up to a limit) is a genuine case that is not yet covered, would you mind adding that to the wiki?
And the aspect of reacting to load by modifying the routing of substreams (elastically) is another good one to think about: this is probably a common enough case that providing mechanisms or hooks for translating back-pressure into a metric makes sense—and that metric can then be used to steer a load-balancer, spin up new nodes in the cloud, etc. I think that one deserves to be added to the list as well.
Where do we want to put code samples? It might be good to keep them in the main repo and only link to them from the wiki (unless they are really small). This would in principle allow the samples to be compile-checked as well—where that makes sense.
Even better if the test cases are fixtures so it is possible to test different implementations :)
Hi folks,
I added three samples which I had in my mind:
I’m sure that the list still is far from exhaustive but I’m hesitant to flesh out things more before you guys tell me that I’m not running completely into the wrong direction ;-)
Regards,
Roland