calrissian / flowmix

Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. It allows parallel streams to be processed that can be grouped together in different ways.
Apache License 2.0

flowmix design doc #54

Open zqhxuyuan opened 8 years ago

zqhxuyuan commented 8 years ago

I am writing a Chinese-language document about Flowmix's design (not completely finished yet): http://zqhxuyuan.github.io/2015/07/26/2015-09-11-Flowmix-CEP/ Hopefully it can help someone.

@cjnolet After digging into the Flowmix source code, I also have a question:

AggregatorWindow is composed of an Aggregator and a Window: the Aggregator is responsible for storing the aggregate variables, while the Window stores the original events. Normally a PartitionOp comes before the AggregatorOp to perform a group-by, and partitioning ensures that each partition corresponds to exactly one Window. If a Window stores at most 1000 events, there are 1000 partitions, and one event takes about 1 KB, then the windows in AggregateBolt take 1000 partitions × 1000 events × 1 KB ≈ 1 GB of memory. I assume that is why the Aggregator keeps temporary variables, which are enough to compute the aggregate result. My question is: if the Aggregator's temporary variables are sufficient, why do we need to keep the events in the Window?

cjnolet commented 8 years ago

Hey,

The design doc looks great so far! I haven't looked at it in extreme detail, but I like what I saw on a quick browse.

In reference to your question, if each event is 1kb in size and it is partitioned and grouped, the 1GB should actually be spread across the cluster. The trick with CEP here is to do the memory management to make sure heaps aren't blown (and I suppose if garbage collection becomes a concern, a local back end like Redis could help get it off the heap).
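As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. The class and method names are hypothetical (nothing here comes from Flowmix), and it assumes partitions are spread evenly across workers; the worker count of 10 in the usage note is an assumption for illustration only. It also ignores JVM object overhead, so real heap usage would be higher.

```java
// Hypothetical estimate helper; not part of Flowmix.
class WindowMemoryEstimate {

    // Total bytes held by all windows: partitions * events per window * bytes per event.
    static long totalBytes(long partitions, long eventsPerWindow, long bytesPerEvent) {
        // multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
        return Math.multiplyExact(Math.multiplyExact(partitions, eventsPerWindow), bytesPerEvent);
    }

    // Share of that total held by each worker, assuming an even spread (rounded up).
    static long bytesPerWorker(long totalBytes, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        return (totalBytes + workers - 1) / workers;
    }
}
```

With 1000 partitions, 1000 events per window, and 1024 bytes per event, `totalBytes` gives 1,024,000,000 bytes; split evenly over an assumed 10 workers, that is 102,400,000 bytes of raw event data per worker.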

The reasoning behind storing the events and passing them along with the aggregate function is that events can expire out of a window, and at that point they need to be removed from the aggregate function's state as well. Generally, it would be good practice to keep only the attributes of each event that the aggregation needs, rather than storing every event in full.
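To make the expiry point concrete, here is a minimal sketch of a count-based sliding sum. It is not Flowmix's actual Aggregator or Window code; the class name and methods are hypothetical. It retains only the one numeric attribute the aggregation needs, and it shows why the retained values are required: when the oldest entry is evicted, its value has to be subtracted from the running sum.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration; not Flowmix's Aggregator or Window classes.
class SlidingSumSketch {
    private final int maxSize;                              // count-based window capacity
    private final Deque<Long> window = new ArrayDeque<>();  // retained values, oldest first
    private long sum = 0;                                   // running aggregate

    SlidingSumSketch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    // Adds a value, evicts the oldest one if the window is over capacity,
    // and returns the sum over the values still in the window.
    long add(long value) {
        window.addLast(value);
        sum += value;
        if (window.size() > maxSize) {
            long evicted = window.removeFirst();
            sum -= evicted; // only possible because the evicted value was retained
        }
        return sum;
    }

    long sum() {
        return sum;
    }

    int size() {
        return window.size();
    }
}
```

With a capacity of 3, adding 10, 20, and 30 gives a sum of 60; adding 40 then evicts 10 and the sum becomes 90. Without the stored values, the aggregate would have no way to know what to retract.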

If you have ideas on ways to minimize the footprint, pull requests would certainly be welcome here.

Thanks again.