IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposal: streamsx.clickstream #124

Closed rzbhatti closed 6 years ago

rzbhatti commented 6 years ago

Proposal

Proposed here is a toolkit for clickstream analytics. This toolkit will provide the basic functions and operators needed to build an application for click or tap stream analytics. It will also provide a sample clickstream analytics application based on a streaming architecture.

Motivation

Real-time streaming analytics of click or tap streams is highly significant for the digital transformation of growing enterprises. It provides a way to monitor, qualitatively and quantitatively, the effectiveness of web or mobile applications. Our client engagement experience with large-scale mobile enterprises shows that real-time clickstream analytics is imperative to:

Toolkit Components

Clickstream Classification Operator

A scalable and dynamically updated set of classification rules is defined in a JSON file. Each JSON rule specifies a string attribute of the input stream to be matched against a specified string, partial string, or regex. When a rule matches, the specified attributes of the output stream are updated according to the classification given by that rule.
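The rule matching described above can be sketched as follows. This is only an illustration in Python, not the proposed SPL operator: the rule schema (the `attribute`, `match`, `pattern`, and `set` keys) and the function names are assumptions, since the proposal does not show the actual JSON format.

```python
import json
import re

# Hypothetical rule file content; the real schema is not given in the proposal.
RULES_JSON = """
[
  {"attribute": "url", "match": "exact",   "pattern": "/checkout",     "set": {"category": "purchase"}},
  {"attribute": "url", "match": "partial", "pattern": "/product/",     "set": {"category": "browse"}},
  {"attribute": "url", "match": "regex",   "pattern": "^/help(/.*)?$", "set": {"category": "support"}}
]
"""


def rule_matches(rule, value):
    """Return True if the string value satisfies the rule's match type."""
    kind = rule["match"]
    if kind == "exact":
        return value == rule["pattern"]
    if kind == "partial":
        return rule["pattern"] in value
    if kind == "regex":
        return re.search(rule["pattern"], value) is not None
    raise ValueError("unknown match type: " + kind)


def classify(event, rules):
    """Copy the input event and apply the output attributes of every matching rule."""
    out = dict(event)
    for rule in rules:
        value = event.get(rule["attribute"])
        if isinstance(value, str) and rule_matches(rule, value):
            out.update(rule["set"])
    return out


rules = json.loads(RULES_JSON)
print(classify({"url": "/product/42", "user": "u1"}, rules))
```

Reloading `rules` from the file whenever it changes would give the dynamic update behaviour; how the actual operator detects changes is not stated in the proposal.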

Custom aggregate functions for progressive and cascaded aggregates

Instead of “sliding window aggregates”, cascaded “tumbling window aggregates” are used to produce a Count-By-Distinct function.
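One way to read the cascading idea is sketched below in Python, under assumptions: a first stage emits per-key counts for small tumbling windows, and a second stage merges those partial results into larger tumbling windows, so the larger window only holds compact partials rather than raw tuples. The function names, the `(timestamp, key)` event shape, and the interpretation of Count-By-Distinct as a count per distinct key are all illustrative, not taken from the toolkit.

```python
from collections import Counter


def tumbling_counts(events, window_size):
    """Stage 1: group (timestamp, key) events, assumed sorted by timestamp,
    into tumbling windows and yield (window_start, per-key counts)."""
    current_start = None
    counts = Counter()
    for ts, key in events:
        start = ts - ts % window_size
        if current_start is not None and start != current_start:
            yield current_start, counts
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts


def cascade(partials, window_size):
    """Stage 2: merge stage-1 partial counts into larger tumbling windows."""
    current_start = None
    merged = Counter()
    for start, counts in partials:
        big_start = start - start % window_size
        if current_start is not None and big_start != current_start:
            yield current_start, merged
            merged = Counter()
        current_start = big_start
        merged.update(counts)  # adds the per-key counts
    if current_start is not None:
        yield current_start, merged


events = [(0, "a"), (5, "b"), (12, "a"), (61, "a"), (65, "c")]
for start, counts in cascade(tumbling_counts(events, 10), 60):
    print(start, dict(counts), "distinct keys:", len(counts))
```

Further stages (for example hourly into daily) can be chained the same way by feeding one `cascade` into another.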

Graph generator operator

A custom SPL operator to produce a graph JSON for:

rzbhatti commented 6 years ago

For graph visualization of the customer journey and path analytics, visualization tools like Cytoscape can be used: http://js.cytoscape.org and http://marvl.infotech.monash.edu/webcola/index.html
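As a rough illustration of what a graph JSON for such tools might look like, the Python sketch below turns per-session page paths into nodes and weighted edges using the `nodes`/`edges` elements layout that Cytoscape.js accepts. The input shape, the `weight` field, and the edge id format are assumptions for illustration; the output of the proposed operator is not specified in this thread.

```python
import json
from collections import Counter


def build_graph(sessions):
    """Build a nodes/edges graph from a list of page paths, one path per session."""
    nodes = set()
    edge_counts = Counter()
    for path in sessions:
        nodes.update(path)
        # Count each page-to-page transition in the session.
        for src, dst in zip(path, path[1:]):
            edge_counts[(src, dst)] += 1
    return {
        "nodes": [{"data": {"id": n}} for n in sorted(nodes)],
        "edges": [
            {"data": {"id": f"{s}->{d}", "source": s, "target": d, "weight": w}}
            for (s, d), w in sorted(edge_counts.items())
        ],
    }


graph = build_graph([["home", "product", "checkout"], ["home", "product"]])
print(json.dumps(graph, indent=2))
```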

ddebrunner commented 6 years ago

Is "thumbing window aggregates" a typo for "tumbling window aggregates"?

chanskw commented 6 years ago

+1

gertmark commented 6 years ago

@rzbhatti are you proposing a set of predefined classifiers/schemas? Please give some examples. I support your proposal to provide a toolkit for this kind of analytics. I wonder if the use cases are limited to clickstream data or if this could be a generic approach to rule-based classification and aggregation of messages or event data records.

dakshiagrawal commented 6 years ago

I am not sure how it should be classified - as a toolkit or as an example/sample/pattern. There are certain tricky/complex operators/composites which are out of the reach of the ordinary developer and hence it is worth making this public. These are:

(a) user-defined functions in aggregates are needed because the out-of-the-box functions are not sufficient. (b) cascading of aggregates needs to be done so as to avoid a large memory footprint. (c) if the application needs to be brought down for maintenance, large windows of data must not be lost (e.g., if the clickstream is being tracked for a seven-day period). (d) there are companion algorithms (and code) on the UI side, which let you do "Since" queries - how many clicks for xyz since Wednesday....

chanskw commented 6 years ago

I support having a clickstream repository. Instead of a toolkit, would it make sense to classify them as microservices where we provide pre-built applications that perform these complex analytic functions? Would an example be provided to show how to stitch these together to produce a meaningful application? Is this something similar to streamsx.health where we would be providing domain specific services, accelerating the development of clickstream analytics applications?

rzbhatti commented 6 years ago

I agree that it is definitely more than just a toolkit of functions and operators. It contains sample microservice applications for data acquisition, stream classification, global and session-level aggregation, and finally graphical visualization of the analytics.

ddebrunner commented 6 years ago

+1 though I would encourage thinking about microservices as being intended to be used by users out of the box, rather than just being samples.

Maybe initially the toolkit could stay focused on clickstream analysis and then if a general pattern exists it could be extracted, rather than trying to start out with a general purpose solution with no clear goals in mind.

chanskw commented 6 years ago

Repository created, waiting for response from @rzbhatti regarding CLA... and then I will create the committer team.

chanskw commented 6 years ago

Added @rzbhatti to streamsx.clickstream project. Please review this welcome page to familiarize yourself with some of the project guidelines: https://github.com/IBMStreams/administration/blob/master/welcome.md