cda-group / arc

Programming Language for Continuous Deep Analytics
https://cda-group.github.io/arc/

Use-cases #267

Closed segeljakt closed 1 year ago

segeljakt commented 3 years ago

This issue is about finding new use-cases for arc-script.

Problem

Applications are needed to motivate the design of arc-script. For arc-script we have multiple levels of abstraction.

In the ideal case, an application is just a matter of implementing an algorithm (such as PageRank). In general, however, data must also be pre- and post-processed. Things get hairy when we must also interact with other systems. For example, Alibaba's use case requires interaction with MySQL/HBase data sources and Druid data sinks, while Zalando's requires interaction with Kafka/S3 data sources.

Application requirements

TODO: List of requirements which our runtime/language must offer

Application non-requirements

TODO: List of requirements which our runtime/language cannot offer at the moment (and likely never will)

List of Applications

This is a continuously updated list of applications in data analytics that can help motivate the design of Arc-Script. I will fill in details to explain the idea and requirements of each use case. Some use cases share similar implementation patterns, and some may serve as building blocks for other use cases. Data analytics pipelines may be composed of many different algorithms, which makes it important to distinguish between what our language should provide out of the box and what should merely be implementable in it.

Streaming Algorithms

Streaming algorithms have a well-established theory behind them (known bounds on time and space). However, their applications are very specific: too specific, on their own, to cover the area of BigData analytics. BigData analytics can nevertheless sometimes involve streaming algorithms as building blocks.

Requirements

Requirements of such algorithms are that they:

  • process each element of the stream in a single pass (or a small number of passes), and
  • use memory that is sub-linear in the size of the stream.

Examples

Some examples from Wikipedia are Frequency Moments, Frequent Elements, Event Detection, Counting Distinct Elements, Feature Hashing, Stochastic Gradient Descent, Bloom filter, Count-min sketch, Locality-sensitive hashing, MinHash, SimHash, and w-shingling. We should not delve into individual algorithms, since they are rarely useful on their own in the broader picture.
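To make the requirements above concrete, here is a minimal sketch of one of the listed algorithms, a Count-min sketch, in plain Python (this is illustrative only and not tied to any arc-script API; the width/depth parameters are arbitrary):

```python
import hashlib

class CountMinSketch:
    """Approximate per-item frequency counter in fixed, sub-linear space."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash-derived column per row; seeded by the row index.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        # Single pass over the stream: each element touches depth cells.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates; may overestimate due to hash collisions.
        return min(self.table[row][col] for row, col in self._cells(item))
```

Note that the memory use is fixed by `width * depth` regardless of stream length, which is exactly the sub-linear-space property that characterizes streaming algorithms.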

BigData Streaming

BigData streaming is about processing massive amounts of information in real-time. Analytics are both continuous and deep, and for this reason impose stricter requirements than streaming algorithms in general.
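A recurring building block of such continuous analytics is windowed aggregation over an unbounded stream. As a language-agnostic sketch (in Python, not arc-script; the event shape and window size are assumptions), a tumbling-window count per key could look like:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per key in fixed-size tumbling windows.

    events: iterable of (timestamp, key) pairs, assumed ordered by timestamp.
    Yields (window_start, {key: count}) as each window closes.
    """
    current_start = None
    counts = defaultdict(int)
    for ts, key in events:
        start = (ts // window_size) * window_size
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete: emit and reset.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the last open window
```

The continuous aspect shows up in the generator emitting results as windows close rather than at end-of-stream; a real runtime would additionally have to handle out-of-order events and distribution.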

Requirements

Requirements of continuous deep data analytics are as follows:

Examples

Following is a list of examples of BigData streaming:

Complex Event Processing Algorithms

In the words of Esper, "Complex Event Processing" is about analysing events to find situations of interest. CEP detects and derives information, which can be reacted to by deciding and doing an action. This is known as the 4D model (Detect-Derive-Decide-Do).

An example situation to be detected is: A suspicious account is derived whenever there are at least three large cash deposits in the last 15 days.

  • The "Detect" is about the raw event, for example a cash deposit event.
  • The "Derive" is about the situation, i.e. "did something happen?", for example a suspicious account being derived.
  • The "Decide" is about deciding what to do, for example determining a risk score or another course of action.
  • The "Do" is the action, for example opening an investigation.
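The suspicious-account example above can be sketched end-to-end. This is a plain Python illustration of the 4D model, not Esper EPL or arc-script; the deposit threshold and the alert record shape are invented for the example:

```python
from collections import defaultdict, deque

LARGE_DEPOSIT = 10_000   # hypothetical threshold for a "large" deposit
WINDOW_DAYS = 15
MIN_DEPOSITS = 3

def detect_suspicious(deposits):
    """4D model: Detect raw events, Derive the situation,
    Decide on a response, Do the action (here: emit an alert record).

    deposits: iterable of (day, account, amount), ordered by day.
    Returns the list of emitted alert records.
    """
    recent = defaultdict(deque)  # account -> days of recent large deposits
    alerts = []
    for day, account, amount in deposits:
        if amount < LARGE_DEPOSIT:
            continue                         # Detect: keep only large deposits
        window = recent[account]
        window.append(day)
        while window and day - window[0] > WINDOW_DAYS:
            window.popleft()                 # evict deposits older than 15 days
        if len(window) >= MIN_DEPOSITS:      # Derive: suspicious account
            alerts.append({"account": account, "day": day,
                           "action": "open_investigation"})  # Decide + Do
    return alerts
```

The per-account sliding window is the "situation" state that CEP engines maintain implicitly; in a pattern language this whole function would collapse to a single declarative rule.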

Resources