findmentor-network / analytics

We want to create a new-gen analytics platform by developers, for developers. This project aims to be a one-click analytics solution for open-source projects & more.
https://analytx.dev/
MIT License

Data Streaming Pipeline For Realtime Dashboards #16

Open cagataycali opened 3 years ago

cagataycali commented 3 years ago

@onurbaran is helping us create the data pipelines; the draft schema is below.

[image: draft pipeline schema]

muratdemirci commented 3 years ago

We can use a cross-db ORM for adaptation, which means that if someone has an end product (like a website/webapp/mobile app etc.) they can smoothly adapt it to the current product. Example: https://github.com/biggora/caminte
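
A rough sketch of that adaptation idea in Go (not caminte itself; the `Store` interface, event shape and in-memory backend are made up, just to show how one storage interface can sit in front of whatever database the adopter already runs):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Event is a single analytics event, independent of the backing database.
type Event struct {
	Page      string
	Timestamp int64
}

// Store abstracts the database so an existing product can plug in
// whatever it already runs (SQL, Mongo, ...), caminte-style.
type Store interface {
	SaveEvent(ctx context.Context, e Event) error
	CountByPage(ctx context.Context, page string) (int64, error)
}

// memoryStore is a stand-in backend used only to show the adapter shape;
// a real implementation would wrap an actual DB driver.
type memoryStore struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newMemoryStore() *memoryStore {
	return &memoryStore{counts: make(map[string]int64)}
}

func (m *memoryStore) SaveEvent(_ context.Context, e Event) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[e.Page]++
	return nil
}

func (m *memoryStore) CountByPage(_ context.Context, page string) (int64, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[page], nil
}

func main() {
	var s Store = newMemoryStore()
	_ = s.SaveEvent(context.Background(), Event{Page: "/docs", Timestamp: 1610064000})
	n, _ := s.CountByPage(context.Background(), "/docs")
	fmt.Println("/docs views:", n)
}
```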

cagataycali commented 3 years ago

https://druid.apache.org/

onurbaran commented 3 years ago

Caminte is an interesting project for hybrid data storage solutions. We can use it when we separate our data into hot and cold segments (if needed). Let's go step by step and define our high-level components and requirements. (It will be a bit of a long story, to warm myself up.)

[image: high-level pipeline diagram]

1. Streaming Processor

This is the stage where we touch incoming data for the first time. One of the most important points to pay attention to here is that our communication with the API layer must continue without being blocked.

First of all we must think about which technologies, tools and concepts we need; generally we should check our requirements and goals at this step. Our first goal is to handle incoming requests reliably under high traffic and create the ideal pipeline for real-time data processing. If we play the game on data-driven projects we must think and guess i++ steps ahead of the current requirements/needs, so I added analytics and long-term storage stages to my high-level diagram to describe the general approach. I have some analytics plans about applying Hidden Markov Chains to real-time click-streaming data.

We can use Apache Kafka to handle the API's data and distribute streaming events to the Real Time, Analytics and Storage parts of the project with Kafka topics.
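
A minimal sketch of that handoff, assuming the segmentio/kafka-go client and a local broker (the topic name, broker address and event shape are placeholders): the collector accepts an event without blocking on Kafka by pushing it onto a buffered channel, and a background goroutine produces it to a clickstream topic.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// ClickEvent is a placeholder shape for what the API layer hands us.
type ClickEvent struct {
	Page string `json:"page"`
	TS   int64  `json:"ts"`
}

func main() {
	// Buffered channel keeps the API handler from blocking on Kafka.
	events := make(chan ClickEvent, 1024)

	writer := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"), // placeholder broker
		Topic:    "clickstream-raw",           // placeholder topic
		Balancer: &kafka.LeastBytes{},
	}
	defer writer.Close()

	// Background producer: drain the channel and publish to Kafka.
	go func() {
		for ev := range events {
			payload, _ := json.Marshal(ev)
			err := writer.WriteMessages(context.Background(), kafka.Message{
				Key:   []byte(ev.Page),
				Value: payload,
			})
			if err != nil {
				log.Printf("produce failed: %v", err)
			}
		}
	}()

	// Simulated API handler: enqueue and return immediately.
	events <- ClickEvent{Page: "/docs", TS: time.Now().Unix()}
	time.Sleep(time.Second) // give the background producer a moment in this demo
}
```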

2. Streaming Engine

The streaming engine's main behaviour is to consume the processor's (1) data and invoke some pre-calculation jobs. Pre-calculation jobs can be invoked according to Kafka topics or attributes of the incoming streaming data. We can use Apache Druid, Spark, KSQL (Kafka SQL), Apache Flink or custom solutions as a consumer. For example, if we want to continue with a custom solution we can use Haskell (yes, the applied-mathematics guy is here), Go, Scala, Python... I recommend using Go and Go channels for the consumer part of this project, because we can invoke some mini hidden processing pipelines with Go channels. "Hidden processing pipeline" here means auto-generated processing rules. You can think of these operations as a simple map-reduce flow; the biggest difference, of course, is that we will manage our processor on the fly like Spark processors. (See the sketch after the diagram below.)

[image: streaming engine diagram]
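
A toy sketch of the Go-channels idea (all names are made up; it just shows fanning pre-processed events out to per-rule mini pipelines and reducing them into counts, i.e. a simple map-reduce flow whose rules could be generated on the fly):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Event is a pre-processed record coming out of stage (1).
type Event struct {
	Page string
}

// rule is one "auto-generated processing rule": map an event to a key,
// or return "" to drop it; we then reduce by counting keys.
type rule func(Event) string

func main() {
	// Rules could be generated on the fly from topics or event attributes.
	rules := map[string]rule{
		"pageviews": func(e Event) string { return e.Page },
		"docs-only": func(e Event) string {
			if strings.HasPrefix(e.Page, "/docs") {
				return e.Page
			}
			return ""
		},
	}

	// One mini pipeline (input channel + goroutine + counts) per rule.
	inputs := make(map[string]chan Event)
	results := make(map[string]map[string]int)
	var wg sync.WaitGroup
	for name, r := range rules {
		in := make(chan Event, 64)
		counts := make(map[string]int)
		inputs[name] = in
		results[name] = counts
		wg.Add(1)
		go func(r rule, in chan Event, counts map[string]int) {
			defer wg.Done()
			for e := range in { // map + reduce for this rule
				if k := r(e); k != "" {
					counts[k]++
				}
			}
		}(r, in, counts)
	}

	// Broadcast each consumed event to every rule's pipeline.
	events := []Event{{Page: "/docs/intro"}, {Page: "/pricing"}, {Page: "/docs/intro"}}
	for _, e := range events {
		for _, in := range inputs {
			in <- e
		}
	}
	for _, in := range inputs {
		close(in)
	}
	wg.Wait()

	fmt.Println(results) // pre-calculated aggregates, ready for the real-time stage
}
```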

So we can pass the pre-calculated data to our real-time stage. The real-time stage keeps only hot data and serves dashboards, alerts and similar tools over an API, sockets, etc. We can use different kinds of technologies with some "analytics-first thinking" techniques. For example we can use Redis with a timestamp:page_name count notation, or we can use Elastic, VoltDB, MongoDB... All of these technologies have pros and cons, so we need to focus on our goals and base requirements.
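
A minimal sketch of that Redis notation, assuming the go-redis client and a local Redis (the key layout, per-minute bucketing and TTL are my assumptions, just to make the timestamp:page_name counter concrete):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Bucket hot counters per minute: "<unix_minute>:<page_name>" -> count.
	bucket := time.Now().Truncate(time.Minute).Unix()
	key := fmt.Sprintf("%d:/docs/intro", bucket)

	// Increment on every pre-calculated event; expire so only hot data stays.
	if err := rdb.Incr(ctx, key).Err(); err != nil {
		panic(err)
	}
	rdb.Expire(ctx, key, 24*time.Hour)

	// Dashboards read the counter straight back.
	n, _ := rdb.Get(ctx, key).Int64()
	fmt.Println(key, "=", n)
}
```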

okankaraduman commented 3 years ago

Hello, I think it's very important to check best practices. Below is a schema of the Snowplow data processing pipeline (a do-it-yourself solution for clickstream data).

[image: Snowplow architecture diagram]

They offer various trackers, but that is not our case. After we take our data from the JavaScript tracker, there is an API (written in Scala) that collects it and publishes the data to Kafka topics. (In GCP we could consider Pub/Sub, and in AWS Kinesis, but those are far more expensive.) There is an enrichment process in which we enrich the data (like geolocation etc.) and write it to another topic; it's worth checking out here: https://docs.snowplowanalytics.com/docs/enriching-your-data/available-enrichments/ip-lookup-enrichment/

Finally, there is a loader API for loading the data into a database. In GCP, their loader API simply starts Dataflow jobs that ingest the data into BigQuery. (It's expensive.)
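
A hedged sketch of that collect → enrich → re-publish step in Go, again assuming segmentio/kafka-go and made-up topic names (the enrichment is just a placeholder field, standing in for something like the IP-lookup enrichment linked above):

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

type RawEvent struct {
	Page string `json:"page"`
	IP   string `json:"ip"`
}

type EnrichedEvent struct {
	RawEvent
	Country string `json:"country"` // placeholder for a real IP-lookup enrichment
}

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker
		Topic:   "clickstream-raw",          // placeholder topics
		GroupID: "enricher",
	})
	defer reader.Close()

	writer := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "clickstream-enriched",
	}
	defer writer.Close()

	ctx := context.Background()
	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}

		var raw RawEvent
		if err := json.Unmarshal(msg.Value, &raw); err != nil {
			continue // skip malformed events
		}

		// "Enrich" the event; a real pipeline would call a geo/IP lookup here.
		enriched := EnrichedEvent{RawEvent: raw, Country: "TR"}

		out, _ := json.Marshal(enriched)
		if err := writer.WriteMessages(ctx, kafka.Message{Key: msg.Key, Value: out}); err != nil {
			log.Printf("publish failed: %v", err)
		}
	}
}
```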

My Thoughts About Data Pipeline

I fully agree with @onurbaran about using Go for the collector & enrichment & loader APIs (the world is moving that way). For streaming data, I think the only option is Kafka. There is a managed version of it offered by Confluent, if we don't want to mess with that: https://www.confluent.io/confluent-cloud/

Finally, the schema below encapsulates all my thoughts about the pipeline. I took it from a slide which I found on the deep, deep internet :) Don't sue me because of the resolution :(

[image: pipeline schema from a slide]