DataGov-SamagraX / deploy

My learnings on setting up a cluster using docker-compose

Resources #1

ChakshuGautam opened this issue 2 years ago

ChakshuGautam commented 2 years ago

Docker Based Guides

  1. https://github.com/panovvv/bigdata-docker-compose
  2. https://github.com/big-data-europe/docker-hadoop
  3. https://marcel-jan.eu/datablog/2020/10/25/i-built-a-working-hadoop-spark-hive-cluster-on-docker-here-is-how/

Aggregated Links

  1. Awesome

Architecture

Adding these here since they will be needed for managing the cluster, and the maintainer should know the basics.

  1. Hadoop Architecture Guide
  2. Hbase Architecture
  3. Comparison with Cassandra
  4. Hive vs Hbase
  5. Streaming Hbase Edits
  6. Hive Connector with Trino

Thumb Rules

  1. Slide 6

Benchmarks

The end goal is to set this up on Gitpod with some sample data to play around with.

Unsorted

  1. How to beat the CAP theorem
ChakshuGautam commented 2 years ago

I dug deeper into Trino and it does not allow storage of cubes (https://trino.io/episodes/13.html): "Trino performs batch processing and is not a realtime system where Pinot is great for ingesting data in batch or stream." "The other key word, low latency could technically apply to both Pinot and Trino but in the context of realtime subsecond latency, Trino is slow compared to Pinot."

This is where Kylin was good. This is a good article on when/why to use Trino. I don't think Trino can do sub-second latency queries, and Metabase will ask Trino to run the same query again and again for every user, making it a bottleneck. We should discuss this with Prutech; really, they should have raised this during today's call.
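To make the bottleneck concrete, here is a rough sketch using the `trino` Python client. Host, catalog, and table names are made up; the point is that every `execute()` is a fresh batch scan, so each dashboard viewer in Metabase re-pays the full query cost instead of hitting a pre-aggregated store.

```python
# Sketch only: illustrates repeated full scans per dashboard viewer on Trino.
# Connection details and the student_attendance table are hypothetical.
import time
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator host
    port=8080,
    user="metabase",
    catalog="hive",
    schema="analytics",
)

QUERY = """
SELECT district, count(*) AS visits
FROM student_attendance
GROUP BY district
"""

cur = conn.cursor()
for viewer in range(3):  # three dashboard viewers -> three identical batch queries
    start = time.time()
    cur.execute(QUERY)
    rows = cur.fetchall()
    print(f"viewer {viewer}: {len(rows)} rows in {time.time() - start:.1f}s")
```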

Either we add Pinot for sub-second querying or we ask them for a suggestion.

rahul101001000 commented 2 years ago

Pasting your other message as well -

In summary, there should be one OLAP database (ClickHouse would be my pick, but Druid and Pinot are popular as well) between HBase and Metabase, which can be updated by either Kylin or Presto/Trino. But we cannot do without it.
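A rough sketch of that pipeline step, assuming ClickHouse as the OLAP store and made-up hosts, tables, and column families: periodically copy rows out of HBase into ClickHouse so Metabase only ever queries the columnar store.

```python
# Sketch only: HBase -> ClickHouse load step. All names are placeholders.
import happybase                      # pip install happybase
from clickhouse_driver import Client  # pip install clickhouse-driver

hbase = happybase.Connection("hbase.example.internal")   # hypothetical host
clickhouse = Client("clickhouse.example.internal")       # hypothetical host

events = hbase.table("events")        # hypothetical HBase table

batch = []
for row_key, data in events.scan():
    batch.append((
        row_key.decode(),
        data[b"cf:district"].decode(),  # hypothetical column family/qualifiers
        int(data[b"cf:visits"]),
    ))

clickhouse.execute(
    "INSERT INTO events_flat (id, district, visits) VALUES",  # hypothetical table
    batch,
)
```

In practice Kylin or Trino would do this step, as described above; the sketch just shows where the intermediate OLAP store sits.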

ChakshuGautam commented 2 years ago

We have two good choices right now -

  1. Citus Data, if we want a mix of data warehouse + APIs on top of it that works with Hasura and everything else we are doing. It scales horizontally like HBase, can cache things, and is 100% PostgreSQL compatible: it is an extension to PostgreSQL that converts it into a distributed database (see the sketch after this list). This is a good choice for the Samarth use case. Citus Data is owned by Microsoft and they have a paid offering based on it on Azure, so I know it is battle tested.
  2. ClickHouse, if we want sub-second analytics and will use it purely for aggregates, analytics, etc. It is the best OLAP database right now (comparable with Redshift, which is Amazon's product). PostHog uses it and it is backed by huge VC funding.
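A minimal sketch of point 1, assuming a Citus coordinator reachable over the normal Postgres wire protocol (host, database, and table names are made up): the "extension that converts PostgreSQL into a distributed database" part boils down to one SQL function call.

```python
# Sketch only: turning a plain Postgres table into a Citus-distributed table.
# Connection details and the attendance table are hypothetical.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="citus-coordinator.example.internal",  # hypothetical coordinator
    dbname="samarth",
    user="postgres",
    password="secret",
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS citus;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS attendance (
        student_id bigint,
        district   text,
        visit_date date
    );
""")
# One call shards the table across the worker nodes by student_id.
cur.execute("SELECT create_distributed_table('attendance', 'student_id');")
```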

Cloudflare articles on how they both fared, supporting the research: they started with CitusDB - https://www.citusdata.com/blog/2015/04/14/scaling-cloudflare-citusdb/ - and shifted to ClickHouse in 2018 - https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/


Also, I do think we should benchmark all three to find out which is better. I would recommend either the GitHub data or the NYC cab rides data; both have over a billion rows. We should try our hands at all three to choose one - that would be more data driven.
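A hedged sketch of what that benchmark could look like: run the same aggregate against Citus (plain Postgres protocol) and ClickHouse on the chosen dataset and compare wall-clock time. The `trips` table, query, and connection details are placeholders for the NYC taxi / GitHub data once loaded.

```python
# Sketch only: same aggregate, two engines, wall-clock comparison.
import time

import psycopg2                       # pip install psycopg2-binary
from clickhouse_driver import Client  # pip install clickhouse-driver

QUERY = """
SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count
"""

def bench_citus() -> float:
    conn = psycopg2.connect(
        host="citus-coordinator.example.internal",  # hypothetical
        dbname="bench", user="postgres", password="secret",
    )
    cur = conn.cursor()
    start = time.time()
    cur.execute(QUERY)
    cur.fetchall()
    return time.time() - start

def bench_clickhouse() -> float:
    client = Client("clickhouse.example.internal")  # hypothetical
    start = time.time()
    client.execute(QUERY)
    return time.time() - start

print(f"citus:      {bench_citus():.2f}s")
print(f"clickhouse: {bench_clickhouse():.2f}s")
```

The same query could be pointed at HBase via Trino/Phoenix for the third data point, but the harness stays identical: one query, one timer per engine.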


Assuming the current choices are HBase (HBase is not OLAP, so it shouldn't be in this list), ClickHouse, and Citus: Citus will be the slowest of the three for analytical queries, but it can compete with the others for our use case and can be used with Hasura etc. as the main DB without changing much, giving us a unified system to handle both OLAP and OLTP type queries.

For completely OLAP-based loads, I think ClickHouse competes really well with Redshift and beats it on commodity hardware like VMs without GPUs (the govt use case).