Open ChakshuGautam opened 2 years ago
I dug deeper into Trino and it does not allow storage of cubes. https://trino.io/episodes/13.html "Trino performs batch processing and is not a realtime system where Pinot is great for ingesting data in batch or stream." "The other key word, low latency could technically apply to both Pinot and Trino but in the context of realtime subsecond latency, Trino is slow compared to Pinot."
This is where Kylin was good. This is a good article on when/why to use Trino. I don't think it can do subsecond latency queries, and Metabase will ask Trino to run the same query again and again for every user, making it a bottleneck. We should discuss this with Prutech. Rather, they should have brought this up during today's call.
Either we add Pinot for sub-second querying or we ask them for a suggestion.
Pasting your other message as well -
In summary, there should be one OLAP database (ClickHouse would be my pick, but Druid and Pinot are popular as well) between HBase and Metabase, which can be updated by either Kylin or Presto/Trino. We cannot do without it.
We have two good choices right now -
Articles on how they both fared for Cloudflare, supporting the research: https://www.citusdata.com/blog/2015/04/14/scaling-cloudflare-citusdb/
They shifted to ClickHouse in 2018: https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
Also, I do think we should benchmark all three to find out which is better. I would recommend either the GitHub data or the NYC cab rides data. Both have over a billion rows, and we should try all three databases on them before choosing one. That would be more data driven.
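A rough sketch of what such a benchmark harness could look like, in Python. Nothing here is tied to a real client library: each engine is just a callable that runs one query, and the names `time_query`, `compare`, and the stand-in engines in the demo are made up for illustration. In practice each callable would wrap the actual driver call for the database under test and run the same analytical query against the same dataset.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


def time_query(run_query: Callable[[], object],
               repeats: int = 5,
               warmup: int = 1) -> Dict[str, float]:
    """Run one query several times and summarise wall-clock latency in seconds."""
    for _ in range(warmup):
        run_query()  # warm-up runs are executed but not measured
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def compare(engines: Dict[str, Callable[[], object]],
            repeats: int = 5) -> List[Tuple[str, Dict[str, float]]]:
    """Benchmark every engine and return (name, stats) pairs, fastest median first."""
    results = [(name, time_query(fn, repeats)) for name, fn in engines.items()]
    return sorted(results, key=lambda item: item[1]["median"])


if __name__ == "__main__":
    # Hypothetical stand-ins: replace each lambda with a real query call.
    demo_engines = {
        "engine_a": lambda: time.sleep(0.002),
        "engine_b": lambda: time.sleep(0.010),
    }
    for name, stats in compare(demo_engines, repeats=3):
        print(f"{name}: median {stats['median'] * 1000:.1f} ms")
```

Reporting the median rather than the mean keeps a single slow run (cold cache, GC pause) from dominating the comparison, which matters when the goal is to judge subsecond dashboard queries.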
Assuming the current choices are HBase (HBase is not OLAP, so it shouldn't be in this list), ClickHouse and Citus: Citus will be the slowest of the three for analytical queries, but it can compete with the others for our use case and can be used with Hasura etc. as the main DB without changing much, giving us a unified system to handle both OLAP and OLTP type queries.
For completely OLAP-based loads, I think ClickHouse competes really well with Redshift and beats it on commodity hardware like VMs without GPUs (govt use cases).
Docker Based Guides
Aggregated Links
Architecture
Adding these here since these will be needed for managing the cluster and the maintainer should know the basics.
Thumb Rules
Benchmarks
The end goal is to set this up on Gitpod with some sample data to play around with.
Unsorted