COOL-cohort / COOL

the source code of the COOL system
https://www.comp.nus.edu.sg/~dbsystem/cool/
Apache License 2.0

Some tool Limitations/ New features to be added #30

Closed raghavchalapathy closed 2 years ago

raghavchalapathy commented 2 years ago

Many thanks for the great work! I found some limitations that prevent us from using the tool in its current state:

  1. Data for analysis must be present as a CSV file, Avro, HDFS, etc. Our data lives in Google BigQuery tables, so we would need to build a connector for it.
  2. The tool lets us run cube queries, iceberg queries and cohort queries, but we need a utility that generates these queries from the column names. Writing them is currently manual and time-consuming, especially when there are many features.
  3. The benefit of using COOL when cohort analysis can be run directly on BigQuery tables needs to be established. I cannot see this comparison even in the published paper.
  4. Most enterprise data resides in BigQuery tables, especially for work on modelling and developing new architectures. The current CSV, Avro, etc. formats inhibit data scientists from using the tool, since we deal with large-scale data of gigabits that would have to be exported and stored. Even if we develop neural network models, how do we overcome this limitation? Kindly advise if this is not a limitation.
  5. I look forward to collaborating if these features are already supported, and to working closely to establish a viable proof of concept (POC).

KimballCai commented 2 years ago

Thanks for your meaningful comments and suggestions.

  1. I have spent some time learning about Google BigQuery. If my understanding is correct, data also has to be uploaded into that data warehouse first, and Google BigQuery also supports CSV, Avro, Parquet and other data formats.
  2. Cohort analysis unavoidably needs complicated cohort queries, in which we define the cohorts, their birth events, the measurements, and so on. Specifically, a conventional OLAP database system needs at least five SQL queries to perform cohort analysis in a non-intrusive manner; the blog post [1] gives an example, and the paper [2] has more details. To tackle this challenge, COOL provides user-friendly querying primitives that address the pain point of writing complex and lengthy cohort queries in SQL-like languages. In addition, to make cohort queries easier to design, we have built a web application (the COOL-webapp project [3]) that helps users define cohort queries visually.
  3. The vision of COOL is to address the inefficiency of underlying database systems in processing cohort analysis (cohort queries), an emerging and widely used analysis pattern in various areas. In COOL, cohort query processing is handled by specialized operators that need only two fast scans over sophisticated storage to achieve real-time responses. Hence, compared with most SQL-based databases, COOL can express cohort queries more flexibly, as explained in point 2. It also offers better query efficiency, and we have experimentally compared COOL with several state-of-the-art analytical systems in the paper [4]. What's more, Google BigQuery is closed source, whereas we want to grow COOL into an open-source project whose development we can all contribute to.
  4. I have looked at BigQuery ML, but our target is time-series data, which differs from the non-time-series data modelling that BigQuery ML does. In the future, we will also try to work out whether we can connect the database to AI modelling. We are also exploring cohort modelling in the COOL system to better support the development of neural networks. For example, we could directly use cohort analysis results to help build neural networks, or define cohorts in a supervised way to explore meaningful cohort patterns. This research is still in progress.
  5. The COOL system is still under development, and we are devoting ourselves to making it a useful and efficient tool for large-scale cohort analysis.

[1] Cohort analysis using Data Studio and BigQuery, https://hodo.dev/posts/post-32-gcp-bigquery-cohort-report/
[2] D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. Cohort Query Processing. In Proceedings of the VLDB Endowment, 10(1), 2016.
[3] COOL website application, http://ec2-13-212-103-48.ap-southeast-1.compute.amazonaws.com:8201/en/login/?next=/en/dashboard
[4] Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, and B. C. Ooi. Cool: a COhort OnLine analytical processing system. In 2020 IEEE 36th International Conference on Data Engineering, pp. 577-588, 2020.
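
To make the terms used in point 2 concrete (cohort, birth event, age, measurement), here is a minimal plain-Python sketch of a cohort report. It is only an illustration of the concept, not COOL's query language, storage format or operators. The event names, the sample records and the `cohort_report` helper are all hypothetical, and grouping users by the ISO week of their birth event is just one possible cohort definition.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity records: (user, event, day, value).
EVENTS = [
    ("u1", "launch", date(2023, 1, 2), 0),
    ("u1", "purchase", date(2023, 1, 3), 10),
    ("u2", "launch", date(2023, 1, 4), 0),
    ("u2", "purchase", date(2023, 1, 4), 5),
    ("u1", "purchase", date(2023, 1, 10), 7),
    ("u3", "launch", date(2023, 1, 9), 0),
    ("u3", "purchase", date(2023, 1, 17), 4),
]


def cohort_report(events, birth_event, measure_event):
    """Sum the measured value per (cohort, age in weeks).

    A user's birth is the first occurrence of `birth_event`; the cohort is
    the (ISO year, ISO week) of that birth (an assumption for this sketch).
    """
    # Pass 1: find each user's birth day.
    births = {}
    for user, event, day, _ in events:
        if event == birth_event and (user not in births or day < births[user]):
            births[user] = day

    # Pass 2: aggregate the measurement by cohort and age.
    report = defaultdict(int)
    for user, event, day, value in events:
        if event != measure_event or user not in births:
            continue
        birth = births[user]
        if day < birth:
            continue  # activity before birth is not part of the cohort
        cohort = tuple(birth.isocalendar()[:2])  # (ISO year, ISO week)
        age = (day - birth).days // 7            # whole weeks since birth
        report[(cohort, age)] += value
    return dict(report)


if __name__ == "__main__":
    for (cohort, age), total in sorted(cohort_report(EVENTS, "launch", "purchase").items()):
        print(cohort, age, total)
```

Expressing the same report in SQL typically takes several chained queries (find births, assign cohorts, compute ages, filter, aggregate), which is the pain point described in point 2.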