Different groups of people often have different behaviors or trends. For example, the bones of older people are more porous than those of younger people. It is of great value to explore the behaviors and trends of different groups of people, especially in healthcare, because we could adopt appropriate measures in time to avoid tragedy. The easiest way to do this is cohort analysis.
However, as large volumes of data accumulate over the years, query efficiency becomes a problem for OnLine Analytical Processing (OLAP) systems, especially for cohort analysis. COOL is introduced to address this problem.
COOL is an online cohort analytical processing system that supports various types of data analytics, including cube query, iceberg query and cohort query.
With the support of several newly proposed operators on top of a sophisticated storage layer, COOL can provide high-performance (near real-time) analytical responses for emerging data warehouse domains.
mvn clean package
table.yaml file specifying the dataset's columns and their measure fields.

Before query processing, we need to load the dataset into COOL native format. The sample code to load a CSV dataset with the data loader can be found in CsvLoader.java.
./cool load \
dataset \
path/to/your/.yaml \
path/to/your/datafile \
path/to/output/datasource/directory
The five arguments in the command have the following meaning:
table.yaml (the third required source)

We provide an example for cohort query processing in CohortAnalysis.java.
./cool cohortselection \
path/to/output/datasource/directory \
path/to/your/queryfile
./cool cohortquery \
path/to/output/datasource/directory \
path/to/your/cohortqueryfile
./cool funnelquery \
path/to/output/datasource/directory \
path/to/your/funnelqueryfile
./cool olapquery \
path/to/output/datasource/directory \
path/to/your/queryfile
We have provided examples in the sogamo directory and the health_raw directory. Here we take sogamo as an example.
The COOL system supports the CSV data format by default, and you can load the sogamo dataset with the following command.
./cool load csv \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.csv \
./CubeRepo
In addition, you can run the following commands to load the dataset in other formats under the sogamo directory.
./cool load parquet \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.parquet \
./CubeRepo
./cool load arrow \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.arrow \
./CubeRepo
./cool load avro \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/avro/test.avro \
./CubeRepo \
datasets/sogamo/avro/schema.avsc
A cube named sogamo will be generated under the ./CubeRepo directory.
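The four load commands above differ only in the format keyword and, for Avro, an extra schema path. A minimal sketch of that pattern follows. The helper name `cool_load_cmd` is hypothetical and not part of COOL, and the sketch assumes the format can be inferred from the file extension. It only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: cool_load_cmd is a hypothetical helper, not part of COOL.
# It prints the ./cool load command for a data file, choosing the format
# keyword from the file extension as in the commands shown above.
cool_load_cmd() {
  cube=$1; schema=$2; data=$3; repo=$4; avsc=${5:-}
  case "$data" in
    *.csv)     fmt=csv ;;
    *.parquet) fmt=parquet ;;
    *.arrow)   fmt=arrow ;;
    *.avro)    fmt=avro ;;
    *) echo "unsupported data file: $data" >&2; return 1 ;;
  esac
  if [ "$fmt" = avro ]; then
    # The Avro example above passes the .avsc schema as a final argument.
    [ -n "$avsc" ] || { echo "avro needs a schema (.avsc) path" >&2; return 1; }
    echo "./cool load $fmt $cube $schema $data $repo $avsc"
  else
    echo "./cool load $fmt $cube $schema $data $repo"
  fi
}

# Print (do not run) the command for the Parquet copy of sogamo.
cool_load_cmd sogamo datasets/sogamo/table.yaml datasets/sogamo/data.parquet ./CubeRepo
```

Printing the command instead of executing it keeps the sketch safe to try before any dataset has been built.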
Similarly, load the health_raw dataset with:
./cool load \
health_raw \
datasets/health_raw/table.yaml \
datasets/health_raw/data.csv \
./CubeRepo
We use the health_raw dataset as an example to demonstrate cohort analysis.
./cool cohortselection \
./CubeRepo \
datasets/health_raw/sample_query_selection/query.json
where the arguments are:
./CubeRepo: the output directory for the compacted dataset
datasets/health_raw/sample_query_selection/query.json: the cohort query (in JSON)

./cool cohortquery \
./CubeRepo \
datasets/health_raw/sample_query_average/query.json
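The two health_raw commands above can be chained so that a failure in the first stops the run. The following is a minimal sketch. The wrapper `run_cohort_demo` and the `COOL` variable are hypothetical names introduced here, not part of COOL, and the query paths are the sample paths from this README.

```shell
#!/bin/sh
# Sketch only: run_cohort_demo is a hypothetical wrapper, not part of COOL.
# COOL is an assumed variable naming the launcher; it defaults to ./cool.
run_cohort_demo() {
  repo=$1      # cube repository, e.g. ./CubeRepo
  queries=$2   # dataset directory holding the sample queries, e.g. datasets/health_raw
  cool=${COOL:-./cool}
  # Return at the first failing step instead of continuing.
  "$cool" cohortselection "$repo" "$queries/sample_query_selection/query.json" || return 1
  "$cool" cohortquery "$repo" "$queries/sample_query_average/query.json" || return 1
}
```

After loading health_raw, it could be invoked as `run_cohort_demo ./CubeRepo datasets/health_raw`.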
We use the sogamo dataset as an example to demonstrate funnel analysis.
./cool funnelquery \
./CubeRepo \
datasets/sogamo/sample_funnel_analysis/query.json
We have provided examples in the olap-tpch directory.
The COOL system supports the CSV data format by default, and you can load the tpc-h dataset with the following command.
./cool load \
tpc-h-10g \
datasets/olap-tpch/table.yaml \
datasets/olap-tpch/scripts/data.csv \
./CubeRepo
Finally, a cube named tpc-h-10g will be generated under the ./CubeRepo directory.
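To confirm that a load step produced its cube, one option is to check for an entry with the cube's name under the repository directory. This is a minimal sketch. The helper `cube_exists` is hypothetical, and it assumes each cube appears as an entry named after the cube directly under ./CubeRepo, as the text above describes.

```shell
#!/bin/sh
# Sketch only: cube_exists is a hypothetical helper, not part of COOL.
# Assumes a loaded cube appears as an entry named after the cube directly
# under the repository directory.
cube_exists() {
  repo=$1
  cube=$2
  [ -e "$repo/$cube" ]
}

# Report which of the cubes loaded in this README are present in ./CubeRepo.
for cube in sogamo health_raw tpc-h-10g; do
  if cube_exists ./CubeRepo "$cube"; then
    echo "found: $cube"
  else
    echo "missing: $cube"
  fi
done
```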
Run Server
application.property file at the same level as the .jar file.
application.property file.
./cool server
COOL has a StorageService interface, which will allow COOL standalone server/workers (coming soon) to handle data movement between local storage and an external storage service. A sample implementation for HDFS connection can be found under hdfs-extensions.
Q. Cai, K. Zheng, H.V. Jagadish, B.C. Ooi, J.W.L. Yip. CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics, in Proceedings of the VLDB Endowment, 17(10), 2024.
Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, B. C. Ooi. Cool: a COhort OnLine analytical processing system, in 2020 IEEE 36th International Conference on Data Engineering, pp.577-588, 2020.
Q. Cai, Z. Xie, M. Zhang, G. Chen, H.V. Jagadish and B.C. Ooi. Effective Temporal Dependence Discovery in Time Series Data, in Proceedings of the VLDB Endowment, 11(8), pp.893-905, 2018.
Z. Xie, Q. Cai, F. He, G.Y. Ooi, W. Huang, B.C. Ooi. Cohort Analysis with Ease, in Proceedings of the 2018 International Conference on Management of Data, pp.1737-1740, 2018.
D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. Cohort Query Processing, in Proceedings of the VLDB Endowment, 10(1), 2016.