apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Remove raw data from binary release #7240

Open snleee opened 3 years ago

snleee commented 3 years ago

Currently, our binary distribution packs ~100MB of raw data for the Pinot quick-start scripts. Removing it would substantially reduce the size of our official binary distribution, which is currently over 500MB.

7.5M    ./examples/minions/batch/baseballStats/rawdata/baseballStats_data.csv
21M     ./examples/batch/githubEvents/rawdata_json_index/githubEvents_data.json
20M     ./examples/batch/githubEvents/rawdata_complexTypeHandling/githubEvents_data.json
7.5M    ./examples/batch/baseballStats/rawdata/baseballStats_data.csv
31M     ./examples/stream/airlineStats/sample_data/airlineStats_data.json
3.0M    ./examples/stream/airlineStats/sample_data/airlineStats_data.avro
...

For the files above, we should change the scripts to download the data instead of packing it into the release.
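A listing like the one above can be reproduced with a small helper, sketched below. The function name `list_large_files` and the 1 MiB threshold are illustrative assumptions, not part of the Pinot scripts, and the `M` size suffix relies on GNU or BSD `find` (it is not in POSIX).

```shell
# Hypothetical helper: list files under a directory whose size exceeds a
# threshold, largest first. Intended to be run from the root of an
# unpacked binary distribution.
list_large_files() {
  dir="${1:-./examples}"   # directory to scan (assumed default)
  min_size="${2:-+1M}"     # find -size expression; +1M means "more than 1 MiB"
  # du -k prints disk usage in KiB followed by the path; sort numerically, descending.
  find "$dir" -type f -size "$min_size" -exec du -k {} + | sort -rn
}
```

For example, `list_large_files ./examples +5M` would show only the files larger than 5 MiB, which helps decide which data sets are worth moving out of the release.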

xiangfu0 commented 3 years ago

Where can we download the data from? Maybe move the files to another directory and download them from a GitHub URL?

snleee commented 3 years ago

@xiangfu0

I was thinking of a similar approach:

  1. Download from git URL using curl

https://github.com/apache/pinot/raw/master/pinot-tools/src/main/resources/examples/batch/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01.avro

curl https://raw.githubusercontent.com/apache/pinot/master/pinot-tools/src/main/resources/examples/batch/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01.avro
  2. Run git clone and then copy the example files; git probably also supports checking out just a specific directory from the repository.
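As a rough sketch of option 1, a quick-start script could fetch each data file on first use and skip the download when the file is already present. The function names, the `REPO_RAW_BASE` variable, and the local directory layout below are assumptions for illustration; only the URL prefix is taken from the curl example above.

```shell
# Assumption: files stay reachable under the raw.githubusercontent.com
# layout shown in the curl example above.
REPO_RAW_BASE="https://raw.githubusercontent.com/apache/pinot/master/pinot-tools/src/main/resources"

# Build the download URL for a path relative to the resources directory.
example_data_url() {
  printf '%s/%s\n' "$REPO_RAW_BASE" "$1"
}

# Hypothetical helper: download one example file into dest_dir,
# keeping the relative path, unless it already exists locally.
fetch_example_data() {
  rel_path="$1"
  dest_dir="${2:-.}"
  dest="$dest_dir/$rel_path"
  if [ -f "$dest" ]; then
    return 0   # already downloaded; nothing to do
  fi
  mkdir -p "$(dirname "$dest")"
  # --fail makes curl exit non-zero on HTTP errors; the temporary .part
  # file avoids leaving a truncated file behind if the transfer breaks.
  curl --fail --silent --show-error --location \
       --output "$dest.part" "$(example_data_url "$rel_path")" \
    && mv "$dest.part" "$dest"
}
```

A script would then call something like `fetch_example_data examples/batch/airlineStats/rawdata/2014/01/01/airlineStats_data_2014-01-01.avro ./data` before loading the table. Pinning `master` to a release tag in `REPO_RAW_BASE` would probably be safer, so that a released binary keeps downloading the data it was tested with.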