apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

Make Gluten project size small #4976

Open ulysses-you opened 8 months ago

ulysses-you commented 8 months ago

Description

The repository is now 257MB even with a shallow clone of only the latest commit: git clone --depth 1 https://github.com/apache/incubator-gluten.git

The largest files and directories are:

27M ./docs/image/gluten_golden_file_upload.png
29M ./gluten-core/src/test/resources/tpch-data
30M ./backends-clickhouse/src/test/resources/tpch-data-bucket/parquet_bucket
40M ./gluten-celeborn/clickhouse/src/test/resources/tpch-data-ch

I think we can unify the test data and shrink the unnecessarily large image.
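
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch in Python. The function names path_sizes and largest are hypothetical helpers, not part of Gluten. It measures only the checked-out working tree and skips the .git directory, so its totals will be lower than the full clone size.

```python
import os
from pathlib import Path


def path_sizes(root, depth=3):
    """Sum file sizes in bytes, grouped by the first `depth` path components."""
    root = Path(root)
    totals = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata so only the working tree is measured.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            file_path = Path(dirpath) / name
            if file_path.is_symlink():
                continue  # do not follow or count symlinks
            parts = file_path.relative_to(root).parts
            key = "/".join(parts[:depth])
            totals[key] = totals.get(key, 0) + file_path.stat().st_size
    return totals


def largest(root, depth=3, top=10):
    """Return the `top` biggest (path, bytes) pairs, largest first."""
    sizes = path_sizes(root, depth)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling largest(".", depth=5) from the repository root returns the biggest paths up to five levels deep; dividing each size by 2**20 gives a figure in MB comparable to the list above.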

ulysses-you commented 8 months ago

/docs/image/gluten_golden_file_upload.png

cc @zwangsheng

29M ./gluten-core/src/test/resources/tpch-data
30M ./backends-clickhouse/src/test/resources/tpch-data-bucket/parquet_bucket
40M ./gluten-celeborn/clickhouse/src/test/resources/tpch-data-ch

cc @zzcclp @PHILO-HE

PHILO-HE commented 8 months ago

@ulysses-you, thanks for reporting this issue. Yes, we have noticed this as well. We have a few binary files that could perhaps be moved outside the repo. If we remove them completely from the git history, the historical commit hashes would have to change. cc @weiting-chen, @FelixYBW
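
To illustrate why the hashes change, here is a rough sketch in Python of how git derives object ids. It is a simplification under stated assumptions: only flat trees of regular files are handled, and the author identity, file names, and file contents are placeholders invented for the example. The point is that a commit id covers its parent id, so rewriting one early commit changes the id of every commit after it, even when a later commit's tree is unchanged.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 of '<kind> <size>' + NUL byte + body, as git hashes objects."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files):
    """Id of a flat tree; `files` maps file name to content bytes."""
    body = b""
    for name in sorted(files):
        blob = bytes.fromhex(git_object_id("blob", files[name]))
        body += b"100644 " + name.encode() + b"\0" + blob
    return git_object_id("tree", body)


def commit_id(tree, parent, message):
    """Id of a simplified commit with a fixed placeholder identity."""
    who = "Example Dev <dev@example.com> 0 +0000"  # hypothetical author
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return git_object_id("commit", "\n".join(lines).encode())


# Placeholder contents standing in for a README and a large image.
readme = {"README.md": b"docs\n"}
with_image = {**readme, "big.png": b"fake image bytes"}

# Original history: the image is added, then dropped in a later commit.
old_c1 = commit_id(tree_id(with_image), None, "add docs")
old_c2 = commit_id(tree_id(readme), old_c1, "drop image")

# Rewritten history: the image never existed in the first commit.
new_c1 = commit_id(tree_id(readme), None, "add docs")
new_c2 = commit_id(tree_id(readme), new_c1, "drop image")

# Same tree and message in the second commit, but a different parent id.
print(old_c2 != new_c2)
```

This also shows why simply deleting the files in a new commit keeps every existing hash intact but does not shrink the history: the old blobs remain reachable from the earlier commits.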

zzcclp commented 8 months ago

We can remove the data under ./gluten-celeborn/clickhouse/src/test/resources/tpch-data-ch after modifying the related unit tests.