Unclear behaviour of SparkColumnsToIndexSelector when DataFrame is empty

Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

https://qbeast.io/qbeast-our-tech/

Apache License 2.0


Unclear behaviour of SparkColumnsToIndexSelector when DataFrame is empty #295

Open osopardo1 opened 8 months ago

osopardo1 commented 8 months ago

What went wrong?

When auto indexing is enabled, we call SparkColumnsToIndexSelector to choose the best columns by which to group the data.

This selection is based on statistics and correlations of the data itself, but if no data is provided, the current default behavior is to select the first N columns of the schema.
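To make the fallback concrete, here is a minimal sketch in plain Python. It is not the qbeast-spark code: `select_columns_to_index`, `max_columns`, and `rank_by_statistics` are hypothetical names, rows are modelled as a list of dicts instead of a Spark DataFrame, and the statistics-based ranking is passed in as a callable because its details are not described here.

```python
from typing import Callable, Dict, List, Sequence


def select_columns_to_index(
    columns: Sequence[str],
    rows: Sequence[Dict[str, object]],
    max_columns: int,
    rank_by_statistics: Callable[[Sequence[str], Sequence[Dict[str, object]]], List[str]],
) -> List[str]:
    """Hypothetical stand-in for the selector's behaviour described above."""
    if not rows:
        # No data means no statistics or correlations to compute,
        # so fall back to the first N columns in schema order.
        return list(columns[:max_columns])
    # With data available, delegate to the (assumed) statistics-based ranking
    # and keep the top N columns it returns.
    return rank_by_statistics(columns, rows)[:max_columns]
```

With an empty input, the result depends only on the order of the columns in the schema, which is the behaviour this issue calls into question.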

We should decide whether that makes sense, and define the minimum number of columns to index.

osopardo1 commented 7 months ago

After some discussion, we agreed that, if the DataFrame is empty, it makes little sense to use AutoIndexing right away. The code should wait until some data is written before activating the feature.
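A rough sketch of the agreed behaviour, again in plain Python rather than the actual Scala code. The names `columns_to_index_for_write`, `auto_indexing_enabled`, and `select_columns` are assumptions made for illustration, and returning `None` to mean "no automatic selection for this write" is one possible convention, not something decided in this thread.

```python
from typing import Callable, Dict, List, Optional, Sequence


def columns_to_index_for_write(
    columns: Sequence[str],
    rows: Sequence[Dict[str, object]],
    auto_indexing_enabled: bool,
    select_columns: Callable[[Sequence[str], Sequence[Dict[str, object]]], List[str]],
) -> Optional[List[str]]:
    """Return the automatically selected columns, or None if selection is skipped."""
    if not auto_indexing_enabled:
        # Feature is off: nothing to select automatically.
        return None
    if not rows:
        # Empty input: defer auto indexing until a write that carries data.
        return None
    # Data is present, so the (assumed) selector can use its statistics.
    return select_columns(columns, rows)
```

What the caller does when it receives `None` (for example, requiring explicit columns or postponing indexing) is left open here.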