When enabling auto indexing, we call SparkColumnsToIndexSelector to choose the best columns for clustering the data.
This selection is based on statistics and correlations computed from the data itself, but if no data is provided, the current default behavior is to select the first N columns of the schema.
We should decide whether that default makes sense, and define the minimum number of columns to index.
After some discussion, we agreed that if the DataFrame is empty, it makes little sense to run auto indexing right away. The code should wait until some data is written before activating the feature.
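The agreed behavior can be sketched as follows. This is an illustrative Python model, not Qbeast's actual Scala implementation: the function `select_columns_to_index` and its parameters are hypothetical names, used only to show the guard on an empty DataFrame.

```python
# Hypothetical sketch of the agreed behavior; select_columns_to_index
# and its parameters are illustrative, not Qbeast's actual API.

def first_n_columns(schema, n):
    """Current fallback described in the issue: with no data to
    analyze, just take the first N columns of the schema."""
    return schema[:n]

def select_columns_to_index(schema, row_count, n=3):
    """Proposed guard: if the DataFrame is empty, defer auto indexing
    (return None) and wait until data is written; otherwise select
    columns (here the first-N fallback stands in for the real
    statistics-based selection)."""
    if row_count == 0:
        return None  # empty DataFrame: don't activate auto indexing yet
    return first_n_columns(schema, n)
```

With data present, selection proceeds; with an empty DataFrame, the caller gets `None` and should retry once rows have been written.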