Open yoid2000 opened 3 weeks ago
Basic suggested approach:
Establish a limit on the number of sub-tables (say 10k).
Add a phase whereby we determine which sub-tables we want to build without actually building them. We do this by first building all the 2dim subtables, measuring their dependence, and then using these measures to determine the quality of sub-tables.
We go through a process where we explore the most promising 3dim sub-tables (i.e. start with the highest dependence 2dim sub-tables), then the most promising 4dim etc, In the process, we record what max_weight
and merge_thresh
would have led to these sub-tables under the current blob-building procedure.
Currently we have code like this:
average_quality = sum(dependency_matrix[col, c] for c in cluster.columns) / len(cluster.columns)
# Skip if below threshold or above weight limit.
if average_quality < merge_thresh or (
len(cluster.columns) > DERIVED_COLS_MIN and cluster.total_entropy + col_weights[col] > capacity
):
Context of above code is in a loop that goes through the permutation columns in order, and measures the sub-table quality as it does until quality drops below merge_thresh
.
Note merge_thresh
is default 0.1. As the number of columns in a sub-table increases, the effect of a single low-dependence pair is reduced. But this might not matter in practice (i.e., we don't need some additional parameter like min_dependence
or something)
max_weight
is related to the sum of column weights (default 15). This puts some kind of limit on sub-table size. Column weights are computed like this:
col_weights = list(_col_weight(col) for col in context.entropy_1dim)
So, related to the entropy of the column, which basically is determined by the number of distinct vals IIRC.
ok, for now, just try to measure average_quality
and col_weights
and see how they behave.
I suspect the way to do this is to use average_quality
to find the best sub-tables in the builder, and then use a solver to find the best clustering in the reader.
Currently the blob system doesn't scale beyond 10 or so columns. This is because we currently build every possible sub-table.
What is needed instead are smarter sub-tables, where we avoid building sub-tables that have low quality because one or more columns are independent of the others.