diffix / syndiffix

Python implementation of the SynDiffix synthetic data generation mechanism.

Scalable blob building #148

Open yoid2000 opened 3 weeks ago

yoid2000 commented 3 weeks ago

Currently the blob system doesn't scale beyond 10 or so columns, because we build every possible sub-table.

What is needed instead is a smarter selection of sub-tables, where we avoid building sub-tables that have low quality because one or more columns are independent of the others.

yoid2000 commented 3 weeks ago

Basic suggested approach:

Establish a limit on the number of sub-tables (say 10k).

Add a phase where we determine which sub-tables we want to build without actually building them. We do this by first building all the 2dim sub-tables, measuring their dependence, and then using these measures to estimate the quality of larger sub-tables.

We go through a process where we explore the most promising 3dim sub-tables (i.e. start with the highest-dependence 2dim sub-tables), then the most promising 4dim, etc. In the process, we record what max_weight and merge_thresh would have led to these sub-tables under the current blob-building procedure. A rough sketch of this exploration is given below.
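A minimal sketch of what this selection phase could look like, assuming a precomputed pairwise dependence matrix (here `dep`, indexable as `dep[a][b]`) and scoring a candidate sub-table by its average pairwise dependence. The names `select_subtables`, `avg_pairwise_dependence`, and `max_subtables` are illustrative and not part of the current code, and the recording of the equivalent max_weight / merge_thresh values is omitted:

    from __future__ import annotations

    import heapq
    from itertools import combinations


    def avg_pairwise_dependence(cols, dep):
        # Quality of a candidate sub-table: mean dependence over all column pairs.
        pairs = list(combinations(cols, 2))
        return sum(dep[a][b] for a, b in pairs) / len(pairs)


    def select_subtables(n_cols, dep, max_subtables=10_000, merge_thresh=0.1):
        # Start from all 2dim sub-tables and repeatedly extend the most promising
        # candidate by one column, without ever materializing the data.
        selected = set()
        frontier = [(-dep[a][b], (a, b)) for a, b in combinations(range(n_cols), 2)]
        heapq.heapify(frontier)  # max-heap on quality, via negated scores
        while frontier and len(selected) < max_subtables:
            neg_q, cols = heapq.heappop(frontier)
            if cols in selected:
                continue  # may have been pushed more than once via different parents
            if -neg_q < merge_thresh:
                break  # everything left on the heap is below the quality threshold
            selected.add(cols)
            # Propose (k+1)-dim extensions of this sub-table.
            for col in range(n_cols):
                if col in cols:
                    continue
                ext = tuple(sorted(cols + (col,)))
                if ext not in selected:
                    q = avg_pairwise_dependence(ext, dep)
                    if q >= merge_thresh:
                        heapq.heappush(frontier, (-q, ext))
        return selected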

yoid2000 commented 3 weeks ago

Currently we have code like this:

    average_quality = sum(dependency_matrix[col, c] for c in cluster.columns) / len(cluster.columns)

    # Skip if below threshold or above weight limit.
    if average_quality < merge_thresh or (
        len(cluster.columns) > DERIVED_COLS_MIN and cluster.total_entropy + col_weights[col] > capacity
    ):

The context of the above code is a loop that goes through the columns of a permutation in order, measuring the sub-table quality as it goes, until the quality drops below merge_thresh.
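For readability, here is a simplified, self-contained paraphrase of roughly what that loop does. The Cluster class, the grow_cluster name, the DERIVED_COLS_MIN value, and the choice to skip (rather than stop at) a rejected column are assumptions for illustration, not the actual implementation:

    from dataclasses import dataclass, field


    DERIVED_COLS_MIN = 1  # placeholder value, not the real constant


    @dataclass
    class Cluster:
        columns: list = field(default_factory=list)
        total_entropy: float = 0.0


    def grow_cluster(permutation, dependency_matrix, col_weights,
                     merge_thresh=0.1, capacity=15.0):
        # Walk the permuted columns in order; add a column only while its average
        # dependence with the columns already in the cluster stays at or above
        # merge_thresh and the total column weight stays within capacity.
        cluster = Cluster([permutation[0]], col_weights[permutation[0]])
        for col in permutation[1:]:
            average_quality = sum(dependency_matrix[col][c] for c in cluster.columns) / len(cluster.columns)
            # Skip if below threshold or above weight limit.
            if average_quality < merge_thresh or (
                len(cluster.columns) > DERIVED_COLS_MIN
                and cluster.total_entropy + col_weights[col] > capacity
            ):
                continue
            cluster.columns.append(col)
            cluster.total_entropy += col_weights[col]
        return cluster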

Note that merge_thresh defaults to 0.1. As the number of columns in a sub-table increases, the effect of a single low-dependence pair on the average is diluted. But this might not matter in practice (i.e., we may not need some additional parameter like min_dependence or something).
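To make the dilution concrete, a toy calculation with made-up numbers:

    # The candidate column has dependence 0.02 with one cluster column
    # (well below merge_thresh = 0.1) but 0.4 with three others.
    deps = [0.02, 0.4, 0.4, 0.4]
    average_quality = sum(deps) / len(deps)  # 0.305
    print(average_quality >= 0.1)            # True: the single weak pair is masked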

max_weight (default 15) limits the sum of the column weights, which puts some kind of cap on sub-table size. Column weights are computed like this:

    col_weights = list(_col_weight(col) for col in context.entropy_1dim)

So the weight is related to the entropy of the column, which is basically determined by the number of distinct values, IIRC.
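As a reminder of why entropy tracks the number of distinct values, a quick illustration using plain Shannon entropy (not necessarily the exact quantity stored in context.entropy_1dim or used by _col_weight):

    import math
    from collections import Counter


    def shannon_entropy(values):
        # Shannon entropy (in bits) of the empirical distribution of values.
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())


    print(shannon_entropy([1, 1, 1, 1]))  # 0.0   (one distinct value)
    print(shannon_entropy([1, 2, 3, 4]))  # 2.0   (four equally likely values)
    print(shannon_entropy([1, 1, 1, 2]))  # ~0.81 (four values but skewed)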

OK, for now, just try to measure average_quality and col_weights and see how they behave.
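One cheap way to see how they behave might be to dump summary statistics of the pairwise dependence scores and the column weights before any clustering decisions are made. This assumes dependency_matrix is a square numpy array and col_weights is as above; the summarize name is illustrative:

    import numpy as np


    def summarize(dependency_matrix, col_weights):
        # Pairwise dependence scores, upper triangle only (excluding the diagonal).
        iu = np.triu_indices_from(dependency_matrix, k=1)
        deps = dependency_matrix[iu]
        print("dependence: min %.3f median %.3f max %.3f frac>=0.1 %.2f"
              % (deps.min(), np.median(deps), deps.max(), (deps >= 0.1).mean()))
        weights = np.asarray(col_weights)
        print("col_weights: min %.2f median %.2f max %.2f sum %.2f"
              % (weights.min(), np.median(weights), weights.max(), weights.sum()))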

yoid2000 commented 2 weeks ago

I suspect the way to do this is to use average_quality to find the best sub-tables in the builder, and then use a solver to find the best clustering in the reader.
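A back-of-the-envelope sketch of what such a reader-side solver could look like, here as a greedy cover that picks high-average_quality sub-tables until every column is covered. This is purely illustrative; a real solver might use an ILP or the existing clustering machinery, and greedy_clustering is a made-up name:

    def greedy_clustering(subtables, n_cols):
        # subtables: dict mapping a tuple of column indices to its average_quality.
        # Greedily pick the sub-table that covers the most uncovered columns,
        # breaking ties by quality, until every column is covered.
        uncovered = set(range(n_cols))
        chosen = []
        while uncovered and subtables:
            best = max(
                subtables,
                key=lambda cols: (len(uncovered.intersection(cols)), subtables[cols]),
            )
            if not uncovered.intersection(best):
                break  # remaining columns appear in no sub-table; handle them separately
            chosen.append(best)
            uncovered -= set(best)
        return chosen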