Closed sebastian closed 3 years ago
Not quite sure what data in particular you have in mind, but I have been looking at column `smoker` in table `visits`.
The column data has a single entry, `{ "count": 3, "value": 2 }`. I think that when the correlations come into play, they cause the value to be suppressed: increasing the number of columns per grouping reduces the average count in each bucket, so buckets that were already on the verge of being suppressed in a one-dimensional query are likely to be suppressed in the two-dimensional query. I guess this doesn't even depend much on correlations: even a change in the noise factor could cause the count to dip below the low-count threshold.
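To illustrate the effect, here is a rough sketch in Python. It is not the actual anonymization logic: the threshold value, the `gender` column, and the row data are all made up for the example, and noise is not modelled at all. It only shows how a value that survives a one-column grouping can vanish once a second column splits its bucket.

```python
from collections import Counter

# Hypothetical low-count threshold; the real threshold and noise are not modelled.
LOW_COUNT_THRESHOLD = 3

def bucket_counts(rows, columns):
    """Count rows per combination of values in `columns`, dropping low-count buckets."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n >= LOW_COUNT_THRESHOLD}

# Made-up data: three rows with smoker == 2, nine rows with smoker == 0.
rows = (
    [{"smoker": 2, "gender": "F"}] * 2
    + [{"smoker": 2, "gender": "M"}] * 1
    + [{"smoker": 0, "gender": "F"}] * 4
    + [{"smoker": 0, "gender": "M"}] * 5
)

one_dim = bucket_counts(rows, ["smoker"])            # smoker == 2 is kept (count 3)
two_dim = bucket_counts(rows, ["smoker", "gender"])  # both smoker == 2 buckets are dropped

print(one_dim)
print(two_dim)
```

In the one-column result the value 2 sits exactly at the threshold, so any extra split (or any noise) is enough to lose it.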
Here are a couple of ways to mitigate this:
2. [...] `*` value when generating correlated samples. For any `*` values, we generate a value from the single-column distributions instead.

I think 2. is the better solution since it's not clear what the right threshold level would be in various situations. It can also be generalised to more than 2 columns:
1. [...] `N` columns.
2. [...] `*`, split the sample generation into `M` and `K` columns, where `M + K == N`, based on the correlation factors of every possible combination of `M` and `K`.
3. [...] `*`, repeat step 2.

There is also room for improvement when calculating the correlation factor. Currently it's based on a very coarse calculation that only considers how many buckets are non-suppressed. A more rigorous approach would involve calculating a matrix of prior expectation values based on the single-column distributions (`E(A)*E(B)`), comparing this to the matrix of posterior probabilities generated from the multi-column queries (`E(AB)`), and taking the square root of the sum of the squared differences (`SQRT( SUM_ALL( (E(AB) - E(A)*E(B))^2 ) )`) or something like this. Again, this can potentially be generalised to multiple dimensions.
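A minimal sketch of what that calculation could look like for two columns, in Python. This is only one reading of the formula above: it assumes `E(A)`, `E(B)` and `E(AB)` are the normalised bucket counts (i.e. probabilities), and the function names and input shapes are invented for the example.

```python
import math

def normalise(counts):
    """Turn a {value: count} mapping into {value: probability}."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def correlation_factor(counts_a, counts_b, counts_ab):
    """SQRT( SUM_ALL( (E(AB) - E(A)*E(B))^2 ) ) over every (a, b) combination.

    counts_a / counts_b: {value: count} from the single-column queries.
    counts_ab: {(a, b): count} from the two-column query; missing
    (e.g. suppressed) combinations are treated as probability 0.
    """
    p_a, p_b, p_ab = normalise(counts_a), normalise(counts_b), normalise(counts_ab)
    squared = sum(
        (p_ab.get((a, b), 0.0) - p_a[a] * p_b[b]) ** 2
        for a in p_a
        for b in p_b
    )
    return math.sqrt(squared)

counts_a = {0: 50, 1: 50}
counts_b = {"x": 50, "y": 50}

# Independent columns: the joint distribution equals the product of the marginals.
independent = {(0, "x"): 25, (0, "y"): 25, (1, "x"): 25, (1, "y"): 25}
# Fully correlated columns: each value of A always comes with the same value of B.
correlated = {(0, "x"): 50, (1, "y"): 50}

factor_independent = correlation_factor(counts_a, counts_b, independent)
factor_correlated = correlation_factor(counts_a, counts_b, correlated)
```

With these inputs the independent case gives 0 and the fully correlated case gives 0.5, so larger values mean the joint distribution deviates more from what the single-column data alone would predict.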
I think solution 2 is good. Generally it looks better to have some values, rather than no values. I.e. if there aren't any values for 3-column correlations then using 2-column correlations to get some value is better than nothing. And if nothing is available for 2-column correlations, then the single column base case is again better than nothing. etc etc
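Roughly, the fallback chain could be sketched like this in Python. It is an illustration, not the existing sample generator: the `distributions` layout (bucket lists keyed by column tuples, with tuple values) is an assumption, and the split simply halves the column list instead of picking `M` and `K` by correlation factor.

```python
import random

SUPPRESSED = "*"

def draw(distributions, columns, rng):
    """Draw one value tuple for `columns`; None if no data or the draw is suppressed."""
    buckets = distributions.get(tuple(columns))
    if not buckets:
        return None
    values = [bucket["value"] for bucket in buckets]
    weights = [bucket["count"] for bucket in buckets]
    value = rng.choices(values, weights=weights, k=1)[0]
    return None if SUPPRESSED in value else value

def sample(distributions, columns, rng):
    """Sample `columns` together if possible, otherwise fall back to smaller groups."""
    columns = tuple(columns)
    value = draw(distributions, columns, rng)
    if value is not None:
        return dict(zip(columns, value))
    if len(columns) == 1:
        return {columns[0]: None}  # nothing available even in the base case
    # Naive split into two halves; choosing the split by correlation factor is not modelled.
    mid = len(columns) // 2
    result = sample(distributions, columns[:mid], rng)
    result.update(sample(distributions, columns[mid:], rng))
    return result

# Made-up data: the two-column bucket is suppressed, the single-column ones are not.
distributions = {
    ("smoker", "gender"): [{"value": ("*", "*"), "count": 5}],
    ("smoker",): [{"value": (2,), "count": 3}],
    ("gender",): [{"value": ("F",), "count": 4}],
}

result = sample(distributions, ["smoker", "gender"], random.Random(0))
print(result)
```

Here the two-column draw is suppressed, so both columns end up being filled from their single-column distributions, which is the "better than nothing" behaviour described above.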
We decided this might be a lot of work. Let's leave this as a task that can be taken on if time permits, but that we will otherwise ignore for the time being.
When looking at the results of the Clinic dataset I notice that the sample table is missing values for many columns that actually do have sample values in the per-column data. Why is this?