diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Correlation analysis limited to 31 columns. #345

Closed dandanlen closed 3 years ago

dandanlen commented 3 years ago

(See #344)

Correlation analysis is limited to 31 useful columns, where useful means the column is not invariant. For example, a column with all null values would not be counted here.

The downside is that if a table has more than 31 useful columns, the columns beyond that limit will not be considered for correlation analysis. Hopefully this is rare enough not to be a major issue, but in the long run it will need to be fixed.

An exhaustive fix involves breaking the correlation analysis queries up into sub-queries of no more than 31 columns each, and then recombining the results (probability matrices) for the correlation analysis. The trickiest part of implementing this will be managing the grouping_id and the associated column groupings for each partial query. The current implementation assumes a single grouping_id that can be used as a proxy for each grouping of columns.
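
A rough sketch of what the split and recombination could look like, in Python for brevity; this is not the explorer's actual code. All function names are hypothetical. It assumes that grouping_id is a bitmask over the sub-query's columns in which a set bit means the column is excluded from the grouping and the last column maps to the least-significant bit (the explorer's convention may differ), and that each pair of columns needs to appear together in at least one sub-query.

```python
from itertools import combinations

MAX_COLUMNS = 31  # the limit described in this issue


def plan_sub_queries(columns, max_columns=MAX_COLUMNS):
    """Split columns into sub-queries of at most max_columns columns so that
    every pair of columns shares at least one sub-query (hypothetical helper)."""
    if max_columns < 2:
        raise ValueError("need at least 2 columns per sub-query")
    if len(columns) <= max_columns:
        return [list(columns)]
    # Two blocks of this size always fit into one sub-query together.
    block_size = max_columns // 2
    blocks = [columns[i:i + block_size] for i in range(0, len(columns), block_size)]
    return [a + b for a, b in combinations(blocks, 2)]


def columns_in_grouping(grouping_id, sub_query_columns):
    """Translate a sub-query-local grouping_id into column names.
    Assumed convention: set bit = column excluded, last column = lowest bit."""
    n = len(sub_query_columns)
    return [
        col
        for i, col in enumerate(sub_query_columns)
        if not ((grouping_id >> (n - 1 - i)) & 1)
    ]


def merge_partial_results(partials):
    """Recombine results keyed by the global set of column names rather than
    by grouping_id, which is only meaningful inside a single sub-query.
    partials: iterable of (sub_query_columns, [(grouping_id, value), ...])."""
    merged = {}
    for sub_query_columns, rows in partials:
        for grouping_id, value in rows:
            key = frozenset(columns_in_grouping(grouping_id, sub_query_columns))
            # The same grouping can come back from several sub-queries;
            # this sketch simply keeps the first result seen.
            merged.setdefault(key, value)
    return merged
```

The point of the sketch is the key used in `merge_partial_results`: once there are several partial queries, a grouping_id alone no longer identifies a grouping of columns, so the results are keyed by the column names themselves.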

sebastian commented 3 years ago

For the time being it's probably sufficient to describe this limitation in the docs.

dandanlen commented 3 years ago

Closing as won't fix. Issue #348 covers mentioning the limitation in the docs.