diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Validating quality of synthesized multi-column correlated data #293

Open sebastian opened 4 years ago

sebastian commented 4 years ago

Following our meeting I thought I'd document a few ways one could validate data quality:

For single or multi-column correlated generated datasets

Compare the generated data with the results from the unanonymized equivalent of the Aircloaked data source. This can be done by dividing the data into subdivisions and comparing what fraction of the values falls into each subdivision in the two datasets.

Example:
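A minimal sketch of such a subdivision comparison, assuming both datasets are available as rows of numeric columns and that a fixed-width grid is an acceptable way to subdivide them. The names `bucket_of`, `bucket_fractions` and `total_variation_distance` are hypothetical and are not part of explorer:

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Bucket = Tuple[int, ...]


def bucket_of(row: Sequence[float], widths: Sequence[float]) -> Bucket:
    """Map a row to the grid cell (subdivision) it falls into."""
    return tuple(int(value // width) for value, width in zip(row, widths))


def bucket_fractions(
    rows: Iterable[Sequence[float]], widths: Sequence[float]
) -> Dict[Bucket, float]:
    """Return the fraction of rows that lands in each subdivision."""
    counts = Counter(bucket_of(row, widths) for row in rows)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def total_variation_distance(
    reference: Dict[Bucket, float], synthetic: Dict[Bucket, float]
) -> float:
    """0.0 means identical bucket fractions, 1.0 means no overlap at all."""
    buckets = set(reference) | set(synthetic)
    return 0.5 * sum(
        abs(reference.get(b, 0.0) - synthetic.get(b, 0.0)) for b in buckets
    )
```

Since the buckets are built from all columns at once, a dataset whose individual columns look right but whose correlations are off will still put rows into the wrong cells. The bucket widths are a judgment call: too coarse hides differences, too fine leaves most cells nearly empty.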

Visual inspection of generated geo data

It turned out to be very useful to visually inspect the data quality of generated datasets. You can easily fool yourself into believing the data quality is appropriate if you only use some arbitrary abstract numerical metric. Fooling your eyes is altogether more difficult.

Generating a high-quality two-dimensional latitude/longitude dataset is quite trivial and you are likely to get very good results. Three-dimensional ones aren't all too bad either, but from experience, once you add in more dimensions, correlations quickly start suffering. This will be immediately obvious when visually inspecting geo-location data. I therefore encourage you to set up some geo rendering pipeline which renders locations as dots on a map. It will make correlation artifacts hard to ignore. The NYC taxi database is a good candidate for this (for example, you wouldn't expect a vertical line of dots in the middle of the water...)
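As a rough illustration of the idea, here is a standard-library-only sketch that bins (latitude, longitude) pairs into a grid and renders the density as text. It stands in for a real map renderer and is an assumption on my part, not the pipeline used in explorer. The function names `rasterize` and `render` and the bounding-box parameters are hypothetical:

```python
from typing import Iterable, List, Tuple


def rasterize(
    points: Iterable[Tuple[float, float]],
    lat_range: Tuple[float, float],
    lon_range: Tuple[float, float],
    rows: int = 20,
    cols: int = 40,
) -> List[List[int]]:
    """Count points per grid cell; row 0 is the northern edge of the box."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue  # ignore points outside the bounding box
        r = min(int((lat_max - lat) / (lat_max - lat_min) * rows), rows - 1)
        c = min(int((lon - lon_min) / (lon_max - lon_min) * cols), cols - 1)
        grid[r][c] += 1
    return grid


def render(grid: List[List[int]]) -> str:
    """Turn cell counts into characters; denser cells get darker glyphs."""
    shades = " .:*#"
    peak = max(max(row) for row in grid) or 1
    top = len(shades) - 1
    lines = []
    for row in grid:
        # Ceiling division so any non-empty cell shows at least a '.'
        lines.append(
            "".join(shades[(count * top + peak - 1) // peak] for count in row)
        )
    return "\n".join(lines)
```

Printing `render(rasterize(...))` for the raw and the generated data side by side already exposes gross artifacts such as straight lines of points. For real inspection you would want the dots drawn on an actual map so that points in the water stand out.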

Comparing distribution characteristics

The Accord library you have in place for determining characteristics of the numerical distributions could be used to generate characteristics of both the dataset generated using explorer and the raw data. These parameters could then be compared, allowing for a certain delta. This does require a pretty good understanding of what these parameters mean, and how much deviation can be allowed.
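A sketch of what such a comparison could look like, using Python's `statistics` module rather than Accord, so none of this reflects Accord's API. The choice of characteristics, the relative tolerance of 5% and the names `summarize` and `deviations` are all assumptions for illustration:

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Compute a handful of distribution characteristics for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def deviations(
    raw: Sequence[float], synthetic: Sequence[float], tolerance: float = 0.05
) -> Dict[str, float]:
    """Return the characteristics whose relative difference exceeds tolerance."""
    raw_stats = summarize(raw)
    synthetic_stats = summarize(synthetic)
    failures = {}
    for name, expected in raw_stats.items():
        scale = abs(expected) or 1.0  # avoid dividing by zero
        relative = abs(synthetic_stats[name] - expected) / scale
        if relative > tolerance:
            failures[name] = relative
    return failures
```

An empty result means every characteristic is within the allowed delta. A single relative tolerance for all parameters is almost certainly too crude; in practice each parameter would probably need its own threshold, which is exactly where the understanding of what the parameters mean comes in.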

sebastian commented 3 years ago

This might be too complex a task for the remaining time. Or do you already have something in the works here?

dandanlen commented 3 years ago

Hah I was just looking at this and thinking the same. Andrei already built a basic webpage for visual inspection of the taxi dataset. Let's chat this afternoon, we can prioritise remaining tasks.

sebastian commented 3 years ago

Yes, let's chat this afternoon and prioritize remaining tasks

sebastian commented 3 years ago

To be done:

sebastian commented 3 years ago

See: https://github.com/diffix/explorer/issues/350