Open sebastian opened 4 years ago
This might be too complex a task for the remaining time, or do you already have something in the works here?
Hah I was just looking at this and thinking the same. Andrei already built a basic webpage for visual inspection of the taxi dataset. Let's chat this afternoon, we can prioritise remaining tasks.
Yes, let's chat this afternoon and prioritize remaining tasks
To be done:
Following our meeting I thought I'd document a few ways one could validate data quality:
For single or multi-column correlated generated datasets
Compare the generated data with the unanonymized equivalent of the Aircloaked data source. One way to do this is to split the data into subdivisions (for example value ranges) and compare what fraction of the values falls into each subdivision in both datasets.
Example:
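A minimal sketch of such a subdivision comparison, assuming a single numeric column and hand-picked bucket edges. The function names `bucket_fractions` and `max_fraction_gap` are made up for illustration and are not part of explorer or Aircloak:

```python
from collections import Counter


def bucket_fractions(values, edges):
    """Fraction of values in each half-open bucket [edges[i], edges[i+1]).

    Values outside all buckets are not counted in any bucket, but they
    still count towards the total, so the fractions may sum to less than 1.
    """
    counts = Counter()
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = len(values)
    return [counts[i] / total if total else 0.0 for i in range(len(edges) - 1)]


def max_fraction_gap(reference, generated, edges):
    """Largest absolute difference in bucket fractions between two datasets."""
    ref = bucket_fractions(reference, edges)
    gen = bucket_fractions(generated, edges)
    return max(abs(r - g) for r, g in zip(ref, gen))


# Hypothetical toy data: reference = unanonymized source, generated = synthetic.
reference = [1, 2, 3, 4]
generated = [1, 1, 3, 5]
edges = [0, 2, 4, 6]
gap = max_fraction_gap(reference, generated, edges)
```

For multi-column data the same idea would apply with buckets defined over tuples of columns, though the number of buckets grows quickly with each added column.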
Visual inspection of generated geo data
It turned out to be very useful to visually inspect the data quality of generated datasets. You can easily fool yourself into believing the data quality is appropriate if you only use some arbitrary abstract numerical metric. Fooling your eyes is altogether more difficult.
Generating a high quality two-dimensional latitude/longitude dataset is quite trivial and you are likely to get very good results. Three-dimensional ones aren't too bad either, but from experience, once you add in more dimensions the correlations quickly start suffering. This will be immediately obvious when visually inspecting geo-location data. I therefore encourage you to set up some geo rendering pipeline which renders locations as dots on a map. It will make correlation artifacts hard to ignore. The NYC taxi database is a good candidate for this (for example, you wouldn't expect a vertical line of dots in the middle of the water...).
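As a dependency-free stand-in for a real map rendering, the following sketch bins (latitude, longitude) pairs into a coarse grid and prints it as text. The function name `render_ascii`, the bounding box, and the sample points are all assumptions for illustration; a proper pipeline would draw the dots over actual map tiles:

```python
def render_ascii(points, lat_range, lon_range, rows=10, cols=20):
    """Render (lat, lon) points as a text grid: '#' = at least one point."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        # Skip points outside the bounding box.
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        # Row 0 is the northern edge, so latitude is flipped.
        r = int((lat_max - lat) / (lat_max - lat_min) * rows)
        c = int((lon - lon_min) / (lon_max - lon_min) * cols)
        grid[min(r, rows - 1)][min(c, cols - 1)] += 1
    return ["".join("#" if n else "." for n in row) for row in grid]


# Hypothetical sample: one point in the north-west quarter of the box.
lines = render_ascii([(40.75, -74.75)], (40.0, 41.0), (-75.0, -74.0), rows=2, cols=2)
print("\n".join(lines))
```

Even at this resolution, an artifact such as a straight column of occupied cells where you would expect water should stand out when the generated and raw grids are printed side by side.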
Comparing distribution characteristics
The Accord library you have in place for determining characteristics of the numerical distributions could be used to generate characteristics of both the dataset generated using explorer and the raw data. These parameters could then be compared, allowing for a certain delta. This does require a pretty good understanding of what these parameters mean, and how much of a deviation can be allowed.
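The comparison could look roughly like the sketch below. It uses Python's standard `statistics` module rather than Accord, and the chosen characteristics (mean, standard deviation, median), the function names, and the tolerance values are assumptions picked purely for illustration:

```python
import statistics


def summarize(values):
    """A few simple distribution characteristics for one numeric column."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "median": statistics.median(values),
    }


def deviations(raw, generated, tolerances):
    """Return the characteristics whose absolute difference exceeds its delta."""
    raw_stats, gen_stats = summarize(raw), summarize(generated)
    result = {}
    for name, delta in tolerances.items():
        diff = abs(raw_stats[name] - gen_stats[name])
        if diff > delta:
            result[name] = diff
    return result


# Hypothetical toy data: the generated column is shifted by 10.
raw = [1, 2, 3, 4, 5]
generated = [11, 12, 13, 14, 15]
failed = deviations(raw, generated, {"mean": 1.0, "stdev": 1.0})
```

Here only the mean is flagged, since shifting the data leaves the spread unchanged. Choosing sensible deltas per characteristic is the hard part and depends on the column in question.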