gda-score / code

Tools for generating General Data Anonymity Scores (www.gda-score.org)
MIT License
7 stars 1 forks source link

generateDbScript has incorrect column_label #27

Closed yoid2000 closed 5 years ago

yoid2000 commented 5 years ago

The column amount on table orders in database raw_banking has column_label enumerative. It should be continuous. This may be related to #26 . (Since this column is a real, it should no matter what be continuous.)

In addition, the column acct_date also should be continuous. This one would not be related to #26.

Ok, I see that the original issue has this statement:

If the bucket with the smallest count is less than 1/2 of the average bucket count, then label the column continuous. Otherwise label the column enumerative.

Regarding acct_date, if I make this query:

SELECT floor(acct_date/4000)*4000, count(*)
FROM orders
GROUP BY 1
order by 1

Then I get 5 buckets back. (I guess there are gaps in the number space which prevent 10 buckets?). The average count of these 5 are 1573, and the min is 753, which is less than half of the average. So I think this should have been labeled as continuous.

This is based on running the script in https://gist.github.com/srnb/f7679c432a87af88ed957318fe8815bb

srnb commented 5 years ago

As X=max(col)-min(col)/10 -> 4112.8. Select query based on this x value give buckets of six and average amounts to 1311 and the min is 753, which is greater than half of the average. So the column is marked as enumerative. Can you please confirm?.

yoid2000 commented 5 years ago

Ok, you are right...I used the wrong numbers.

But wow, that function is insanely sensitive to bucket size. If I change the bucket size to 4113, then I get seven buckets, and the minimum bucket has count of 7! I guess what is happening is that there are gaps in the number space, and the output varies a lot depending on whether a bucket overlaps into a gap or not.

Ok, let's leave it as it is (we could probably play with this more, since acct_date is in fact continuous and we are mis-labeling it, but I already said that we'll sometimes get this wrong so that's fine.