DataResponsibly / DataSynthesizer


Wrong Conditional Distributions Sensitivity #34

Closed ganevgv closed 3 years ago

ganevgv commented 3 years ago

Description

I believe the sensitivity used for the conditional distributions is wrong, and as a result you're injecting too little noise into the conditional distributions in correlated attribute mode when epsilon is enabled.

In Algorithm 1 of PrivBayes, the stated sensitivity is 2 / n_tuples because the Laplace noise is added to the joint probability (which is already materialised at that point). However, you apply the Laplace noise to the counts (as opposed to a probability), which have a sensitivity of 2 (not 2 / n_tuples), and only later normalise the counts to materialise the conditional probability. As a result, you're injecting too little noise into the conditional probabilities.

Code excerpts from DataSynthesizer/lib/PrivBayes.py:

# Applying Laplace to the counts:
noise_para = laplace_noise_parameter(k, num_attributes, num_tuples, epsilon)
# with noise_para = 2 * (num_attributes - k) / (num_tuples * epsilon)
laplace_noises = np.random.laplace(0, scale=noise_para, size=stats.index.size)
stats['count'] += laplace_noises

# Normalising counts to a probability distribution:
dist = normalize_given_distribution(stats_sub['count']).tolist()

In order to inject the correct amount of noise, 2 * (num_attributes - k) / (num_tuples * epsilon) should be changed to 2 * (num_attributes - k) / epsilon.
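For concreteness, here is a minimal, self-contained sketch of the proposed scale applied to toy counts (the function name proposed_laplace_scale and the numbers are made up for illustration, and the clip-and-renormalise step only stands in for normalize_given_distribution):

import numpy as np

# Proposed Laplace scale: sensitivity 2 for raw counts, with the factor
# (num_attributes - k) kept from the existing laplace_noise_parameter,
# and no division by num_tuples.
def proposed_laplace_scale(k, num_attributes, epsilon):
    return 2 * (num_attributes - k) / epsilon

counts = np.array([120.0, 80.0, 50.0])  # toy counts for one parent configuration
scale = proposed_laplace_scale(k=2, num_attributes=5, epsilon=0.1)
noisy_counts = counts + np.random.laplace(0, scale=scale, size=counts.size)

# Normalise the noisy counts into a conditional distribution.
dist = np.clip(noisy_counts, 0, None)
dist = dist / dist.sum() if dist.sum() > 0 else np.full_like(dist, 1 / dist.size)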

PS: thanks to @sofiane87 and @bristena-op for the helpful discussions and clarifications.

haoyueping commented 3 years ago

Thanks for your feedback and for reporting this error in the Laplace noise parameter. This bug has been fixed in 540a442867dd278a74af4c8f488fc6fadce7088a, where the parameter is (num_attributes - k) / epsilon. In this setting, 1 is chosen as the sensitivity of count values.
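Judging only from the description above, the fix amounts to something like the following sketch (the helper name and signature are taken from the excerpt in the issue; the actual commit may differ in details):

def laplace_noise_parameter(k, num_attributes, num_tuples, epsilon):
    # Sensitivity 1 for count values (presence/absence of one individual),
    # so the scale no longer divides by num_tuples (kept in the signature
    # here only to match the excerpt above).
    return (num_attributes - k) / epsilon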

reslbesl commented 3 years ago

I think that the new version should retain a sensitivity of 2 and only remove the factor of 1/num_tuples. See also this blog post on how to reason about the correct sensitivity of count queries: https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-25.md

haoyueping commented 3 years ago

The sensitivity of count queries can be either 1 or 2, depending on how the privacy of an individual is interpreted. A sensitivity of 1 corresponds to the "standard" definition of Differential Privacy, i.e., neighboring datasets differ in whether an individual is present or absent. The blog post at https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-25.md uses a sensitivity of 2, assuming that the values of one individual are altered between neighboring datasets. Please correct me if I'm wrong.
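As a toy illustration of the two conventions (not code from the repo): the L1 change of a count histogram is at most 1 under add/remove neighbors and can reach 2 under substitution.

from collections import Counter

def l1_distance(xs, ys):
    # L1 distance between the count histograms of two datasets.
    cx, cy = Counter(xs), Counter(ys)
    return sum(abs(cx[v] - cy[v]) for v in set(cx) | set(cy))

records = ['a', 'a', 'b', 'c']

# Presence/absence: removing one record changes a single cell by 1.
print(l1_distance(records, ['a', 'b', 'c']))        # 1

# Substitution: altering one record's value moves a unit of count from one
# cell to another, so two cells change and the L1 distance can reach 2.
print(l1_distance(records, ['a', 'b', 'b', 'c']))   # 2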

ganevgv commented 3 years ago

Thanks for your quick reply and fix @haoyueping! I'm closing the issue.

Re: sensitivity 1 or 2 - you're right, it could be either, depending on the interpretation. Technically, I think PrivBayes uses 2, but personally I'm okay with either.