DataResponsibly / DataSynthesizer

MIT License

Generated Bayesian network relationships are different in successive runs #32

Closed. gmingas closed this issue 3 years ago.

gmingas commented 3 years ago

Description

I am running DataSynthesizer on a toy data set with 8 columns and ~10,000 rows. Only 2 of the columns are correlated, so I expect to see this correlation in the synthetic data set when running in correlated mode.

I noticed that every time I run it (with all parameters held constant), the generated Bayesian network is different. I think this is because the greedy algorithm converges to a different solution each time. The two correlated columns are connected by a parent/child relationship only some of the time. When they are not connected, the correlation in the synthetic data is close to zero, as expected. I tried changing k from 2 to 4, but the same thing happens and the runtime increases a lot.

As a result, I also cannot compare how DataSynthesizer performs (in terms of correlations) when varying one of the parameters (e.g. epsilon), because the results are completely different whenever the right relationships are not included.

Is there a way to force the greedy algorithm to connect two of the columns with a parent/child relationship? Or to pass a pre-defined graph, allowing the user to define all relationships? In that case the only task for the algorithm would be the inference of the probability values.

Also, why is the greedy algorithm unable to detect the relationship between the two variables, given that it is the only pair that is correlated in the entire dataset?

haoyueping commented 3 years ago

The correlated columns are more likely to be connected than not, but they are not always connected. It depends on the following factors:

  1. The mutual information (MI) between the columns. Please check whether their MI is significantly greater than that of other column pairs. Even so, there is still a chance that the correlated columns are not connected.
  2. The amount of injected noise, which is controlled by epsilon. You may try greater epsilon values to make the constructed Bayesian network less noisy.
  3. DataSynthesizer is designed mainly for categorical values. For numerical values, DataSynthesizer first groups them into interval bins, then calculates the MI of the binned values. In this case, you may increase the number of bins with DataDescriber(histogram_bins=50).
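
To make the MI check in point 1 concrete, here is a minimal sketch that ranks column pairs by empirical mutual information. It uses only the Python standard library and is not DataSynthesizer's own code: the helper names `mutual_information` and `rank_column_pairs` and the toy columns are made up for illustration, and it works on raw values with no binning and no injected noise.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_column_pairs(columns):
    """Return (MI, name_a, name_b) for every column pair, highest MI first."""
    pairs = [
        (mutual_information(columns[a], columns[b]), a, b)
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical toy data: column 'b' copies 'a', column 'c' is unrelated to both.
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = rank_column_pairs(toy)
for mi, a, b in ranking:
    print(f"{a}-{b}: {mi:.3f} bits")
```

If the correlated pair does not clearly top this ranking on the real (binned) data, it is less surprising that the noisy structure search sometimes misses it.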

Manually setting the Bayesian network structure is not recommended, since it would break the differential privacy requirement.

gmingas commented 3 years ago

Thanks a lot for the reply and the useful advice.

On breaking the differential privacy requirement by manually setting the network: Would it break the whole DP requirement or just the part of it that relates to the structure of the network? In other words, if I did that, would I still be able to claim that the algorithm is differentially private but excluding the structure of the network (i.e. the structure is not protected but everything else is protected)?

haoyueping commented 3 years ago

Yes. After the Bayesian network is constructed, noise is also injected into the conditional probability tables, so some of the differential privacy mechanism still remains.
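
As a rough illustration of what injecting noise into a conditional probability table can look like, here is a sketch that adds Laplace noise to counts, clips negatives, and renormalises each row. This is an assumption-laden toy, not DataSynthesizer's implementation: the function names, the example counts, and the noise scale of 1/epsilon are all chosen for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_conditional_table(counts, epsilon, rng):
    """Turn parent-value -> child-value counts into noisy conditional probabilities.

    counts maps each parent value to a dict of child-value counts.
    The noise scale 1/epsilon is a simplifying assumption for this sketch.
    """
    table = {}
    for parent, child_counts in counts.items():
        # Add noise to each count and clip negative results to zero.
        noisy = {
            child: max(0.0, c + laplace_noise(1.0 / epsilon, rng))
            for child, c in child_counts.items()
        }
        total = sum(noisy.values())
        k = len(noisy)
        # Renormalise; fall back to uniform if noise wiped out every count.
        table[parent] = {
            child: (v / total if total > 0 else 1.0 / k)
            for child, v in noisy.items()
        }
    return table


rng = random.Random(0)
# Hypothetical counts for a child column y given a parent column x.
counts = {"x=0": {"y=0": 90, "y=1": 10}, "x=1": {"y=0": 15, "y=1": 85}}
cpt = noisy_conditional_table(counts, epsilon=1.0, rng=rng)
for parent, row in cpt.items():
    print(parent, {child: round(p, 3) for child, p in row.items()})
```

Smaller epsilon values give a larger noise scale, so the rows drift further from the true conditional frequencies.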

gmingas commented 3 years ago

Thanks a lot!