Closed gmingas closed 3 years ago
The correlated columns are more likely to be connected than not. They are not always connected. It depends on two factors:
epsilon
values to make the constructed Bayesian Network less noisy.
DataDescriber(histogram_bins=50)
Manually setting the Bayesian network structure is not recommended, since it would break the differential privacy requirement.
Thanks a lot for the reply and the useful advice.
On breaking the differential privacy requirement by manually setting the network: Would it break the whole DP requirement or just the part of it that relates to the structure of the network? In other words, if I did that, would I still be able to claim that the algorithm is differentially private but excluding the structure of the network (i.e. the structure is not protected but everything else is protected)?
Yes. After the construction of the Bayesian network, the conditional probability tables will also be injected with noise. So there is still some differential privacy mechanism remaining.
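To make the point about noisy conditional probability tables concrete, here is a minimal sketch of the general idea for a single, fixed parent/child edge: count the joint occurrences, add Laplace noise to each count, clip negatives, and renormalise per parent value. This is not DataSynthesizer's code. The function names, the `scale` parameter and the clipping step are assumptions made for illustration, and how `scale` should be derived from epsilon is deliberately left open.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_conditional_table(pairs, parent_values, child_values, scale, seed=None):
    """Estimate P(child | parent) from (parent, child) pairs with noisy counts.

    `scale` is the Laplace scale (hypothetical parameter); scale=0 disables noise.
    """
    rng = random.Random(seed)
    counts = Counter(pairs)
    table = {}
    for p in parent_values:
        noisy = []
        for c in child_values:
            noise = laplace_noise(scale, rng) if scale > 0 else 0.0
            # Clip at zero so the row can be normalised into probabilities.
            noisy.append(max(0.0, counts[(p, c)] + noise))
        total = sum(noisy)
        if total == 0:
            # Fall back to a uniform row if everything was clipped away.
            table[p] = {c: 1.0 / len(child_values) for c in child_values}
        else:
            table[p] = {c: v / total for c, v in zip(child_values, noisy)}
    return table


pairs = [("a", "x")] * 3 + [("a", "y")] + [("b", "y")] * 4
exact = noisy_conditional_table(pairs, ["a", "b"], ["x", "y"], scale=0)
noisy = noisy_conditional_table(pairs, ["a", "b"], ["x", "y"], scale=1.0, seed=0)
```

With a hand-picked structure, only this noise step would be protecting the data; the choice of which columns are connected would be made with direct knowledge of the data and would not be covered.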
Thanks a lot!
Description
I am running DataSynthesizer on a toy data set with 8 columns and ~10,000 rows. Only 2 of the columns are correlated so I expect to see this correlation in the synthetic data set when running in correlated mode.
I noticed that every time I run (with all parameters constant), the Bayesian network that is generated is different - I think this is a result of the greedy algorithm converging to a different solution. The two correlated columns are connected with a parent/child relationship only sometimes. When they are not connected, the correlation in the synthetic data is close to zero, as expected. I tried changing k from 2 to 4, but the same thing happens and the runtime increases a lot.
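One possible explanation for the run-to-run variation, sketched below, assumes that edges are chosen by a randomised, privacy-preserving selection step in the style of the exponential mechanism. That is an assumption about the internals, not something confirmed in this thread, and the candidate names, scores and `sensitivity` value are made up for illustration. Under that assumption, a small epsilon makes the best-scoring pair only slightly more likely to be picked than the others.

```python
import math
import random


def exponential_mechanism_pick(scores, epsilon, sensitivity, rng):
    # Sample a candidate with probability proportional to
    # exp(epsilon * score / (2 * sensitivity)).
    names = list(scores)
    top = max(scores.values())  # subtract the max for numerical stability
    weights = [
        math.exp(epsilon * (scores[n] - top) / (2.0 * sensitivity)) for n in names
    ]
    return rng.choices(names, weights=weights, k=1)[0]


def pick_rate(scores, target, epsilon, sensitivity=1.0, trials=2000, seed=0):
    # Fraction of trials in which `target` is the selected candidate.
    rng = random.Random(seed)
    hits = sum(
        exponential_mechanism_pick(scores, epsilon, sensitivity, rng) == target
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical scores: one genuinely dependent pair, two independent pairs.
scores = {"A-B": 1.0, "A-C": 0.0, "A-D": 0.0}
low_eps_rate = pick_rate(scores, "A-B", epsilon=0.1)
high_eps_rate = pick_rate(scores, "A-B", epsilon=50.0)
```

In this toy setting the dependent pair is chosen only about as often as chance when epsilon is small, and almost always when epsilon is large.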
As a result, I also cannot compare how DataSynthesizer performs (in terms of correlations) when varying one of the parameters (e.g. epsilon), because when the right relationships are not included the results are completely different.
Is there a way to force the greedy algorithm to connect two of the columns with a parent/child relationship? Or to pass a pre-defined graph, allowing the user to define all relationships? In that case the only task for the algorithm would be the inference of the probability values.
Also, why is the greedy algorithm unable to detect the relationship between the two variables, given that it is the only pair that is correlated in the entire dataset?
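A quick way to check how strong the dependency between two columns actually is, independently of the synthesizer, is to compute their empirical mutual information. The sketch below uses only the standard library and assumes both columns are discrete (continuous columns would need binning first); the function name is hypothetical and this is not a DataSynthesizer API.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Contribution p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Two identical balanced binary columns share one full bit of information.
dependent = mutual_information([0, 1] * 50, [0, 1] * 50)
# Columns covering all four combinations equally often share none.
independent = mutual_information([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25)
```

If the value for the correlated pair is only marginally larger than for the other pairs, a noisy selection step could plausibly miss it.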