diffix / syndiffix

Python implementation of the SynDiffix synthetic data generation mechanism.

Account for non-suppressed nodes during microdata #128

Closed yoid2000 closed 3 days ago

yoid2000 commented 8 months ago

This issue is a continuation of #125

I think the best way to avoid unnecessary * values is to explicitly record which values are safe to output, and check this during microdata creation.

We know which values are safe to output when building 1dim trees. Any singularity in a 1dim tree that passes LCF is safe to output.

My idea is to add a class SafeValues. It contains the safe values for each string-type column. I can populate SafeValues by walking the finished 1dim trees. Then during generate_microdata, when there is a non-singularity string bucket, we check SafeValues to see if the assigned float is safe, and if so we produce the full string. If not, we produce a * value.
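A minimal sketch of what such a container might look like. The method names (`add`, `is_safe`) and the choice to key by column index are assumptions for illustration only, not the project's actual API.

```python
from collections import defaultdict


class SafeValues:
    """Hypothetical container: per-column sets of float-encoded values that are safe to output."""

    def __init__(self):
        # column index -> set of float-encoded values that passed LCF in the 1dim tree
        self._safe = defaultdict(set)

    def add(self, column, value):
        self._safe[column].add(value)

    def is_safe(self, column, value):
        # .get avoids creating an empty entry for columns that were never populated
        return value in self._safe.get(column, set())


safe_values = SafeValues()
safe_values.add(0, 2.0)  # pretend value 2.0 in column 0 was a singularity that passed LCF
```

Keeping the lookup as a set membership test means the check during microdata creation stays cheap regardless of how many safe values a column has.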

@cristianberneanu what do you think?

cristianberneanu commented 8 months ago

Since this is needed for strings only, it would be best if only the StringConvertor class is affected. You could gather safe values during __init__ (you'll need to somehow get the counter factory object in order to create the right entity counter, like here) and handle the transform back in the from_interval callback.

cristianberneanu commented 8 months ago

> Then during generate_microdata, when there is a non-singularity string bucket, we check SafeValues to see if the assigned float is safe, and if so we produce the full string. If not, we produce a * value.

I don't understand this part. If a bucket interval is harvested, that means there are no singularities in that interval that are safe to output.

yoid2000 commented 8 months ago

> Since this is needed for strings only, it would be best if only the StringConvertor class is affected. You could gather safe values during __init__ (you'll need to somehow get the counter factory object in order to create the right entity counter, like here) and handle the transform back in the from_interval callback.

This specific issue is for strings, but in the future I'll want to apply similar kinds of rules to other types (I already have floats and ints in mind). So I'm currently thinking of touching the DataConverter class.

To gather safe values (singularity 1dim leaf nodes), I think the best way will be to walk the 1dim trees. This will be done after self.forest is run, but before materialize_tree. The resulting safe values then need to be handed to StringConvertor.
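The walk described here could be sketched roughly as follows. The `Node` class is a hypothetical stand-in for a 1dim tree node (the real tree classes, their fields, and how the LCF result is stored are not shown in this thread), so treat every name below as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node of a finished 1dim tree."""
    value: float            # float-encoded value; only meaningful for singularities
    is_singularity: bool    # True if the node covers exactly one distinct value
    passes_lcf: bool        # True if the node passed low-count filtering
    children: list = field(default_factory=list)


def collect_safe_values(root):
    """Return the values of all singularity leaves that pass LCF."""
    safe = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)  # branch: keep descending
        elif node.is_singularity and node.passes_lcf:
            safe.add(node.value)         # leaf singularity that passed LCF
    return safe


tree = Node(0.0, False, True, [
    Node(0.0, True, True),    # safe
    Node(1.0, True, False),   # singularity, but suppressed by LCF
    Node(2.0, False, True, [
        Node(2.0, True, True),   # safe
        Node(3.0, False, True),  # leaf, but not a singularity
    ]),
])
safe = collect_safe_values(tree)
```

The resulting set is what would be handed to the string convertor before the tree is materialized.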

> I don't understand this part. If a bucket interval is harvested, that means there are no singularities in that interval that are safe to output.

The values in the interval are safe to output if they are safe in the 1dim tree (i.e. pass LCF in the 1dim tree). So basically we select a value as usual (at random from the interval), and then we decide if it is a safe value or not.
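As a rough illustration of that select-then-check step, the sketch below assumes strings are encoded as integer indexes into a sorted list of distinct values and that the bucket interval is half-open. The function name and signature are hypothetical and do not reflect the real from_interval callback.

```python
import random


def string_from_interval(low, high, values, safe, rng):
    """Pick a value at random from [low, high), then output it only if it is safe."""
    # Select as usual: a random index inside the bucket interval
    index = rng.randrange(int(low), int(high))
    if float(index) in safe:
        return values[index]  # safe in the 1dim tree: produce the full string
    return "*"                # not known to be safe: produce a * value


values = ["apple", "banana", "cherry"]  # assumed sorted distinct strings of the column
safe = {1.0}                            # only "banana" passed LCF in the 1dim tree
rng = random.Random(0)
picked = string_from_interval(0, 3, values, safe, rng)
```

With this shape, a harvested interval can still yield a full string whenever the randomly selected value happens to be one that passed LCF on its own.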

yoid2000 commented 3 days ago

Closed with #131