Closed yoid2000 closed 3 days ago
Since this is needed for strings only, it would be best if only the StringConvertor class is affected. You could gather safe values during __init__ (you'll need to somehow get the counter factory object in order to create the right entity counter, like here) and handle the transform back in the from_interval callback.
Then during generate_microdata, when there is a non-singularity string bucket, we check SafeValues to see if the assigned float is safe, and if so we produce the full string. If not, we produce a * value.
I don't understand this part. If a bucket interval is harvested, that means there are no singularities in that interval that are safe to output.
Since this is needed for strings only, it would be best if only the StringConvertor class is affected. You could gather safe values during __init__ (you'll need to somehow get the counter factory object in order to create the right entity counter, like here) and handle the transform back in the from_interval callback.
This specific issue is for strings, but in future I'll want to apply some kinds of rules to other types (floats and ints I have already in mind). So I'm currently thinking of touching the DataConverter class.
To gather safe values (singularity 1dim leaf nodes), I think the best way will be to walk the 1dim trees. This will be done after self.forest is run, but before materialize_tree. The resulting safe values then need to be handed to StringConvertor.
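A minimal sketch of that walk, under stated assumptions: `Node`, its fields, and `collect_safe_values` are hypothetical stand-ins for illustration and are not the project's actual tree classes. It only shows the idea of collecting the singularity leaves that pass LCF.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a 1dim tree node; not the project's real class.
    low: float
    high: float
    passes_lcf: bool = False
    children: list["Node"] = field(default_factory=list)

    @property
    def is_singularity(self) -> bool:
        # Assumption: a singularity is a node whose interval covers exactly one value.
        return self.low == self.high


def collect_safe_values(root: Node) -> set[float]:
    """Return the values of all singularity leaves that pass LCF."""
    safe: set[float] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Inner node: keep descending.
            stack.extend(node.children)
        elif node.is_singularity and node.passes_lcf:
            safe.add(node.low)
    return safe


# Toy tree: values 1.0 and 3.0 are singularity leaves that pass LCF, 2.0 does not.
tree = Node(0.0, 4.0, True, [
    Node(0.0, 2.0, True, [Node(1.0, 1.0, True), Node(2.0, 2.0, False)]),
    Node(3.0, 3.0, True),
])
safe_values = collect_safe_values(tree)
```

The resulting set is what would be handed on to the string converter for that column.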
I don't understand this part. If a bucket interval is harvested, that means there are no singularities in that interval that are safe to output.
The values in the interval are safe to output if they are safe in the 1dim tree (i.e. pass LCF in the 1dim tree). So basically we select a value as usual (at random from the interval), and then we decide if it is a safe value or not.
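Roughly, the select-then-check step could look like the sketch below. The signature, the `safe_indexes` set, and the encoding of each string as its index in a sorted list are assumptions made for illustration. The real from_interval callback may look quite different.

```python
import random


def from_interval(low, high, values, safe_indexes, rng):
    """Pick a value from the interval; return its string if safe, else "*"."""
    # Assumption: each string is encoded as its index in the sorted list `values`.
    index = int(rng.uniform(low, high))
    # Clamp so that an edge-case draw cannot fall outside the list.
    index = min(max(index, 0), len(values) - 1)
    return values[index] if index in safe_indexes else "*"


values = ["apple", "banana", "cherry", "date"]
safe_indexes = {1, 3}  # hypothetical: only "banana" and "date" passed LCF in the 1dim tree
rng = random.Random(0)

inside_safe = from_interval(1.0, 1.5, values, safe_indexes, rng)    # always index 1
inside_unsafe = from_interval(2.0, 2.5, values, safe_indexes, rng)  # always index 2
```

Selection from the interval stays as it is today. Only the final decision between the full string and `*` depends on the safe set.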
Close with #131
This issue is a continuation of #125
I think the best way to avoid unnecessary * values is to explicitly record which values are safe to output, and check this during microdata creation. We know which values are safe to output when building 1dim trees. Any singularity in a 1dim tree that passes LCF is safe to output.
My idea is to add a class SafeValues. It contains the safe values for each string column type. I can populate SafeValues from walking the finished 1dim trees. Then during generate_microdata, when there is a non-singularity string bucket, we check SafeValues to see if the assigned float is safe, and if so we produce the full string. If not, we produce a * value.
@cristianberneanu what do you think?
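For concreteness, a rough sketch of what such a container might look like. The method names `add` and `is_safe`, and keying by column index, are assumptions for illustration only, not a settled design.

```python
from collections import defaultdict


class SafeValues:
    """Hypothetical container: per-column sets of floats that are safe to output."""

    def __init__(self) -> None:
        self._by_column: dict[int, set[float]] = defaultdict(set)

    def add(self, column: int, value: float) -> None:
        # Called while walking a finished 1dim tree, once per safe singularity.
        self._by_column[column].add(value)

    def is_safe(self, column: int, value: float) -> bool:
        # Use .get so that lookups do not create empty entries for unknown columns.
        return value in self._by_column.get(column, set())


safe = SafeValues()
safe.add(0, 2.0)  # column 0, encoded string value 2.0
```

During microdata generation, a non-singularity string bucket would then call `is_safe(column, assigned_float)` and fall back to `*` when it returns False.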