Tesorio / django-anon

:shipit: Anonymize production data so it can be safely used in not-so-safe environments
https://django-anon.readthedocs.io/en/latest/
MIT License
160 stars 5 forks source link

Using a number generator on an iterator seems questionable #49

Closed budlight closed 4 years ago

budlight commented 4 years ago

https://github.com/Tesorio/django-anon/blob/master/anon/utils.py#L217

I had issues with this using anon.fake_phone_number() producing duplicate phone numbers (unique=True)

caioariede commented 4 years ago

@budlight the utilities to produce fake data will not guarantee uniqueness, you can achieve that using a custom _cycle_over_sample_range variation, which is a private/internal API thus not documented:

from anon.utils import _cycle_over_sample_range

# sample of 1_000_000 unique phone numbers (without formatting)
random_phone_number = _cycle_over_sample_range(1000000000, 9999999999, 1000000)

# example
print(next(random_phone_numbers)) # 1761296935
print(next(random_phone_numbers)) # 6273149983
print(next(random_phone_numbers)) # 6225408363
...
budlight commented 4 years ago

that is actually better than what I went with initially

mobile = partial(faker.numerify, text='+1##########')

are you really so concerned about performance that you think that should not be the default for phone number?

caioariede commented 4 years ago

Even if we change it, we won’t be able to address uniqueness

budlight commented 4 years ago

yeah that is just a problem with the bulk update in general, the only way I could have solved this 100% correctly is with the ability to set all the values to None before using _cycle_over_sample_range(1000000000, 9999999999, 1000000)

probably the best solution is to clear all the unique values beforehand. That's all I could even see as a real improvement here is documentation suggesting to clear unique values if they accept None.