Tesorio / django-anon

:shipit: Anonymize production data so it can be safely used in not-so-safe environments
https://django-anon.readthedocs.io/en/latest/
MIT License

Update select_chunk_size and update_batch_size to better defaults #45

Closed caioariede closed 4 years ago

caioariede commented 4 years ago

Description

This updates the default values for select_chunk_size and update_batch_size to better-performing ones. Since these are optional values, most people are expected to use them without giving them much thought, so it is our responsibility to research and provide good defaults.

This PR description contains a little research on how the new values were decided.

Profiling

A sample dataset (SQLite database) was used to perform the profiling. The chosen database, from Kaggle, contains information about businesses and business owners from Brazil. The criteria used to select it were: SQL format, size larger than 300m, and the presence of text columns.

Once the database was chosen, a sample Django application was created with a single model (Socios) generated using inspectdb. Below is the resulting Anonymizer, built from that model and used for the profiling:

import anon
from .models import Socios

class SociosAnonymizer(anon.BaseAnonymizer):
    nome_empresa = anon.fake_name
    nome_socio = anon.fake_name
    qualificacao_socio = anon.fake_small_text

    class Meta:
        model = Socios

    def get_queryset(self):
        return Socios.objects.filter(id__lt=100001)  # limit to 100k rows
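
For context, a rough sketch of what the inspectdb-generated Socios model might look like, inferred from the fields referenced by the anonymizer above; the field types and table name are assumptions, not part of this PR:

from django.db import models

class Socios(models.Model):
    # Field names come from the anonymizer above; the types are assumptions,
    # since inspectdb derives them from the actual SQLite columns
    nome_empresa = models.TextField()
    nome_socio = models.TextField()
    qualificacao_socio = models.TextField()

    class Meta:
        managed = False      # inspectdb marks introspected tables as unmanaged
        db_table = 'socios'  # assumed table name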

The profiling script below runs the anonymization 10 times for each combination of select_chunk_size and update_batch_size:

from timeit import timeit
from itertools import product

def profile(select_chunk_size, batch_size):
    print(f"Profiling select_chunk_size={select_chunk_size} update_batch_size={batch_size}")
    stmt = f"SociosAnonymizer().run(select_chunk_size={select_chunk_size}, batch_size={batch_size})"
    print(timeit(stmt, setup="from app.anonymizers import SociosAnonymizer", number=10))

def run_all():
    values = (50, 100, 200, 500, 1000, 2000, 5000, 10000)
    # select_chunk_size must be at least as large as update_batch_size
    matrix = [(a, b) for a, b in product(values, values) if a >= b]
    # lower values run first
    matrix = sorted(matrix, key=lambda t: t[0] + t[1])

    for m in matrix:
        profile(*m)
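
Since the script has no entry point of its own, it is presumably invoked from a Django shell; a minimal sketch of that invocation (the app.profiling module name is hypothetical):

$ python manage.py shell
>>> from app.profiling import run_all  # hypothetical module holding the script above
>>> run_all()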

Results

Raw data / logging

Profiling select_chunk_size=50 update_batch_size=50
82.88725873899999
Profiling select_chunk_size=100 update_batch_size=50
75.46806942900001
Profiling select_chunk_size=100 update_batch_size=100
67.236436908
Profiling select_chunk_size=200 update_batch_size=50
71.06274498299999
Profiling select_chunk_size=200 update_batch_size=100
62.45213859500001
Profiling select_chunk_size=200 update_batch_size=200
58.45653642399998
Profiling select_chunk_size=500 update_batch_size=50
79.64461012999999
Profiling select_chunk_size=500 update_batch_size=100
60.47691527599994
Profiling select_chunk_size=500 update_batch_size=200
58.60087235399999
Profiling select_chunk_size=500 update_batch_size=500
55.75410631000011
Profiling select_chunk_size=1000 update_batch_size=50
65.45262689499998
Profiling select_chunk_size=1000 update_batch_size=100
57.77756815999999
Profiling select_chunk_size=1000 update_batch_size=200
57.20208233300002
Profiling select_chunk_size=1000 update_batch_size=500
55.73854682900003
Profiling select_chunk_size=1000 update_batch_size=1000
59.611847409000006
Profiling select_chunk_size=2000 update_batch_size=50
63.94193125300001
Profiling select_chunk_size=2000 update_batch_size=100
57.450949777999995
Profiling select_chunk_size=2000 update_batch_size=200
55.39397706
Profiling select_chunk_size=2000 update_batch_size=500
55.24465250999992
Profiling select_chunk_size=2000 update_batch_size=1000
58.533093904
Profiling select_chunk_size=2000 update_batch_size=2000
70.26242734399989
Profiling select_chunk_size=5000 update_batch_size=50
64.75540471800014
Profiling select_chunk_size=5000 update_batch_size=100
56.104626185999905
Profiling select_chunk_size=5000 update_batch_size=200
54.38015385799986
Profiling select_chunk_size=5000 update_batch_size=500
54.61595388499995
Profiling select_chunk_size=5000 update_batch_size=1000
58.209553556000174
Profiling select_chunk_size=5000 update_batch_size=2000
67.82383367500006
Profiling select_chunk_size=5000 update_batch_size=5000
116.39055973399991
Profiling select_chunk_size=10000 update_batch_size=50
75.9571212410001
Profiling select_chunk_size=10000 update_batch_size=100
57.05863943700001
Profiling select_chunk_size=10000 update_batch_size=200
55.227523180999924
Profiling select_chunk_size=10000 update_batch_size=500
55.403603475999944
Profiling select_chunk_size=10000 update_batch_size=1000
64.09162193899965
Profiling select_chunk_size=10000 update_batch_size=2000
77.74540494799976
Profiling select_chunk_size=10000 update_batch_size=5000
113.76159923799969
Profiling select_chunk_size=10000 update_batch_size=10000
194.02536758299993

Visualization

To better visualize the data, a heatmap was created. Darker greens indicate better performance; the number in each cell is the total time in seconds for that combination:

[Heatmap image: total seconds per select_chunk_size / update_batch_size combination]
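
A minimal sketch of how such a heatmap can be produced from the raw timings above, assuming pandas/seaborn (neither is part of this PR or of django-anon):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# (select_chunk_size, update_batch_size, seconds) tuples from the raw data above
timings = [
    (50, 50, 82.9), (100, 50, 75.5), (100, 100, 67.2),
    # ... remaining combinations from the log
]
df = pd.DataFrame(timings, columns=["select_chunk_size", "update_batch_size", "seconds"])
grid = df.pivot(index="update_batch_size", columns="select_chunk_size", values="seconds")

# Reversed Greens colormap: lower times (better performance) show as darker green
sns.heatmap(grid, annot=True, fmt=".1f", cmap="Greens_r")
plt.savefig("heatmap.png")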

Conclusion

update_batch_size: the visualization makes it clear that, beyond a certain point, increasing this value above 500 causes performance to degrade quickly. The difference in results between 200 and 500 is minimal.

select_chunk_size: the higher the better, but there is little gain above 5000, which seems to be the optimal value.

Since there is not much difference between an update_batch_size of 200 and 500, 200 was chosen as the default: it holds less memory for both query construction (ORM) and query execution (database).

End result:

select_chunk_size = 5000
update_batch_size = 200
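
With these defaults in place, a caller who needs different values can still pass them explicitly; a minimal sketch using the same keyword arguments the profiling script exercises (the override values are illustrative only):

from app.anonymizers import SociosAnonymizer

# Rely on the new defaults (select_chunk_size=5000, update_batch_size=200)
SociosAnonymizer().run()

# Or override them for a specific run
SociosAnonymizer().run(select_chunk_size=10000, batch_size=500)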

Todos

github-actions[bot] commented 4 years ago
File Coverage
All files 95% :white_check_mark:
anon/__init__.py 63% :white_check_mark:
anon/base.py 97% :white_check_mark:
anon/utils.py 87% :white_check_mark:
tests/compat.py 50% :white_check_mark:
tests/test_base.py 99% :white_check_mark:

Minimum allowed coverage is 50%

Generated by :monkey: cobertura-action against 81189d57f5d4a67fe015193e70b78fbad4b99b93

FabioFleitas commented 4 years ago

Interesting - but don't you think the performance might be dependent on the database specs?

caioariede commented 4 years ago

@FabioFleitas there are certainly other factors, for example the number of fields you update per record. If you update many fields on a single record, you will likely need to reduce update_batch_size so you don't end up with a huge query that makes the database suffer. On the other hand, if your Anonymizer makes very small updates, you can bump this up to a higher value.

Likewise, if you have very large tables, or plenty of memory, you may want a higher select_chunk_size so you hit the database fewer times to retrieve the data.

So yeah, this PR doesn't exempt anyone from tweaking these values to gain performance, which is a case-by-case exercise. It just tries to provide better defaults based on some profiling rather than pulled out of thin air.

This was an insight I had while running anonymization against a large database. I noticed that the current combination of update_batch_size and select_chunk_size was killing performance there; after tweaking those values, I got a performance improvement similar to the one described in this PR, roughly 16%.
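
A small sketch of the case-by-case tuning described above; the anonymizer names are hypothetical and the numbers are only illustrative:

# Many large fields rewritten per record: keep update batches small so each
# UPDATE query stays manageable for the database
HeavyAnonymizer().run(select_chunk_size=5000, batch_size=50)

# Few, small updates per record and plenty of memory: larger select chunks
# mean fewer round-trips to fetch the data
LightAnonymizer().run(select_chunk_size=10000, batch_size=500)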

FabioFleitas commented 4 years ago

Awesome stuff @caioariede