caioariede closed this pull request 4 years ago
File | Coverage | |
---|---|---|
All files | 95% | :white_check_mark: |
anon/__init__.py | 63% | :white_check_mark: |
anon/base.py | 97% | :white_check_mark: |
anon/utils.py | 87% | :white_check_mark: |
tests/compat.py | 50% | :white_check_mark: |
tests/test_base.py | 99% | :white_check_mark: |
Minimum allowed coverage is 50%
Generated by :monkey: cobertura-action against 81189d57f5d4a67fe015193e70b78fbad4b99b93
Interesting - but don't you think the performance might be dependent on the database specs?
@FabioFleitas there are some factors for sure, for example the number of updates you are doing per record. If you are doing many updates for a single record, you will likely need to reduce `update_batch_size` so you don't end up with a huge query that makes the database suffer. On the other hand, if your Anonymizer does really small updates, you can bump this to a higher value. Likewise, if you have really big tables, or a lot of memory, you may want to use a higher `select_chunk_size` to hit the database fewer times when retrieving the data.
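To make that concrete, here is a minimal sketch of the two tuning directions, assuming `select_chunk_size` and `update_batch_size` can be set as class attributes on an anonymizer; the models and values below are hypothetical:

```python
import anon

from myapp.models import AuditLog, Customer  # hypothetical models


class AuditLogAnonymizer(anon.BaseAnonymizer):
    # Many fields updated per record: keep batches small so a single
    # UPDATE query doesn't get huge.
    update_batch_size = 50  # hypothetical value

    ip_address = anon.fake_text
    user_agent = anon.fake_text

    class Meta:
        model = AuditLog


class CustomerAnonymizer(anon.BaseAnonymizer):
    # Big table, tiny updates, plenty of memory: fetch more rows per
    # SELECT and update more rows per batch.
    select_chunk_size = 10000  # hypothetical values
    update_batch_size = 500

    email = anon.fake_email

    class Meta:
        model = Customer
```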
So yeah, this PR doesn't exempt anyone from tweaking those values in order to gain performance, which is a case-by-case exercise. Instead, it just tries to provide better default values based on some profiling, rather than picking them out of thin air.
This was an insight I had when running anonymization against a large database. After tweaking those values, I noticed that the current combination of `update_batch_size` and `select_chunk_size` was killing performance. After changing them, I got a performance improvement similar to the one described in this PR, ~16%.
Awesome stuff @caioariede
Description
This updates the default values of `select_chunk_size` and `update_batch_size` to better-performing ones. Since these are optional values, most people are expected to use them without thinking much about them, so it's our responsibility to research and provide good defaults. This PR description contains a little research on how the new values were decided.
Profiling
A sample dataset (SQLite database) was used to perform profiling. The chosen database, from Kaggle, contains information about businesses and business owners from Brazil. The criteria used to select the database were: type SQL, size > 300 MB, must contain text columns.
Once the database was chosen, a sample Django application was created with a single model (`Socios`) generated using `inspectdb`. The resulting Anonymizer, created from that model, was used for profiling.
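As an illustration of what an Anonymizer for an `inspectdb`-generated model looks like in django-anon, here is a minimal sketch; the app path and field names below are hypothetical stand-ins for the real `Socios` columns:

```python
import anon

from sample_app.models import Socios  # hypothetical app path


class SociosAnonymizer(anon.BaseAnonymizer):
    # Replace identifying text columns with fake values.
    # Field names are hypothetical; the real ones come from inspectdb.
    nome_socio = anon.fake_name
    cnpj_cpf_do_socio = anon.fake_text

    class Meta:
        model = Socios
```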
The profiling script runs anonymization 10 times for each combination of `select_chunk_size` and `update_batch_size`.
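The original script isn't reproduced here, but the approach can be sketched as follows, assuming the chunk/batch values are settable as class attributes, using the `SociosAnonymizer` sketched above and a hypothetical `reset_database()` helper to restore the data between runs; the grids of values are also hypothetical:

```python
import itertools
import time

# Hypothetical grids of values to profile
CHUNK_SIZES = [1000, 2000, 5000, 10000]
BATCH_SIZES = [50, 100, 200, 500, 1000]

for chunk, batch in itertools.product(CHUNK_SIZES, BATCH_SIZES):
    SociosAnonymizer.select_chunk_size = chunk
    SociosAnonymizer.update_batch_size = batch

    total = 0.0
    for _ in range(10):  # 10 runs per combination
        reset_database()  # hypothetical helper: restore original data
        start = time.monotonic()
        SociosAnonymizer().run()
        total += time.monotonic() - start

    print(f"chunk={chunk} batch={batch} total={total:.1f}s")
```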
Results
Raw data / logging
Visualization
To better visualize the data, a heatmap was created. Darker greens indicate better performance. The number in each cell indicates the total seconds taken to run that specific combination:
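A heatmap like that can be produced from the raw timings with a few lines of pandas/seaborn; this is just a sketch, assuming the profiling output was collected as `(select_chunk_size, update_batch_size, seconds)` tuples in a `results` list:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# `results` is assumed to hold the profiling output gathered above
df = pd.DataFrame(
    results, columns=["select_chunk_size", "update_batch_size", "seconds"]
)
pivot = df.pivot(
    index="update_batch_size", columns="select_chunk_size", values="seconds"
)

# Reversed green palette so darker cells mean fewer seconds (better)
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="Greens_r")
plt.savefig("heatmap.png")
```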
Conclusion
`update_batch_size`: looking at the visualization, it's clear that at some point, increasing this value beyond `500` causes performance to drop sharply. The differences in results between `200` and `500` are minimal.

`select_chunk_size`: the higher the better, but we didn't see much gain when increasing this beyond `5000`, which seems to be the optimal value.

Since there is not much difference between an `update_batch_size` of `200` and `500`, it was decided to use `200` as the default, since it holds less memory during both query construction (ORM) and query execution (database).

End result: `select_chunk_size = 5000` and `update_batch_size = 200`.
Todos