Smile-SA / gdpr-dump

Utility that creates anonymized database dumps (MySQL only). Provides default config templates for Magento, Drupal and Shopware.
GNU General Public License v3.0

DX: Improve "Maximum retries of 100 reached without finding a unique value." message #109

Closed staabm closed 8 months ago

staabm commented 8 months ago

Preconditions

GdprDump Version: 4.0.3

PHP Version: any

Database Version: any

Steps to reproduce

When running gdpr-dump against a complex schema and a big dump, the process can take hours. When unique-value rules are not properly configured, "dump-time" errors can occur, such as:

```
mstaab@mst22:/cluster/www/www/www/gdpr-dump$ time ../gdpr-dump.phar config.yaml  > gdpr-dumped.sql.bz2
Maximum retries of 100 reached without finding a unique value.

real    37m4.517s
user    36m45.867s
sys     0m16.900s
```

The error message itself is not very helpful when a big config file containing several unique rules is involved.
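For context, unique-value rules are declared per column in the config. A minimal sketch of such a rule (table and column names here are hypothetical, not taken from my actual config):

```yaml
# Hypothetical excerpt of config.yaml: "unique: true" makes the converter
# retry until it produces a value it has not generated before. On a large
# table, a generator with too few possible outputs exhausts the 100 retries.
tables:
  customer_entity:
    converters:
      email:
        converter: randomizeEmail
        unique: true
```

With several such rules spread across a big config file, the current message gives no hint which one failed.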

Expected result

An error message that denotes which column of which table could not be uniquely randomized.
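For example, something like this (hypothetical wording):

```
Maximum retries of 100 reached without finding a unique value for column "email" of table "customer_entity".
```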

Actual result

Maximum retries of 100 reached without finding a unique value.

guvra commented 8 months ago

Fixed by #110

guvra commented 8 months ago

@staabm I forgot to warn you about something

You mentioned creating a dump of a big database, and in another issue you mentioned using the faker converter. In some situations it can have a noticeable impact on performance, because a lot of faker formatters are not optimized for being called billions of times.

This is why the default templates mainly use custom converters instead of faker.

However, it only starts being problematic when you're using it to convert at least hundreds of millions of values.
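To illustrate the difference (a hedged sketch; the table and column names are made up, but both converter styles follow gdpr-dump's config format):

```yaml
# Hypothetical comparison for the same column: a built-in converter vs. the
# generic faker converter. Whichever you choose is invoked once per row,
# so per-call cost adds up on very large tables.
tables:
  customers:
    converters:
      firstname:
        converter: anonymizeText       # built-in converter, cheap per call
      # heavier alternative via faker:
      # firstname:
      #   converter: faker
      #   parameters:
      #     formatter: firstName
```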

staabm commented 8 months ago

thanks for the heads up. I am already in the perf optimizing process and looking for bottlenecks.

see e.g.

* [PDO quote bottleneck php/php-src#13440](https://github.com/php/php-src/issues/13440)
* [Inline escape() method to improve performance ifsnop/mysqldump-php#277](https://github.com/ifsnop/mysqldump-php/pull/277)

guvra commented 8 months ago

> thanks for the heads up. I am already in the perf optimizing process and looking for bottlenecks.
>
> see e.g.
>
> * [PDO quote bottleneck php/php-src#13440](https://github.com/php/php-src/issues/13440)
> * [Inline escape() method to improve performance ifsnop/mysqldump-php#277](https://github.com/ifsnop/mysqldump-php/pull/277)

Nice!

FYI we switched to druidfi/mysqldump-php a while ago because the original repo was inactive, so you might also need to create a PR on their repo.

staabm commented 8 months ago

noted, thanks ;-)

https://github.com/druidfi/mysqldump-php/pull/34

staabm commented 8 months ago

40-50% faster dump with https://github.com/druidfi/mysqldump-php/pull/37

back-2-95 commented 8 months ago

I combined 3 performance-related PRs into one for testing: https://github.com/druidfi/mysqldump-php/pull/38

@guvra There are also commands if you want to test it with this library. My first test was good: the time went down from 40s to 5s when creating a dump.

guvra commented 8 months ago

@back-2-95 I made a quick test on a medium-sized Magento database.

So it looks pretty good :+1:

On another note, I used #113 to monitor the total execution time (this PR adds a progress bar to gdpr-dump when using the -v option). The progress bar uses setInfoHook from mysqldump-php to advance. However, it doesn't behave as expected, because this hook is triggered after a table has been dumped, not before. So when the progress bar displays a table name, that table has actually already been dumped, whereas I would expect to see the name of the table that is currently being dumped.
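For reference, a minimal sketch of how such a hook might be wired up, assuming the setInfoHook API mentioned above (the callback arguments and `$info` array keys below are assumptions, not verified against the library):

```php
<?php
// Sketch only: attaching an info hook to mysqldump-php.
// The callback signature and $info keys are assumptions.
use Druidfi\Mysqldump\Mysqldump;

$dump = new Mysqldump('mysql:host=localhost;dbname=app', 'user', 'pass');
$dump->setInfoHook(function ($object, $info) {
    if ($object === 'table') {
        // Caveat described above: this fires AFTER the table has been
        // dumped, so $info['name'] is the table that just finished,
        // not the one currently being dumped.
        fwrite(STDERR, "dumped {$info['name']} ({$info['rowCount']} rows)\n");
    }
});
$dump->start('dump.sql');
```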