dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Parallel blocking for pgsql_big_dedupe_example.py #114

Open fjsj opened 3 years ago

fjsj commented 3 years ago

Related to https://github.com/dedupeio/dedupe/issues/831

Tested with Python 3.7.7 on macOS 10.15.6 (19G2021) and PostgreSQL 12.3.

Running times for blocking only are:

Parallel pgsql_big_dedupe_example_settings.no-indexes:
real    0m58.265s
user    4m42.656s
sys     0m3.991s

Serial pgsql_big_dedupe_example_settings.no-indexes:
real    2m32.944s
user    2m1.261s
sys     0m1.676s

Parallel pgsql_big_dedupe_example_settings.with-indexes:
real    0m44.619s
user    1m58.348s
sys     0m1.527s

Serial pgsql_big_dedupe_example_settings.with-indexes:
real    1m0.310s
user    0m57.828s
sys     0m0.393s
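
To make the comparison easier to read, the wall-clock ("real") times above can be turned into speedup ratios. This is only a sketch for the arithmetic: the helper `to_seconds` and the `real_times` dictionary are names made up for this illustration and are not part of the example scripts or the test script.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '2m32.944s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times copied from the measurements above.
real_times = {
    "no-indexes": {"parallel": "0m58.265s", "serial": "2m32.944s"},
    "with-indexes": {"parallel": "0m44.619s", "serial": "1m0.310s"},
}

# Speedup = serial wall-clock time / parallel wall-clock time.
speedups = {
    name: to_seconds(times["serial"]) / to_seconds(times["parallel"])
    for name, times in real_times.items()
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x faster in parallel")
```

With these numbers the parallel version is about 2.6x faster for the no-indexes settings and about 1.35x faster for the with-indexes settings. The "user" times go the other way (total CPU time is higher in the parallel runs), which is expected when the work is spread across several processes.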

Both versions are available at commit c1c838485f6fee1958e4b4a35be58f5f035a3db5 along with a script for testing (pgsql_big_dedupe_example/test_parallel_vs_serial.sh).

The settings files and training files are here: settings-and-training.zip. The no-indexes settings were trained with 0.99 recall and the with-indexes settings with 0.90 recall.