dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License

ever increasing number of subprocesses in the StaticGazetteer #686

Closed FlorianHoppe closed 5 years ago

FlorianHoppe commented 5 years ago

Whenever I call the match function of the StaticGazetteer, two new subprocesses are spawned and never get cleaned up. This kills my server quite rapidly.

This is my sample code check_dedupe.py:

import dedupe

# due to privacy concerns I can't provide my trained model:
with open('some_setting_file', 'rb') as model_file:
    dedupe_model = dedupe.StaticGazetteer(model_file)
    datafields_for_dedupe_model = [datafield.name[1:datafield.name.index(':')] for datafield in
                                   dedupe_model.data_model.primary_fields]

# build a single placeholder record keyed by its id
entry = {'some_field': 'some data'}
record = {'some_id': entry}

dedupe_model.index(record)

matches = dedupe_model.match(record)
input("Press Enter to continue...")
matches = dedupe_model.match(record)
input("Press Enter to continue...")
matches = dedupe_model.match(record)
input("Press Enter to continue...")
matches = dedupe_model.match(record)
input("Press Enter to continue...")

When running this script, I can see with ps -ef that for each match call two subprocesses get created. E.g.:

UID        PID  PPID  C STIME TTY          TIME CMD
datasci+   506    27 18 08:29 pts/0    00:00:02 python3.6 check_dedupe.py
datasci+   510   506  0 08:29 pts/0    00:00:00 python3.6 check_dedupe.py
datasci+   511   506  0 08:29 pts/0    00:00:00 python3.6 check_dedupe.py
datasci+   516   506  0 08:29 pts/0    00:00:00 python3.6 check_dedupe.py
datasci+   517   506  0 08:29 pts/0    00:00:00 python3.6 check_dedupe.py

I'm running this in a Docker container with an ubuntu:16.04 base image using Python 3.6 (Python 3.6.5 (default, Mar 29 2018, 03:28:50) [GCC 5.4.0 20160609] on linux) and these Python packages:

Package                  Version
------------------------ --------
affinegap                1.10
BTrees                   4.5.0
categorical-distance     1.9
datetime-distance        0.1.3
dedupe                   1.9.2
dedupe-hcluster          0.3.5
dedupe-variable-datetime 0.1.5
DoubleMetaphone          0.1
fastcluster              1.1.25
future                   0.16.0
haversine                0.4.5
highered                 0.2.1
Levenshtein-search       1.4.4
numpy                    1.15.0
persistent               4.3.0
pip                      18.0
pyhacrf-datamade         0.2.2
PyLBFGS                  0.2.0.12
python-dateutil          2.7.3
rlr                      2.4.5
setuptools               40.0.0
simplecosine             1.2
simplejson               3.16.0
six                      1.11.0
wheel                    0.31.1
zope.index               4.3.0
zope.interface           4.5.0

fgregg commented 5 years ago

Do those subprocesses persist or do they eventually go away?

FlorianHoppe commented 5 years ago

In my tests they persist forever.

fgregg commented 5 years ago

Can you add "pool.join()" after this pool.close() and let me know if that resolves the issue for you? https://github.com/dedupeio/dedupe/blob/master/dedupe/core.py#L360
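
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the suggested change using the standard library's multiprocessing.Pool. It is not dedupe's actual code: the function name score_in_pool and the use of len as a stand-in for the real scoring work are assumptions made for illustration only. The point is that close() merely stops the pool from accepting new tasks, while join() waits for the worker processes to exit and reaps them.

```python
import multiprocessing


def score_in_pool(items, num_workers=2):
    """Hypothetical helper: run a task in a worker pool and clean up afterwards."""
    pool = multiprocessing.Pool(processes=num_workers)
    try:
        # len is only a placeholder for the real per-record work
        results = pool.map(len, items)
    finally:
        # close() tells the workers that no more tasks will be submitted ...
        pool.close()
        # ... and join() blocks until the worker processes have exited,
        # so they do not linger after each call
        pool.join()
    return results


if __name__ == "__main__":
    print(score_in_pool(["a", "bb", "ccc"]))
    # After join(), no pool workers should remain as children of this process
    print(multiprocessing.active_children())
```

Calling a helper like this repeatedly should leave the number of child processes constant between calls, which can be checked with ps -ef as in the report above.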

FlorianHoppe commented 5 years ago

That works for me. Thanks for this quick help!

I would have opened a pull request with a branch adding this fix, but I got a permission denied error when trying to push it to GitHub...

fgregg commented 5 years ago

closed by #690