dedupeio / dedupe-examples

Examples for using the dedupe library

candidates_gen memory error #55

Closed tendres closed 7 years ago

tendres commented 7 years ago

Running dedupe on a fairly large Postgres table and getting a reproducible memory error. Ubuntu 14.04, Python 3.6.0, Dedupe 1.6.12, Postgres 9.6.2.

With num_cores = 16, I receive the following error immediately upon starting 'clustering...':

Time to write smaller_coverage:
0.5914955576260884 minutes
clustering...
Traceback (most recent call last):
  File "dd_master2.py", line 312, in <module>
    threshold=0.5)
  File "/home/tom/.pyenv/versions/general/lib/python3.6/site-packages/dedupe/api.py", line 117, in matchBlocks
    threshold=0)
  File "/home/tom/.pyenv/versions/general/lib/python3.6/site-packages/dedupe/core.py", line 204, in scoreDuplicates
    [process.start() for process in map_processes]
  File "/home/tom/.pyenv/versions/general/lib/python3.6/site-packages/dedupe/core.py", line 204, in <listcomp>
    [process.start() for process in map_processes]
  File "/home/tom/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/tom/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/tom/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/tom/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/tom/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
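
(For reference, the fork failure above comes from scoreDuplicates spawning its worker processes: each worker is created with os.fork(), so the parent's already-large address space has to be committable once per worker, which can trip Errno 12 even though the copy is copy-on-write. In the dedupe 1.x API, num_cores is set on the deduper itself; a minimal sketch of that call pattern, where the settings file name and the candidate generator are placeholders rather than the reporter's actual script:)

```python
import dedupe

# num_cores controls how many scoring workers core.scoreDuplicates forks.
# Fewer workers means fewer os.fork() calls that each need the parent's
# memory footprint to be committable.
with open('dedupe_settings', 'rb') as sf:
    deduper = dedupe.StaticDedupe(sf, num_cores=2)

clustered_dupes = deduper.matchBlocks(candidates_gen(read_cur),
                                      threshold=0.5)
```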

With num_cores = 1, the process gets a little further, but not by much:

1320000 blocks
9626.000676631927 seconds
1330000 blocks
9626.786726236343 seconds
1340000 blocks
9627.571291208267 seconds
Killed

In my case, I found that the query feeding candidates_gen returns over 20 million rows, and this is where I think things fail. Should this be written to a table rather than relying on memory? This looks like the last hurdle before dedupe can scale to much larger numbers. Memory usage looked great up until this point.
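
(For what it's worth, candidates_gen in pgsql_big_dedupe_example.py is a generator that only ever holds one block of records at a time, so it should stay small provided the cursor feeding it streams rows from the server rather than fetching them all into client memory. Roughly, in simplified form, with column names taken from the donors example:)

```python
def candidates_gen(result_set):
    """Yield one block of candidate records at a time.

    result_set must be ordered by block_id; only the rows for the
    current block are ever held in memory.
    """
    block_id = None
    records = []
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records
            block_id = row['block_id']
            records = []
        smaller_ids = frozenset(row['smaller_ids'] or [])
        records.append((row['donor_id'], row, smaller_ids))
    if records:
        yield records
```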

fgregg commented 7 years ago

That query should be a server-side cursor. In other words, the data should not be loaded into memory. Is that not happening?
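
(For anyone landing here later: with psycopg2 the difference is whether the cursor is named. An unnamed cursor pulls the entire result set into client memory on execute(), while a named cursor is a server-side cursor that streams rows in itersize-sized batches. A minimal sketch, with the connection details and column list assumed from the example script rather than copied from it:)

```python
import psycopg2
import psycopg2.extras

con = psycopg2.connect(dbname='campaign_finance',
                       cursor_factory=psycopg2.extras.RealDictCursor)

# The name argument makes this a server-side cursor: Postgres holds the
# result set and hands over `itersize` rows per round trip.
read_cur = con.cursor('candidate_select')
read_cur.itersize = 50000

read_cur.execute("""
    SELECT donor_id, city, name, zip, state, address, block_id, smaller_ids
    FROM smaller_coverage
    INNER JOIN processed_donors USING (donor_id)
    ORDER BY block_id
""")

for row in read_cur:  # rows stream; client memory stays roughly constant
    ...
```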

tendres commented 7 years ago

I'm a Postgres guy, not a Python guy, so I'll do my best... Yes, there is a server-side cursor - but in pgsql_big_dedupe_example.py #319, are the results for c4 being stored in a temp table? In my case, whatever is happening causes heavy disk swap (kswapd) while slowly taking over the machine, until hours later the system kills the process. In keeping with this example's database-centric architecture, I thought this area of code might be a candidate for a table to alleviate the memory issues. Let me know what other information might be useful.
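
(If materializing does turn out to help, one way to try it while staying within the example's database-centric design would be to write the join into a temporary table first, so the big sort happens inside Postgres, and then open the server-side cursor against that table. A rough, untested sketch; the connection, table, and column names are assumptions, not the example's actual code:)

```python
import psycopg2
import psycopg2.extras

con = psycopg2.connect(dbname='campaign_finance',
                       cursor_factory=psycopg2.extras.RealDictCursor)

with con.cursor() as cur:
    # Let Postgres do the join/sort once and spill to its own temp space,
    # instead of re-sorting while dedupe iterates over the cursor.
    cur.execute("""
        CREATE TEMPORARY TABLE candidate_rows AS
        SELECT donor_id, city, name, zip, state, address, block_id, smaller_ids
        FROM smaller_coverage
        INNER JOIN processed_donors USING (donor_id)
        ORDER BY block_id
    """)
con.commit()

read_cur = con.cursor('candidate_rows_cursor')  # still server-side
read_cur.execute("SELECT * FROM candidate_rows ORDER BY block_id")
```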

tendres commented 7 years ago

Closing this - I have not seen this issue as of late.