That query should be a server-side cursor. In other words, the data should not be loaded into memory. Is that not happening?
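For context, this is roughly what a server-side cursor looks like in psycopg2; a named cursor streams rows from Postgres in batches rather than loading the whole result set client-side. The connection parameters, table, and column names here are made up for illustration:

```python
import psycopg2

conn = psycopg2.connect(dbname="dedupe_example")  # hypothetical connection

# Passing a name creates a server-side cursor: rows are fetched from the
# server in batches of `itersize` as the cursor is iterated, so the full
# result set never sits in client memory.
with conn.cursor(name="candidate_cursor") as cur:
    cur.itersize = 10000
    cur.execute("SELECT donor_id, name, address FROM processed_donors")
    for row in cur:
        print(row)  # placeholder: whatever consumes each row
```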
I'm a Postgres guy, not a Python guy, so I'll do my best... Yes, there is a server-side cursor - but in pgsql_big_dedupe_example.py #319, are the results for c4 being stored in a temp table? In my case, whatever is happening causes heavy disk swapping (kswapd) and slowly takes over the machine until, hours later, the system kills the process. In keeping with this example's architecture of using the database, I thought this area of code might be a candidate for a table to relieve the memory pressure. Let me know what other information might be useful.
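If it helps, here is a rough sketch of the temp-table approach I have in mind, assuming psycopg2. `blocking_map`, `candidate_pairs`, and the column names are hypothetical stand-ins, not what the example script actually builds:

```python
import psycopg2

conn = psycopg2.connect(dbname="dedupe_example")  # hypothetical connection

with conn.cursor() as cur:
    # Materialize the expensive self-join once, inside the database, so
    # the client never has to hold the full result set.
    cur.execute("""
        CREATE TEMP TABLE candidate_pairs AS
        SELECT a.donor_id AS left_id, b.donor_id AS right_id
        FROM blocking_map a
        JOIN blocking_map b
          ON a.block_key = b.block_key
         AND a.donor_id < b.donor_id
    """)

# Stream the materialized rows back with a server-side (named) cursor.
with conn.cursor(name="pairs_cursor") as read_cur:
    read_cur.itersize = 10000
    read_cur.execute("SELECT left_id, right_id FROM candidate_pairs")
    for left_id, right_id in read_cur:
        print(left_id, right_id)  # placeholder consumer
```

The idea is that the 20M+ row join happens entirely server-side, and the client only ever holds `itersize` rows at a time.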
Closing this - I haven't seen this issue as of late.
Running dedupe on a fairly large Postgres table and getting a reproducible memory error.

- Ubuntu 14.04
- Python 3.6.0
- Dedupe 1.6.12
- Postgres 9.6.2
When I set num_cores = 16, I receive the following error immediately upon starting 'clustering...':
When num_cores = 1, the process gets a little further, but not by much:
In my case, I found that the query feeding candidates_gen returns over 20 million rows, and this is where I think things fail. Should this be written to a table rather than relying on memory? This looks like the last hurdle before dedupe can scale to much larger datasets. Memory usage looked great until this point.
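To illustrate what I mean by not relying on memory: a generator along these lines would keep client-side memory bounded by `itersize` no matter how many rows the query produces. This is just a sketch, assuming psycopg2 and a hypothetical `candidate_pairs` table; the names are illustrative, not dedupe's actual internals:

```python
import psycopg2

def candidates_gen(conn):
    """Yield candidate pairs one at a time from a server-side cursor."""
    with conn.cursor(name="candidates") as cur:
        cur.itersize = 10000  # rows fetched per round trip to the server
        cur.execute("SELECT left_id, right_id FROM candidate_pairs")
        for pair in cur:
            yield pair

conn = psycopg2.connect(dbname="dedupe_example")  # hypothetical connection
for left_id, right_id in candidates_gen(conn):
    print(left_id, right_id)  # placeholder: feed each pair downstream
```

As long as everything downstream consumes this lazily, the 20M rows never have to exist in Python memory all at once.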