dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

Big Linking: avoiding in-memory load for sample generation? #48

Closed Leobouloc closed 7 years ago

Leobouloc commented 7 years ago

Unless I am mistaken, all the database examples (including address-matching and dedupe-geocoder) load the entire files in memory to generate samples to label.

For example: dict((i, row) for i, row in enumerate(cur)) : https://github.com/datamade/dedupe-examples/blob/master/pgsql_big_dedupe_example/pgsql_big_dedupe_example.py#L121

In my case the canonical file is a 1 GB CSV file, and my RAM fills up before I can finish loading it.

Do you have code to link (Gazetteer) large files without having to load the entire file in memory (even for sample generation)?

Otherwise, do you think doing sample generation using only a subset of the canonical file would work and what implications would that have?

fgregg commented 7 years ago

Take a random sample of the records, make a dictionary of those, and pass it to sample. You'll also need to use the original_length argument, set to the number of records in the full dataset.
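A minimal sketch of that suggestion, assuming the canonical data is a CSV file with a header row: reservoir sampling reads the file once, keeps only k rows in memory, and counts the total number of rows along the way so it can be supplied as the original length. The function name reservoir_sample_csv and the file name in the usage comment are hypothetical, and the commented-out sample call only mirrors the comment above; the exact signature depends on the dedupe version, so check it against the documentation.

```python
import csv
import random


def reservoir_sample_csv(path, k, seed=None):
    """Stream a CSV once and return (sample, total_rows).

    sample maps the original row index to the row (as a dict) for up to
    k rows chosen uniformly at random; total_rows is the full row count.
    """
    rng = random.Random(seed)
    reservoir = []  # at most k (row_index, row) pairs are held in memory
    total = 0
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            total = i + 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k rows.
                reservoir.append((i, row))
            else:
                # Keep row i with probability k / (i + 1) by overwriting
                # a uniformly chosen slot.
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = (i, row)
    return dict(reservoir), total


# Hypothetical usage (file name and call shape are assumptions):
# sample, n_rows = reservoir_sample_csv("canonical.csv", k=50000)
# deduper.sample(sample, original_length=n_rows)
```

Because the dictionary keys are the original row indices, labeled samples can still be traced back to rows in the full file.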

(Sent by email in reply to Leobouloc's message of Fri, Apr 14, 2017, 6:25 AM.)

fgregg commented 7 years ago

https://dedupe.readthedocs.io/en/latest/API-documentation.html#Dedupe.sample

Leobouloc commented 7 years ago

I had missed that. Thank you!