Closed Leobouloc closed 7 years ago
Take a random sample of the records, make a dictionary of those, and pass it to sample. You'll also need to use the original_length argument so that dedupe knows the size of the full dataset the sample was drawn from.
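In case it helps, here is a minimal sketch of that suggestion, assuming the canonical data is a CSV file with a header row. It uses reservoir sampling, so the file is streamed once and only the sampled rows are held in memory. The function name reservoir_sample_csv and the file name in the usage comment are hypothetical, and the exact sample signature depends on your dedupe version, so check it before relying on the commented call.

```python
import csv
import random


def reservoir_sample_csv(path, k, seed=None):
    """Stream a CSV once and return (sample, n_rows).

    sample is a dict mapping the original row index to the row (as a dict),
    holding at most k uniformly chosen rows; n_rows is the total row count,
    which is the value to pass as the original length.
    """
    rng = random.Random(seed)
    reservoir = []  # list of (row_index, row) pairs, never longer than k
    n_rows = 0
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            n_rows = i + 1
            if i < k:
                # Fill the reservoir with the first k rows.
                reservoir.append((i, row))
            else:
                # Keep row i with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = (i, row)
    return dict(reservoir), n_rows


# Hypothetical usage (file name and sample size are placeholders):
# canonical_sample, n_canonical = reservoir_sample_csv("canonical.csv", 15000)
# Then pass canonical_sample to sample, with n_canonical as the
# original-length argument for the canonical dataset.
```

Keeping the original row index as the dictionary key means the sampled records keep the same ids they would have had if the whole file had been loaded with enumerate.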
On Fri, Apr 14, 2017 at 6:25 AM, Leobouloc notifications@github.com wrote:
Unless I am mistaken, all the database examples (including address-matching https://github.com/datamade/address-matching/tree/sqlclass and dedupe-geocoder https://github.com/datamade/dedupe-geocoder) load the entire files in memory to generate samples to label.
For example: dict((i, row) for i, row in enumerate(cur)) : https://github.com/datamade/dedupe-examples/blob/master/pgsql_big_dedupe_example/pgsql_big_dedupe_example.py#L121
In my case the canonical file is a 1GB csv file and my RAM fills up before I can load the file.
Do you have code to link (Gazetteer) large files without having to load the entire file in memory (even for sample generation)?
Otherwise, do you think doing sample generation using only a subset of the canonical file would work and what implications would that have?
I had missed that. Thank you!