dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

RecordLink.blocker and Gazetteer.blocker create a huge number of blocks #578

Closed: ofershar closed this issue 7 years ago

ofershar commented 7 years ago

Both RecordLink.blocker and Gazetteer.blocker produce a huge number of blocks, which makes them practically unusable.
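For context, in the dedupe API of that era (1.x), a trained matcher exposes a blocker that is called on an iterable of (record_id, record) pairs and yields one (block_key, record_id) tuple for every blocking predicate that covers a record; those tuples are the block affiliations counted below. A minimal sketch, where `linker` (a trained RecordLink) and the `donors` dict are hypothetical placeholder names, not from the attached example:

```python
from itertools import islice

# `linker` is a trained dedupe.RecordLink and `donors` maps record_id -> record
# (both names are placeholders, not taken from the attached example).
# The blocker yields one (block_key, record_id) pair per predicate that
# covers a record, so a single record can appear under many block keys.
for block_key, record_id in islice(linker.blocker(donors.items()), 10):
    print(block_key, record_id)
```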

I'm using a dedupe version downloaded in May 2017 (I'm not sure of the exact version number). The Python version is 3.5.3 and the server runs RHEL 6.5.

Code for reproducing the problem is attached; it's based on pgsql_big_dedupe_example. See the "instructions.txt" file for setup steps.

Blocker problem.zip

It demonstrates that RecordLink.blocker generates, on average, over 200 block affiliations per donor, while Gazetteer.blocker generates far more still. By comparison, Dedupe.blocker generates only about 1.5 block affiliations per donor.
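The "affiliations per donor" figure can be estimated by counting how many (block_key, record_id) pairs the blocker emits for each record. A rough sketch, using the same hypothetical `linker` and `donors` names as above:

```python
from collections import defaultdict

# Count how many block keys each donor is affiliated with
# (`linker` and `donors` are the same hypothetical names as above).
counts = defaultdict(int)
for block_key, record_id in linker.blocker(donors.items()):
    counts[record_id] += 1

avg = sum(counts.values()) / len(donors)
print("average block affiliations per donor: {:.1f}".format(avg))
```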

A few notes:

  1. They all use the same JSON training file, with about 50 "yes" and 50 "no" records.

  2. Since pgsql_big_dedupe_example contains a single list of donors rather than two lists, the sampling is done by using the same list twice. I know it doesn't make much sense to do record linking between two identical data sets, but I think it's still a valid example for testing the blocking. Originally, I encountered this problem using two different data sets.

  3. Dedupe.blocker.index_fields contains the "address" field, so that field is indexed (using Dedupe.blocker.index). However, RecordLink.blocker.index_fields contains no fields (even though the same JSON training file is used), so no indexing is done. Gazetteer.blocker.index_fields differs from both, containing only the "name" field.
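On the indexing point: in that dedupe API, any field that uses index predicates shows up in blocker.index_fields and must be fed its distinct values via blocker.index(values, field) before blocking, which is what pgsql_big_dedupe_example does for "address". A minimal sketch, where `matcher` and `donors` are hypothetical names as in the sketches above:

```python
# Index predicates only work once the blocker has seen the field's values
# (`matcher` and `donors` are hypothetical, as in the sketches above).
for field in matcher.blocker.index_fields:
    distinct_values = {record[field] for record in donors.values()}
    matcher.blocker.index(distinct_values, field)
```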

tendres commented 7 years ago

@ofershar I just ran your blocker_problem.py and came back to a 34.1GB blocks.csv - you said yours was ~2.9GB. Just to confirm, do you have 706,030 records in processed_donors?
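(For anyone reproducing this, the row count can be checked directly against the example's Postgres database. A one-line sketch; the connection settings are an assumption, so use whatever pgsql_big_dedupe_example set up.)

```python
import psycopg2

# dbname is an assumption; match your pgsql_big_dedupe_example setup.
conn = psycopg2.connect(dbname="dedupe_example")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM processed_donors")
print(cur.fetchone()[0])  # ofershar reports 706,030 rows
```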

ofershar commented 7 years ago

Hi, sorry for the late response; I didn't notice your question before. Yes, I have exactly 706,030 records in processed_donors, which I created using the pgsql_big_dedupe_example from GitHub. My CSV didn't get THAT big using RecordLink, but it did when I used Gazetteer.

Ofer

fgregg commented 7 years ago

closed by 99d35863b52a72b57725c87f74523e87bf9b04fc