dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License

Improve RecordLink blocking documentation #601

Closed: potash closed this issue 7 years ago

potash commented 7 years ago

Can you please provide more documentation on RecordLink.blocker() and related methods?

Looking at the source, it seems that what happens is:

1) In _blockData, data_2 gets indexed and then blocked with target=True.
2) In _blockGenerator, you block data_1 (which is confusingly referred to as messy_data, even though in this context both datasets are clean) with target=False.
3) You generate blocks containing one record from data_1 and all records of data_2 that share any of its block keys.
4) You seem to manually assign an empty set to all of the covered blocks sets for data_2 records.
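
To make sure I am reading steps 1 to 3 correctly, here is a minimal pure-Python sketch of the flow as I understand it. It does not use dedupe's code at all: block_keys, generate_blocks, the two predicates and the sample records are hypothetical names I made up for illustration, and the covered blocks sets from step 4 are left out entirely.

```python
from collections import defaultdict


def block_keys(record, predicates):
    """Return the set of block keys the predicates produce for one record."""
    keys = set()
    for name, predicate in predicates.items():
        for value in predicate(record):
            keys.add(f"{value}:{name}")
    return keys


def generate_blocks(data_1, data_2, predicates):
    """Yield (data_1 id, sorted data_2 ids) for records sharing any block key."""
    # Step 1: index every data_2 record under each of its block keys.
    index = defaultdict(set)
    for id_2, record in data_2.items():
        for key in block_keys(record, predicates):
            index[key].add(id_2)

    # Steps 2 and 3: block each data_1 record, then gather every data_2
    # record that shares at least one block key with it.
    for id_1, record in data_1.items():
        matches = set()
        for key in block_keys(record, predicates):
            matches |= index.get(key, set())
        if matches:
            yield id_1, sorted(matches)


# Hypothetical predicates: each maps a record to zero or more key values.
predicates = {
    "first_letter_of_name": lambda r: [r["name"][:1].lower()] if r["name"] else [],
    "zip": lambda r: [r["zip"]] if r["zip"] else [],
}

data_1 = {
    "a1": {"name": "Acme Corp", "zip": "60601"},
    "a2": {"name": "Zenith", "zip": "10001"},
}
data_2 = {
    "b1": {"name": "ACME Corporation", "zip": "60601"},
    "b2": {"name": "Apex", "zip": "94110"},
    "b3": {"name": "Bolt", "zip": "73301"},
}

blocks = dict(generate_blocks(data_1, data_2, predicates))
print(blocks)
```

In this toy version, a data_1 record with no shared key (a2 here) produces no block at all, and a data_2 record can appear in the blocks of several data_1 records. Whether the real implementation behaves the same way is exactly what I am asking about.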

My concrete questions are:

A) What does the target argument to blocker() actually do?
B) Why don't you have to call self.blocker.indexAll on data_1 (a.k.a. messy_data) before blocking it, like you do with data_2?
C) Why are the covered block sets empty? Couldn't a data_2 record have appeared in an earlier block? I'm going by the matchBlocks() documentation for my understanding of the covered blocks set.
D) Gazetteer seems to inherit all of this. Can you point me to where the Gazetteer logic differs from RecordLink to allow multiple records from messy_data to match one record from data_2?

Thanks!

potash commented 7 years ago

And a higher-level question:

E) Are the results of RecordLink invariant under swapping data_2 and data_1? It seems like in principle they should be, but I wonder whether their different roles in blocking affect that.

fgregg commented 7 years ago

It sounds like you are asking for documentation on the internal implementation of methods and classes.

While we do want to document the public API, these internal details are not something we want to expose and document.

If there's some particular problem that you are having, please open a separate issue, and I can try to point you to the relevant part of the code.

potash commented 7 years ago

OK, I opened #602 with the pieces that pertain to the public API.

FYI, the reason I want to understand the private methods better is that I am writing an SQL version of RecordLink (cf. https://github.com/dedupeio/dedupe-examples/issues/23).