potash closed this issue 7 years ago
And a higher-level question:
E) Are the results of `RecordLink` invariant under swapping `data_2` and `data_1`? It seems like in principle they should be, but I wonder whether their different roles in blocking affect that.
It sounds like you are asking for documentation on the internal implementation of methods and classes.
While we do want to document the public API, these internal details are not something we want to expose and document.
If there's some particular problem that you are having please open up a separate issue, and I can try to point you to the relevant part of the code.
Ok I opened #602 with the pieces that pertain to the public API.
FYI, the reason I want to understand the private methods better is that I am writing an SQL version of RecordLink (cf. https://github.com/dedupeio/dedupe-examples/issues/23).
Can you please provide more documentation on `RecordLink.blocker()` and related methods?

Looking at the source, it seems that what happens is:
1) In `_blockData`, `data_2` gets indexed and then blocked with `target=True`.
2) In `_blockGenerator` you block `data_1` (which is confusingly referred to as `messy_data` even though in this context both datasets are clean) with `target=False`.
3) You generate blocks containing one record from `data_1` and all records of `data_2` that share any of its block keys.
4) You seem to manually assign an empty set to all of the covered block sets for `data_2` records.

My concrete questions are:
A) What does the `target` argument to `blocker()` actually do?
B) Why don't you have to do `self.blocker.indexAll` on `data_1` a.k.a. `messy_data` before blocking it, like you do with `data_2`?
C) Why are the covered block sets empty? Couldn't a `data_2` record have appeared in an earlier block? I'm looking at the `matchBlocks()` documentation for my understanding of the covered blocks set.
D) Gazetteer seems to inherit all of this. Can you point me to where the Gazetteer logic differs from RecordLink to allow for multiple records from `messy_data` to match one record from `data_2`?

Thanks!
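
To make steps 1) to 3) of the question concrete, here is a minimal toy sketch of that flow as I read it: index `data_2` by block key, then look up each `data_1` record against the index. This is not dedupe's code. `block_keys`, `build_index`, `generate_blocks`, the predicates, and the sample records are all hypothetical names and data made up for illustration.

```python
from collections import defaultdict


def block_keys(record):
    """Hypothetical predicates: first three letters of the name, and the zip code."""
    return {("name3", record["name"][:3].lower()), ("zip", record["zip"])}


def build_index(data_2):
    """Step 1 (toy version): map each block key to the data_2 ids that carry it."""
    index = defaultdict(set)
    for record_id, record in data_2.items():
        for key in block_keys(record):
            index[key].add(record_id)
    return index


def generate_blocks(data_1, index):
    """Steps 2-3 (toy version): one block per data_1 record, holding every
    data_2 id that shares at least one block key with it."""
    for record_id, record in data_1.items():
        candidates = set()
        for key in block_keys(record):
            # .get avoids inserting empty entries into the defaultdict
            candidates |= index.get(key, set())
        if candidates:
            yield record_id, sorted(candidates)


# Made-up sample data, not taken from the thread.
data_1 = {
    "a1": {"name": "Alice Smith", "zip": "60601"},
    "a2": {"name": "Bob Jones", "zip": "10001"},
}
data_2 = {
    "b1": {"name": "Alicia Smith", "zip": "60614"},
    "b2": {"name": "Robert Jones", "zip": "10001"},
    "b3": {"name": "Carol White", "zip": "94110"},
}

blocks = dict(generate_blocks(data_1, build_index(data_2)))
print(blocks)
```

In this toy version each `data_1` record appears in exactly one block, so no candidate pair is ever emitted twice. Whether that is the reason the covered block sets can be left empty in dedupe itself is an assumption, which is what question C) asks.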