Original comment by lar...@gmail.com on 4 Nov 2011 at 10:15
Any idea how you would map the functionality to the map/reduce programming model?
Other than that, I can see a big problem when trying to do quick lookups in data stored in HDFS; as far as I know, the Lucene support for files in HDFS is not really that good yet.
That said, it would be amazing to be able to use Duke in a Hadoop cluster, as the deduplication problem is even trickier with really big datasets.
Original comment by phleg...@gmail.com on 5 Jun 2013 at 3:45
Basically, what you'd have to do is use a blocking scheme: create a key from each record such that similar records get the same key. The mapper then goes Record -> (key, Record), and the reducer goes (key, [Record1, Record2, Record3]) -> matching record pairs.
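Sketched in Hadoop's Java MapReduce API, that shape might look like the following. The `blockingKey()` and `matches()` helpers are hypothetical placeholders (a prefix key and near-exact comparison); a real job would plug in a proper blocking function and Duke's own comparators.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockedDedup {

  // Mapper: Record -> (blockingKey, Record), so candidate matches share a key.
  public static class BlockingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(blockingKey(line.toString())), line);
    }

    // Hypothetical blocking key: a normalized prefix of the first CSV field.
    // A real scheme might use a phonetic code or an automatically learned key.
    private static String blockingKey(String record) {
      String first = record.split(",", 2)[0].trim().toLowerCase();
      return first.length() > 4 ? first.substring(0, 4) : first;
    }
  }

  // Reducer: (key, [Record1, Record2, ...]) -> matching record pairs,
  // comparing all pairs within the block only.
  public static class BlockingReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      List<String> block = new ArrayList<>();
      for (Text r : records)
        block.add(r.toString()); // copy: Hadoop reuses the Text objects
      for (int i = 0; i < block.size(); i++)
        for (int j = i + 1; j < block.size(); j++)
          if (matches(block.get(i), block.get(j)))
            ctx.write(new Text(block.get(i)), new Text(block.get(j)));
    }

    // Placeholder comparator; in a real job this would delegate to
    // Duke's similarity functions rather than string equality.
    private static boolean matches(String a, String b) {
      return a.equalsIgnoreCase(b);
    }
  }
}
```

Note that the reducer still does an all-pairs comparison within each block, so key quality matters: keys that are too coarse produce huge blocks and bring back the quadratic cost, while keys that are too fine split true matches into different blocks.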
I'm thinking of doing this, but first need to review the research literature on automatically creating blocking keys. Right now, I'm focusing elsewhere.
Original comment by lar...@gmail.com on 5 Jun 2013 at 3:52
Original issue reported on code.google.com by lar...@gmail.com on 4 Sep 2011 at 1:25