datatonic / duke

Automatically exported from code.google.com/p/duke

Support for Hadoop processing #36

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Longer-term we should be able to farm out processing work to Hadoop clusters.

Original issue reported on code.google.com by lar...@gmail.com on 4 Sep 2011 at 1:25

GoogleCodeExporter commented 8 years ago

Original comment by lar...@gmail.com on 4 Nov 2011 at 10:15

GoogleCodeExporter commented 8 years ago
Any idea how you would map the functionality onto the map/reduce programming model?

Other than that, I can see a big problem with doing quick lookups in data stored in HDFS; as far as I know, Lucene support for files in HDFS is not really that good yet.

That said, it would be amazing to be able to use Duke on a Hadoop cluster, as the deduplication problem gets even trickier with really big datasets.

Original comment by phleg...@gmail.com on 5 Jun 2013 at 3:45

GoogleCodeExporter commented 8 years ago
Basically, what you'd have to do is use a blocking scheme: create a key from each record such that similar records get the same key. The mapper then maps Record -> (key, Record), and the reducer maps (key, [Record1, Record2, Record3]) -> matching record pairs.
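
As a rough illustration of that split, here is a minimal sketch in plain Hadoop MapReduce. The CSV input layout, the three-letter-prefix blocking key, and the `similarity` stub are assumptions made for the example, not anything Duke actually ships:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingDedup {

  // Mapper: Record -> (key, Record). The blocking key here is just the
  // first three letters of a hypothetical "name" field in a CSV line;
  // any function under which near-duplicates collide would do.
  public static class BlockingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length < 2)
        return; // skip malformed lines in this sketch
      String name = fields[1].trim().toLowerCase();
      String key = name.substring(0, Math.min(3, name.length()));
      context.write(new Text(key), line);
    }
  }

  // Reducer: (key, [Record1, Record2, ...]) -> matching record pairs.
  // Compares all pairs within one block, so blocks must stay small.
  public static class PairwiseReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      List<String> block = new ArrayList<>();
      for (Text r : records)
        block.add(r.toString()); // copy: Hadoop reuses the Text instance
      for (int i = 0; i < block.size(); i++)
        for (int j = i + 1; j < block.size(); j++)
          if (similarity(block.get(i), block.get(j)) > 0.9)
            context.write(new Text(block.get(i)), new Text(block.get(j)));
    }

    // Stand-in for a real comparator, e.g. Duke's configured similarity.
    private double similarity(String a, String b) {
      return a.equals(b) ? 1.0 : 0.0;
    }
  }
}
```

A real job would additionally need a driver class to wire these into a `Job`, and a smarter blocking key to keep blocks small without losing matches.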

I'm thinking of doing this, but need to review the research literature on 
creating blocking keys automatically first. Right now, I'm focusing elsewhere.

Original comment by lar...@gmail.com on 5 Jun 2013 at 3:52