RevolutionAnalytics / RHadoop

RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

advanced equijoin #9

Open piccolbo opened 13 years ago

piccolbo commented 13 years ago

The equijoin currently in dev is the basic one. It doesn't scale well when one key is predominant, and it doesn't exploit special cases such as one side having a small number of records. There is a lot of work that could go into a better join feature.
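To make the "one side is small" case concrete, here is a minimal sketch of a map-side (broadcast) join in plain Python. It is not rmr code and not a proposed implementation: the function names `build_lookup` and `map_side_join` are hypothetical, and it assumes the small side fits in the memory of every mapper.

```python
from collections import defaultdict


def build_lookup(small_side):
    """Index the small side in memory: key -> list of values."""
    lookup = defaultdict(list)
    for key, value in small_side:
        lookup[key].append(value)
    return lookup


def map_side_join(large_side, lookup):
    """Map-only equijoin: each large-side record probes the in-memory index.

    No shuffle or reduce phase is needed, so a predominant key on the
    large side does not pile up on a single reducer.
    """
    for key, left_value in large_side:
        for right_value in lookup.get(key, ()):
            yield key, left_value, right_value


# Toy data standing in for the two inputs (hypothetical).
small = [("a", 1), ("b", 2)]
large = [("a", "x"), ("a", "y"), ("c", "z")]
joined = list(map_side_join(large, build_lookup(small)))
```

In a real job the lookup table would have to be shipped to every mapper, which is exactly why this only pays off when one side is small.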

piccolbo commented 13 years ago

See the paper "Processing Theta-Joins using MapReduce" (Okcan and Riedewald, SIGMOD 2011).

piccolbo commented 12 years ago

One possible technique is to run a preliminary job that builds a Bloom filter from the keys of one or both sides of the join, then perform the join using that Bloom filter as a filter in the map phase, so records whose keys cannot match are dropped before the shuffle. This adds jobs but moves work to the map side (reportedly faster in real instances).
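A rough sketch of the idea in plain Python, outside Hadoop, might look like the following. Everything here is an assumption for illustration: the `BloomFilter` class, `build_filter`, and `filtering_map` are hypothetical names, the filter size and hash count are arbitrary, and the two loops only stand in for the preliminary job and the map phase.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def build_filter(records):
    """Stand-in for the preliminary job: collect one side's keys."""
    bloom = BloomFilter()
    for key, _ in records:
        bloom.add(key)
    return bloom


def filtering_map(records, bloom):
    """Stand-in for the map phase: drop records whose key cannot match."""
    for key, value in records:
        if key in bloom:
            yield key, value


# Toy data standing in for the two inputs (hypothetical).
left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("c", "y"), ("b", "z"), ("d", "w")]
bloom = build_filter(left)
survivors = list(filtering_map(right, bloom))
```

Because a Bloom filter can return false positives, the surviving records still have to go through the ordinary join; the filter only reduces how much data reaches the shuffle.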