Open piccolbo opened 13 years ago
See the paper "Processing Theta-Joins using MapReduce".
One possible technique is to run a preliminary job that builds a Bloom filter over the keys of one or both sides of the join, then perform the join using the Bloom filter as, indeed, a filter in the map phase. This adds a job but moves work to the map side, which is reportedly faster in real instances.
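To make the idea concrete, here is a rough single-process sketch in Python, not rmr code. Everything in it is hypothetical and for illustration only: the `BloomFilter` class, the `build_filter`, `map_phase`, `reduce_join` and `bloom_join` names, the filter size and the SHA-256 based hashing. It assumes records are `(key, value)` pairs and that the filter is built from the left side only.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are arbitrary illustration values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_filter(records):
    """Preliminary job: collect the join keys of one side into a filter."""
    bloom = BloomFilter()
    for key, _ in records:
        bloom.add(key)
    return bloom


def map_phase(left, right, bloom):
    """Map phase: tag records, dropping right-side keys the filter rejects."""
    for key, value in left:
        yield key, ("L", value)
    for key, value in right:
        if key in bloom:  # most non-matching records stop here
            yield key, ("R", value)


def reduce_join(tagged):
    """Reduce phase: group by key and pair left values with right values."""
    groups = {}
    for key, (tag, value) in tagged:
        groups.setdefault(key, {"L": [], "R": []})[tag].append(value)
    for key, sides in groups.items():
        for left_value in sides["L"]:
            for right_value in sides["R"]:
                yield key, left_value, right_value


def bloom_join(left, right):
    bloom = build_filter(left)
    return sorted(reduce_join(map_phase(left, right, bloom)))
```

False positives from the filter only cost some extra shuffle traffic: a right-side record that slips through finds no left-side partner in the reduce phase, so the join result is unchanged.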
The equijoin currently in dev is the basic one. It doesn't scale well when one key is predominant, and it doesn't exploit special cases such as one side having a small number of records. There is a lot of work that could go into a better join feature.
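As an illustration of the small-side special case, a map-only join could look roughly like the Python sketch below. This is not the current implementation; `map_side_join` is a hypothetical name, and the sketch assumes the small side fits in memory on every mapper and that records are `(key, value)` pairs.

```python
def map_side_join(small, large):
    """Join by loading the small side into a lookup table, no reduce phase."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Each mapper would stream its split of the large side past the table.
    for key, value in large:
        for small_value in lookup.get(key, ()):
            yield key, small_value, value
```

Because no records are shuffled by key, a predominant key on the large side does not pile up on a single reducer in this variant.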