dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented with different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/
187 stars, 33 forks

add nodejs variant #11

Closed ksmth closed 9 years ago

ksmth commented 9 years ago

Unfortunately, I just noticed that the results differ from the reference_output, but so do the Go version's, so I submitted the PR anyway.

dimroc commented 9 years ago

nice and fast. thanks.

dimroc commented 9 years ago

Hi @ksmth, I notice that on a match, you have the master immediately keep track of the result rather than have the worker keep track of its file's matches before sending that back to the master.

https://github.com/dimroc/etl-language-comparison/blob/master/nodejs/search.js#L43

What this effectively does is skip a large part of the reduction step, namely merging every worker's hash. That makes your implementation faster but breaks consistency across implementations. It also makes it less scalable, though with such a small data set I see why you didn't care.

Could you tweak your implementation so that each worker keeps track of its local hits and then sends the final hash (or object, in JS terminology) back to the cluster master for reduction?

So change this: https://github.com/dimroc/etl-language-comparison/blob/master/nodejs/search.js#L47

to merge hashes like this: https://github.com/dimroc/etl-language-comparison/blob/master/ruby/mapreduce.rb#L53

The original implementation does it this way because it more closely reflects the traditional MapReduce design used in larger-scale systems.
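To make the request concrete, here is a minimal sketch of the shape being asked for: each worker builds its own counts object, and the master only merges the finished objects. It is not the code from `search.js` or `mapreduce.rb`. The function names `countLocal` and `mergeCounts`, the `neighborhood<TAB>message` line format, and the sample data are all assumptions made for illustration, and the workers are simulated with plain arrays rather than real cluster processes.

```javascript
// Hypothetical input format (an assumption): "neighborhood<TAB>message" per line.

// Map step: a worker counts matches for its own file in a local object.
function countLocal(lines, pattern) {
  const counts = {};
  for (const line of lines) {
    const [neighborhood, message] = line.split('\t');
    if (message !== undefined && pattern.test(message)) {
      counts[neighborhood] = (counts[neighborhood] || 0) + 1;
    }
  }
  return counts;
}

// Reduce step: the master merges one worker's object into the running total.
function mergeCounts(total, partial) {
  for (const [neighborhood, count] of Object.entries(partial)) {
    total[neighborhood] = (total[neighborhood] || 0) + count;
  }
  return total;
}

// Two made-up "files", each handled by one simulated worker.
const fileA = ['Harlem\tgo knicks', 'SoHo\tlunch time'];
const fileB = ['Harlem\tKnicks win', 'SoHo\tknicks game tonight'];

const partials = [fileA, fileB].map((lines) => countLocal(lines, /knicks/i));
const totals = partials.reduce(mergeCounts, {});
console.log(totals);
```

With Node's `cluster` module, the worker would presumably send its finished object once with `process.send(counts)`, and the master would call something like `mergeCounts` from its `worker.on('message', ...)` handler, so the master receives one message per file instead of one per match.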