ksmth closed 9 years ago
nice and fast. thanks.
Hi @ksmth, I notice that on a match, you have the master immediately keep track of the result rather than have the worker keep track of its file's matches before sending that back to the master.
https://github.com/dimroc/etl-language-comparison/blob/master/nodejs/search.js#L43
What this effectively does is skip a large part of the reduction step, which involves merging each worker's hash. That makes your implementation faster but breaks consistency across implementations. It also makes it less scalable, but with such a small data set I see why you didn't care.
Could you tweak your implementation to have each worker keep track of its local hits and then send the final hash (or object in js terminology) back to the cluster master for reduction?
So change this: https://github.com/dimroc/etl-language-comparison/blob/master/nodejs/search.js#L47
to merge hashes like this: https://github.com/dimroc/etl-language-comparison/blob/master/ruby/mapreduce.rb#L53
The reason for doing it like this in the original implementation is that it more closely reflects the traditional MapReduce design used in larger-scale systems.
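To make the request concrete, here is a rough sketch of the shape I have in mind. It is not taken from `search.js`: the names `countLocalHits` and `mergeHits`, the tab-separated sample lines, and the key extractor are all made up for illustration. Each worker tallies its own file into a plain object, and the master only folds those per-worker objects together.

```javascript
// Hypothetical helpers, not the actual search.js code.

// Map step (would run inside each worker): tally matches for one file locally.
function countLocalHits(lines, pattern, keyOf) {
  const hits = {};
  for (const line of lines) {
    if (pattern.test(line)) {
      const key = keyOf(line);
      hits[key] = (hits[key] || 0) + 1;
    }
  }
  return hits;
}

// Reduce step (would run in the master): fold one worker's tally into the total.
function mergeHits(total, partial) {
  for (const key of Object.keys(partial)) {
    total[key] = (total[key] || 0) + partial[key];
  }
  return total;
}

// Made-up sample data standing in for two workers' input files.
const keyOf = (line) => line.split('\t')[0];
const workerA = countLocalHits(
  ['soho\tknicks win', 'soho\tlunch', 'harlem\tKnicks!'], /knicks/i, keyOf);
const workerB = countLocalHits(
  ['soho\tgo knicks', 'bronx\tknicks game'], /knicks/i, keyOf);

// The master reduces the per-worker objects into one result.
const total = [workerA, workerB].reduce(mergeHits, {});
console.log(total);
```

With Node's `cluster` module, the worker would presumably call `process.send(hits)` once after finishing its file, and the master would call `mergeHits` from its `worker.on('message', ...)` handler, so only one message per worker crosses the process boundary.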
Unfortunately, I just noticed that the results differ from the reference_output, but so does the Go version, so I submitted the PR anyway.