dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

Standardize algorithms #24

Open fervic opened 8 years ago

fervic commented 8 years ago

I see that contributions have taken different approaches to solving the same problem, so in the end the benchmark is not comparing the languages themselves.

My suggestion would be to set a guideline for contributing which explains the standard approach, like:

Maybe also allow submitting a non-standard approach that takes advantage of specific language features but keep that one marked as the special one.

So in the end there would be two sets of solutions: (1) the standard one that follows the rules and (2) the optimized or non-standard one.

dimroc commented 8 years ago

That's a fantastic suggestion, @fervic. I had similar thoughts that I was going to bring up in my next blog post. Here's what I was going to suggest.

Rules of Reference Implementation

  1. Stream input from files.
  2. Use Regular Expressions to check for the presence of knicks.
  3. Have multiple mappers, but one reducer.
  4. Each individual worker holds its results in a hash and sends that final hash back for reduction.
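To make the four rules concrete, here is a rough Python sketch of what a reference implementation could look like. It is only an illustration, not code from the repo: the input format (one tweet per line as `<neighborhood>\t<tweet text>`), the names `map_file`, `reduce_counts`, and `run`, and the default of 4 workers are all assumptions made for the example.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumption: each input line looks like "<neighborhood>\t<tweet text>".
# The real dataset layout may differ.
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def map_file(path):
    """Mapper: stream one file and count matching tweets per neighborhood."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # Rule 1: read line by line instead of loading the file
            neighborhood, _, text = line.rstrip("\n").partition("\t")
            if KNICKS.search(text):  # Rule 2: regex check for "knicks"
                counts[neighborhood] += 1
    return counts  # Rule 4: the worker's hash is sent back once, at the end


def reduce_counts(partials):
    """Reducer: merge every per-worker hash in a single place (rule 3)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(paths, workers=4):
    """Fan the files out to several mappers, then reduce in the parent."""
    with Pool(processes=workers) as pool:
        return reduce_counts(pool.imap_unordered(map_file, paths))
```

Calling `run(list_of_paths)` from under an `if __name__ == "__main__":` guard returns one `Counter` keyed by neighborhood. In this sketch the `workers` argument also caps the number of mapper processes, though other languages and frameworks may not expose an equivalent knob so directly.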

One suggestion you made that I didn't have was to limit the number of workers/threads, but that's not always simple, depending on the language and framework. Any other suggestions?