dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented with different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

Add shell implementation #21

Closed mganss closed 8 years ago

mganss commented 8 years ago

Uses GNU Parallel. Time on my machine (Cygwin): 1.0s :astonished: (compare results in #20)
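
For readers following along, here is a minimal sketch of what a grep-plus-GNU-Parallel version of this job could look like. It is not the script from this PR. The function name `count_by_hood`, the directory argument, and the input layout (tab-separated files with the neighborhood in column 2) are assumptions made for illustration. When GNU Parallel is not installed, the sketch falls back to a plain sequential grep.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed (hypothetical) input: tab-separated files with the
# neighborhood in column 2 and the tweet text in a later column.

# count_by_hood WORD DIR: print "neighborhood<TAB>count", highest count first.
count_by_hood() {
  local word=$1
  local dir=$2
  local tab
  tab=$(printf '\t')

  # Map: keep only lines containing WORD (fixed string, case-insensitive).
  # Note that grep matches anywhere on the line, not just in the text column.
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # One grep job per file; GNU Parallel groups each job's output by default.
    find "$dir" -type f -print0 | parallel -0 -q grep -hiF -e "$word" {}
  else
    # Fallback when GNU Parallel is not installed: a plain sequential grep.
    find "$dir" -type f -exec grep -hiF -e "$word" {} +
  fi |
    # Reduce: tally matching lines per neighborhood, then sort by count.
    awk -F '\t' '{ n[$2]++ } END { for (h in n) print h "\t" n[h] }' |
    sort -t "$tab" -k2,2nr -k1,1
}
```

A call such as `count_by_hood knicks tmp/tweets` (path hypothetical) would print one line per neighborhood. Keeping the map step as a bare `grep` is deliberate: the filter is where most of the time goes, so it is also the part worth swapping or parallelizing.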

dimroc commented 8 years ago

You have a much faster computer than me, friend. I get ~9.2s.

Thanks for the PR!

mganss commented 8 years ago

A factor close to 10 is weird. For one thing, the results I posted in #20 don't differ substantially from the ones in your second blog post, and your hardware doesn't seem radically different from mine (Xeon E3-1230v2, i.e. a 2012 CPU). I also thought that Cygwin would have to be slower than a "native" UNIX. Maybe it's because of this: "GNU grep is 10x faster than Mac grep".

dimroc commented 8 years ago

Hi again @mganss, could you also add a README.md to the shell/ folder with the information in this comment?

dimroc commented 8 years ago

After your last comment, I went ahead and replaced grep with ag ([the silver searcher](https://github.com/ggreer/the_silver_searcher)) and it blazed at 2.6s. Much faster than OSX grep but not quite as fast as your result. Thanks for the recommendation.

mganss commented 8 years ago

Have you also tried GNU grep? Is ag faster?

dimroc commented 8 years ago

I haven't. I was an ack guy, but have since switched to ag. Those are the two popular ones on Macs.

mganss commented 8 years ago

I'm getting 2.0s with ag -Fi and GNU Parallel. 5.3s without GNU Parallel, although ag claims to search files in parallel and use multicore.
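
To reproduce this kind of comparison, a small timing helper along the following lines may be handy. It is only a sketch: `time_it` is a made-up name, it relies on bash's `time` keyword and `TIMEFORMAT`, and it measures a single run, so results will vary with disk cache state.

```shell
#!/usr/bin/env bash
# time_it LABEL CMD [ARGS...]
# Run CMD once, discard its output, and print "LABEL<TAB>elapsed-seconds".
time_it() {
  local label=$1
  shift
  local TIMEFORMAT=%R   # make bash's `time` keyword print only real seconds
  local secs
  # `time` reports on the brace group's stderr; the command's own output
  # and errors go to /dev/null so only the timing is captured.
  secs=$( { time "$@" >/dev/null 2>&1; } 2>&1 )
  printf '%s\t%s\n' "$label" "$secs"
}
```

Example calls, with a hypothetical data path: `time_it grep grep -rhiF knicks tmp/tweets` and `time_it ag ag -Fi knicks tmp/tweets`. Running each command a few times and comparing the later runs gives a fairer picture than a single cold run.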

dimroc commented 8 years ago

Can we add all that to the readme?

On Tuesday, November 17, 2015, Michael Ganss notifications@github.com wrote:

> I'm getting 2.0s with ag -Fi and GNU Parallel. 5.3s without GNU Parallel, although ag claims to search files in parallel and use multicore.
