dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/
186 stars 33 forks

add haskell implementations #34

Closed tippenein closed 6 years ago

tippenein commented 8 years ago

Adds 2 implementations

  1. using a bytestring indice search
  2. with regex (regex-tdfa)
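To illustrate the first approach, here is a minimal sketch of a substring search over strict ByteStrings. It is not the code from this PR: `countMentions`, the sample tweets, and the choice to count each tweet at most once with a case-sensitive match are all assumptions made for the example.

```haskell
module Main where

import qualified Data.ByteString.Char8 as B

-- | Count how many lines contain the needle at least once.
-- Hypothetical helper: counts a tweet once even if the word repeats,
-- and matches case-sensitively (both are assumptions).
countMentions :: B.ByteString -> [B.ByteString] -> Int
countMentions needle = length . filter (B.isInfixOf needle)

main :: IO ()
main = do
  -- Made-up sample input, not data from the benchmark.
  let tweets = map B.pack ["go knicks", "nets tonight", "knicks win, knicks win"]
  print (countMentions (B.pack "knicks") tweets) -- prints 2
```

A plain substring scan like this avoids compiling and running a regex for every line, which is one plausible reason for the gap between the two timings.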

The results are included in benchmark.prof and on my machine are:

regex - 7.652 s (7.358 s .. 7.819 s)
indice - 2.887 s (2.830 s .. 2.919 s)

The Makefile includes the commands needed to run these benchmarks yourself. (make benchmark)

Thanks to @Gabriel439 for the majority of this implementation - here

edit: Added memory command to makefile to show memory usage.

indice: 28 MB total memory in use
regex: 9 MB total memory in use

dimroc commented 8 years ago

Thanks for the contribution guys. I especially love the memory consumption data. I've been hoping to do that on quite a few other implementations.

Two things to note:

  1. Output isn't sorted. All other implementations sort the output in descending order, with highest matching neighborhood being at the top. This does affect benchmarks.
  2. regex results seem off. According to tmp/haskell_regex_results.txt, park-slope-gowanus mentioned the knicks 119016 times. The regex results don't match the index result. haskell_indice_results.txt shows park-slope-gowanus has 258 matches which is correct.
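For the sorting point, something along these lines would put the highest count first. This is only a sketch: `sortDescending`, the tie-break on neighborhood name, the tab-separated output, and every count other than the 258 quoted above are assumptions, not the format the other implementations are known to use.

```haskell
module Main where

import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- | Sort (neighborhood, count) pairs by count, highest first.
-- Ties are broken alphabetically by name (an assumption) so that the
-- output is deterministic and can be diffed against other runs.
sortDescending :: [(String, Int)] -> [(String, Int)]
sortDescending = sortBy (comparing (Down . snd) <> comparing fst)

main :: IO ()
main = do
  -- Only the park-slope-gowanus count comes from this thread;
  -- the other rows are hypothetical placeholders.
  let counts =
        [ ("example-neighborhood-a", 12)
        , ("park-slope-gowanus", 258)
        , ("example-neighborhood-b", 12)
        ]
  mapM_ (\(name, n) -> putStrLn (name ++ "\t" ++ show n)) (sortDescending counts)
```

With a stable order on ties, two result files can be compared with a plain diff.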

Feel free to run one of the other implementations and compare results. Once you have the output sorted, you can just diff against the other outputs. I've attached the regex result below for you to see what I see.

haskell_regex_results.txt

Looking forward to the next commit.

tippenein commented 7 years ago

The regex results were wrong because I treated any Right result from the Either as a match instead of checking the Maybe inside the Right. Fixed in 691d50f.
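A minimal sketch of that kind of bug, using only base. `runMatcher` is a hypothetical stand-in with the same Either-of-Maybe shape described above, not the regex library call from the commit, and the sample tweets are made up.

```haskell
module Main where

import Data.List (isInfixOf)

-- | Hypothetical matcher: Left means the matcher itself failed,
-- Right Nothing means it ran fine but found no match,
-- Right (Just m) means it found a match.
runMatcher :: String -> String -> Either String (Maybe String)
runMatcher needle line
  | null needle = Left "empty pattern"
  | needle `isInfixOf` line = Right (Just needle)
  | otherwise = Right Nothing

-- | Buggy check: every successful run counts as a match,
-- including Right Nothing.
matchesBuggy :: String -> String -> Bool
matchesBuggy needle line = case runMatcher needle line of
  Right _ -> True
  Left _ -> False

-- | Fixed check: only Right (Just _) counts as a match.
matchesFixed :: String -> String -> Bool
matchesFixed needle line = case runMatcher needle line of
  Right (Just _) -> True
  _ -> False

main :: IO ()
main = do
  let tweets = ["go knicks", "nets tonight", "rangers game"]
  print (length (filter (matchesBuggy "knicks") tweets)) -- prints 3
  print (length (filter (matchesFixed "knicks") tweets)) -- prints 1
```

The buggy version counts every line the matcher ran on, which would explain a count as inflated as the 119016 reported earlier.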

Sorting didn't have much effect on the time, but I've actually gotten a weaker CPU since the first time I ran this perf :smile:

Files are identical and sorted

It's been ~1 year since I touched this, but I actually came back to this code recently for some processing I needed to do at work, so... here it is :+1:

dimroc commented 6 years ago

🎉