dimroc / etl-language-comparison

Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented with different programming languages as an educational exercise.
http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

Update to Elixir 1.0 #2

Closed josevalim closed 10 years ago

josevalim commented 10 years ago

Hello @dimroc!

I took the liberty of upgrading your project to Elixir 1.0.

Meanwhile I have noticed two things:

  1. You were using a Map in the reduce stage, and maps can currently handle only a small set of keys efficiently. Most of your performance slowdown came from this, so I changed it to a HashDict. We plan to improve maps so they handle both small and large sets of keys efficiently, so that this is not a common pitfall in the future.
  2. The second commit is protocol consolidation. Because of how the VM loads modules, protocol dispatches are expensive during development because we need to check if a new version of the module is available. For production, the recommended approach is to consolidate protocols. So I have updated the script to do so.
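
For readers following along, the reduce stage from point 1 can be sketched roughly as below. This is a minimal illustration, not the code from this pull request: the `WordCountSketch` module, the `reduce_line/2` function, and the three-field tab-separated input are all assumptions made for the example. It accumulates into a plain map with `Map.update/4`; on Elixir 1.0 the same shape would apply with `HashDict.new/0` and `HashDict.update/4` as the accumulator.

```elixir
defmodule WordCountSketch do
  # Hypothetical reducer: bumps the counter for the second tab-separated field.
  def reduce_line([_id, hood | _rest], counts) do
    Map.update(counts, hood, 1, &(&1 + 1))
  end

  # Rows with fewer than two fields are skipped.
  def reduce_line(_other, counts), do: counts

  # Splits each line lazily, then folds the rows into a single map of counts.
  def count(lines) do
    lines
    |> Stream.map(&String.split(&1, "\t"))
    |> Enum.reduce(%{}, &reduce_line/2)
  end
end

# Made-up sample rows: id, neighborhood, text.
sample = ["1\tsoho\thello", "2\tchelsea\tknicks", "3\tsoho\tknicks"]
IO.inspect(WordCountSketch.count(sample))
```

The accumulator is the only part that changes between a map and a HashDict; the reducer signature (element first, accumulator second) stays the same because that is what `Enum.reduce/3` passes in.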

Those were very low-hanging-fruit optimizations, and I was able to get the time down to 1m50s on an old MacBook Pro with 4 cores. If you plan to rerun your tests after this pull request, I would love to know your new results!

dimroc commented 10 years ago

Thanks a lot @josevalim, appreciate you taking the time to give feedback!

Was stoked to see the improvement here especially. :+1:


-    stream = File.stream!(file)
-    |> Stream.map fn line -> String.split(line, "\t") end

-    # Attempted to 'fold' the stream into a map, but couldn't find the appropriate method.
-    # Settled for Enum.to_list.
-    map = List.foldl(Enum.to_list(stream), %{}, &Reducer.reduce_stream/2)

+    map =
+      File.stream!(file)
+      |> Stream.map(fn line -> String.split(line, "\t") end)
+      |> Enum.reduce(@dict, &reduce_stream/2)

josevalim commented 10 years ago

@dimroc Ah, I wanted to drop this comment. Streams, maps, and lists are all Enumerables, so you can use all of the Enum functions (and Stream) to work with them.
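
As a small illustration of that point, the sketch below runs the same `Enum.reduce/3` call over a list, a lazy stream, and a map. The values and variable names are made up for the example. This is also why the stream in the diff above can be piped straight into `Enum.reduce/3` without going through `Enum.to_list/1` first.

```elixir
# One summing function, reused for any Enumerable of numbers.
sum = fn enumerable -> Enum.reduce(enumerable, 0, fn x, acc -> x + acc end) end

# A plain list.
list_total = sum.([1, 2, 3])

# A lazy stream: elements are doubled only as the reduce consumes them.
stream_total = sum.(Stream.map([1, 2, 3], &(&1 * 2)))

# Enumerating a map yields {key, value} tuples, so the reducer destructures them.
map_total = Enum.reduce(%{a: 1, b: 2}, 0, fn {_key, value}, acc -> value + acc end)

IO.inspect({list_total, stream_total, map_total})
```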