dashbitco / nimble_csv

A simple and fast CSV parsing and dumping library for Elixir
https://hexdocs.pm/nimble_csv
772 stars 51 forks source link

Parsing slower after each parse_stream call #47

Closed pera closed 4 years ago

pera commented 4 years ago

Hi! I'm experiencing an odd issue when parsing large CSV files (in the order of hundreds of megabytes): basically if I call f/0 repeatedly, each new call is slower than the previous one

defmodule ElixirBug do
  def f do
    "data.csv"
    |> File.stream!()
    |> NimbleCSV.RFC4180.parse_stream()
    |> Enum.map(fn x -> Enum.at(x, 31) end) # column 31 is sometimes a quoted string
  end

  def test do
    Enum.map(1..5, fn n ->
      {t, x} = :timer.tc(&f/0)
      IO.puts("Round ##{n}: #{t/1_000_000}s")
      x
    end)
  end
end
Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Compiling 1 file (.ex)                                                                                                 
Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)                                           
iex(1)> ElixirBug.test                                                                                                 
Round #1: 4.212789s                                                                                                    
Round #2: 6.072119s                                                                                                    
Round #3: 8.055231s                                                                                                    
Round #4: 9.855778s                                                                                                    
Round #5: 12.035544s                                                                                                   
[                                                                                                                      
  ["RAILROAD R/W", ...

In addition, the iex session also seems slower: if I call recompile it takes a couple of seconds instead of a few milliseconds to finish (and since this library is relatively simple it makes me wonder if this could actually be a problem with BEAM).

josevalim commented 4 years ago

Can you please include your Elixir and OTP versions as well as a sample project with a CSV file that we can try directly? Thank you.

pera commented 4 years ago

Ah sorry I forgot the most important: https://drive.google.com/file/d/1Hajl4sGbbdiMj2uTw9Ux3hyLng6cEow2/view?usp=sharing

Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Elixir 1.10.1 (compiled with Erlang/OTP 22)

Thank you!

pera commented 4 years ago

...and here is everything as a project (the previous link was just the csv): https://drive.google.com/file/d/1QMvVb1pw_r7DiW08oQgNU_gMjMHNeOdX/view?usp=sharing

josevalim commented 4 years ago

This is happening because you are keeping the whole reference to the large binary as you iterate. This means a lot of allocations in memory, which translates to more time spent doing GC, etc. For example, if you change Enum.map to Enum.each, you will see constant times. But if you change |> Enum.map(fn x -> Enum.at(x, 31) end) to |> Enum.map(fn x -> :binary.copy(Enum.at(x, 31)) end), so you no longer keep references to the large binary, then it is both much faster and consistent even in the Enum.map case. We do talk about :binary.copy in our docs, but perhaps we should make it clearer.

josevalim commented 4 years ago

I have changed the docs to use copy by default.

pera commented 4 years ago

Thanks a lot @josevalim, now I have an excuse to learn about Erlang's GC :slightly_smiling_face: