dashbitco / nimble_csv

A simple and fast CSV parsing and dumping library for Elixir
https://hexdocs.pm/nimble_csv

parse_stream vs parse_string #73

Closed by klangner 1 year ago

klangner commented 1 year ago

I was comparing the eager (parse_string) version of the CSV parser:

airports_csv()
    |> File.read!()
    |> CSV.parse_string()
    |> Enum.map(fn row ->
      %{
        id: Enum.at(row, 0),
        type: Enum.at(row, 2),
        name: Enum.at(row, 3),
        country: Enum.at(row, 8)
      }
    end)
    |> Enum.reject(&(&1.type == "closed"))

with the stream version:

airports_csv()
    |> File.stream!()
    |> CSV.parse_stream()
    |> Stream.map(fn row ->
      %{
        id: :binary.copy(Enum.at(row, 0)),
        type: :binary.copy(Enum.at(row, 2)),
        name: :binary.copy(Enum.at(row, 3)),
        country: :binary.copy(Enum.at(row, 8))
      }
    end)
    |> Stream.reject(&(&1.type == "closed"))
    |> Enum.to_list()

When measuring both with :timer.tc/1, I noticed that the stream version is much slower: the first version takes around 3 seconds, while the stream version takes 44 seconds. I was expecting the stream version to be faster (below 1 second). Am I doing something wrong here?

I'm using :nimble_csv, "~> 1.2"

BTW, this example is taken from "Concurrent Data Processing with Elixir", where the stream version is 5 times faster.

josevalim commented 1 year ago

How did you benchmark? Make sure to use a tool like Benchee for benchmarking so you get accurate numbers.

Generally speaking, stream versions are slower because they trade higher CPU usage for lower memory usage. But the difference should not be that large.
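For reference, a Benchee comparison of the two pipelines could look roughly like the sketch below. It is only an illustration under some assumptions: the AirportsBench module, the to_airport/1 helper, the tiny sample file and the dependency versions are made up here, and the :binary.copy/1 calls are left out so both versions do the same work per row. Swap the path for the real airports file to get meaningful numbers.

```elixir
# Standalone script; the versions below are assumptions, adjust as needed.
Mix.install([{:nimble_csv, "~> 1.2"}, {:benchee, "~> 1.0"}])

defmodule AirportsBench do
  # Hypothetical helper module, not part of nimble_csv.
  alias NimbleCSV.RFC4180, as: CSV

  # Eager version: read the whole file into memory, then parse it.
  def eager(path) do
    path
    |> File.read!()
    |> CSV.parse_string()
    |> Enum.map(&to_airport/1)
    |> Enum.reject(&(&1.type == "closed"))
  end

  # Stream version: read and parse the file line by line.
  def lazy(path) do
    path
    |> File.stream!()
    |> CSV.parse_stream()
    |> Stream.map(&to_airport/1)
    |> Stream.reject(&(&1.type == "closed"))
    |> Enum.to_list()
  end

  # Same column positions as in the snippets above.
  defp to_airport(row) do
    %{
      id: Enum.at(row, 0),
      type: Enum.at(row, 2),
      name: Enum.at(row, 3),
      country: Enum.at(row, 8)
    }
  end
end

# Made-up sample data so the script runs on its own; the first line is a
# header, which both parse functions skip by default.
sample = """
id,ident,type,name,lat,lon,elev,continent,country
1,AAA,heliport,Sample Heliport,0.0,0.0,10,NA,US
2,BBB,closed,Sample Closed Field,0.0,0.0,20,NA,US
"""

path = Path.join(System.tmp_dir!(), "airports_sample.csv")
File.write!(path, sample)

# Run both versions with the same input under the same settings.
Benchee.run(
  %{
    "parse_string" => fn -> AirportsBench.eager(path) end,
    "parse_stream" => fn -> AirportsBench.lazy(path) end
  },
  warmup: 1,
  time: 2
)
```

Running both functions inside a single Benchee.run/2 call keeps warmup and measurement time identical, which makes the results easier to compare than two separate :timer.tc/1 calls.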

klangner commented 1 year ago

Thanks for the answer. I have just benchmarked it with Benchee, and here are the results:

parse_string

Name            ips        average  deviation         median         99th %
naive          0.35         2.84 s    ±11.24%         2.84 s         3.07 s

parse_stream

Name             ips        average  deviation         median         99th %
stream          2.00      499.84 ms     ±0.69%      500.08 ms      504.89 ms

So it looks like the stream version is faster. I'm not sure why, since I only added a dependency, but the numbers look OK and the code works fine, so I will close this issue. Thanks again for your time.