Slow parsing with refc binaries?

pera commented 4 years ago

I believe this is more likely an issue with OTP, but since I'm experiencing it while using NimbleCSV I thought it would be appropriate to first ask/report it here: when I try to parse a large CSV file (more than 100000 lines) where almost every row contains at least one string field longer than 64 bytes, it takes a very long time to finish. In comparison, when every field is 64 bytes or shorter, parsing is almost always immediate.

Here is what I'm doing to test this behavior:

Example

defmodule CsvTest do
  def parse(name) do
    Path.join(["priv", name])
    |> File.stream!
    |> NimbleCSV.RFC4180.parse_stream
    |> Enum.map(fn [_, x, _] -> :binary.copy(x) end)
  end

  # 1.csv contains 100000 lines of: 1,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,3
  # 2.csv contains 100000 lines of: 1,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,3
  def test do
    Enum.each(~w(1.csv 2.csv), fn name ->
      {t, _} = :timer.tc(&parse/1, [name])
      IO.puts("#{name}: #{t/1_000_000}s")
    end)
  end
end

Version

Erlang/OTP 23 [erts-11.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

IEx 1.10.3 (compiled with Erlang/OTP 22)

These are the results I get:

1.csv: 0.239626s
2.csv: 8.42937s

The first file, which I believe only needs heap binaries, is parsed 35x faster than the one that requires refc binaries. Interestingly, the slowdown grows superlinearly (e.g. if the files were 140 thousand lines long, the difference would be 50-fold). Last night, while playing a bit with all this (and after reading this issue with :binary.split), I found that passing read_ahead: 1 (for instance) or encoding: :utf8 to File.stream! seems to fix the problem, but I'm not sure why :shrug:
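
For reference, the workaround described above could look roughly like the sketch below. It assumes the same priv/ layout as the test module. The module name CsvWorkaround and the extra modes argument are hypothetical, added only for illustration. The two option lists in the trailing comments are the ones reported to help in this report, not a general recommendation.

```elixir
defmodule CsvWorkaround do
  # Hypothetical variant of CsvTest.parse/1 that forwards `modes`
  # to File.stream!/2 so different open options can be compared.
  def parse(name, modes \\ []) do
    Path.join(["priv", name])
    |> File.stream!(modes)
    |> NimbleCSV.RFC4180.parse_stream()
    |> Enum.map(fn [_, x, _] -> :binary.copy(x) end)
  end
end

# The two variants reported above to avoid the slowdown:
# CsvWorkaround.parse("2.csv", read_ahead: 1)
# CsvWorkaround.parse("2.csv", encoding: :utf8)
```

Calling CsvWorkaround.parse/1 with no modes should behave like the original parse/1, which makes it a convenient baseline when timing the variants with :timer.tc.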

Thanks!

josevalim commented 4 years ago

Thanks @pera! Quick question: what happens if you load the whole file into memory and then call NimbleCSV on the binary instead of on the file?

pera commented 4 years ago

@josevalim ah didn't try that, but yeah now they're parsed almost instantly... so this is something to do with File.stream!? thx
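
A minimal sketch of the in-memory variant, assuming the same priv/ layout as the test module above. The module name CsvInMemory is hypothetical. It reads the whole file with File.read! and hands the resulting binary to parse_string instead of streaming lines through parse_stream.

```elixir
defmodule CsvInMemory do
  # Hypothetical counterpart to CsvTest.parse/1: read the file eagerly,
  # then parse the single binary instead of a stream of lines.
  def parse(name) do
    Path.join(["priv", name])
    |> File.read!()
    |> NimbleCSV.RFC4180.parse_string()
    |> Enum.map(fn [_, x, _] -> :binary.copy(x) end)
  end
end
```

This holds the entire file in memory at once, so it trades memory for speed. Like parse_stream, parse_string skips the first row as a header by default.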

pera commented 4 years ago

I found another ticket from this year in relation to File.stream!: https://github.com/elixir-lang/elixir/issues/9956

dashbitco / nimble_csv

Slow parsing with refc binaries? #52