Closed pera closed 4 years ago
Thanks @pera! Quick question: what happens if you load the whole file into memory and then call NimbleCSV on the binary instead of on the file?
@josevalim ah didn't try that, but yeah now they're parsed almost instantly... so this is something to do with File.stream!? thx
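The comparison being discussed could be sketched like this (a self-contained `.exs` script using NimbleCSV's default RFC 4180 parser; the file name and row contents are made up for illustration):

```elixir
# Sketch of the suggested test: parse via File.stream! vs. parse the
# whole file as one in-memory binary. File name/contents are illustrative.
Mix.install([{:nimble_csv, "~> 1.2"}])

alias NimbleCSV.RFC4180, as: CSV

# Generate a small CSV whose payload field is longer than 64 bytes, so
# each field becomes a reference-counted (refc) binary on the BEAM.
long = String.duplicate("x", 80)
rows = for i <- 1..1_000, do: "#{i},#{long}\n"
File.write!("test.csv", ["id,payload\n" | rows])

# Streaming parse: each line goes through File.stream!.
streamed =
  "test.csv"
  |> File.stream!()
  |> CSV.parse_stream()
  |> Enum.to_list()

# In-memory parse: read the whole file first, then parse the binary.
in_memory =
  "test.csv"
  |> File.read!()
  |> CSV.parse_string()

# Both yield the same rows; only the timing differs on large inputs.
IO.puts(streamed == in_memory)
```

Both pipelines produce identical rows (headers are skipped by default in each), which makes the timing difference the only variable.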
I found another ticket from this year in relation to File.stream!: https://github.com/elixir-lang/elixir/issues/9956
I believe this is more likely an issue with OTP, but since I'm experiencing it while using NimbleCSV I thought it would be appropriate to first ask/report it here: when I try to parse a large CSV file (more than 100,000 lines) where almost every row contains at least one string field longer than 64 bytes, it takes a very long time to finish. In comparison, when every field is 64 bytes or shorter, the parsing is always almost immediate.
Here is what I'm doing to test this behavior:
Example
Version
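The example and version output themselves are not captured in this excerpt; a hypothetical reconstruction of such a test (file names, row counts, and field sizes are made up, and the row count is kept small here so the script finishes quickly) might look like:

```elixir
# Hypothetical reconstruction: time File.stream!-based parsing of a file
# with small fields (heap binaries) vs. one with >64-byte fields (refc
# binaries). All names and sizes are illustrative.
Mix.install([{:nimble_csv, "~> 1.2"}])

alias NimbleCSV.RFC4180, as: CSV

write_csv = fn path, field_size ->
  field = String.duplicate("a", field_size)
  rows = for _ <- 1..10_000, do: "#{field},#{field},#{field}\n"
  File.write!(path, rows)
end

# 10-byte fields fit in heap binaries; 80-byte fields become refc binaries.
write_csv.("small_fields.csv", 10)
write_csv.("large_fields.csv", 80)

time_parse = fn path ->
  {usec, rows} =
    :timer.tc(fn ->
      path
      |> File.stream!()
      |> CSV.parse_stream(skip_headers: false)
      |> Enum.to_list()
    end)

  IO.puts("#{path}: #{div(usec, 1000)} ms for #{length(rows)} rows")
  length(rows)
end

small_count = time_parse.("small_fields.csv")
large_count = time_parse.("large_fields.csv")
```

Scaling the row count up toward the 100,000 lines mentioned in the report is where the difference between the two files becomes pronounced.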
These are the results I get:
The first file, which I believe only needs heap binaries, is parsed 35x faster than the one that requires refc binaries. Interestingly, the slowdown is superlinear (e.g. if the files were 140 thousand lines long, the difference would be 50-fold). Last night, while playing a bit with all this (and after reading this issue with :binary.split), I found that passing `read_ahead: 1` (for instance) or `encoding: :utf8` to File.stream! seems to fix the problem, but I'm not sure why :shrug: Thanks!
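The workaround described above amounts to passing extra modes to File.stream!; a minimal self-contained sketch (file name and contents are illustrative):

```elixir
# Sketch of the workaround: pass :read_ahead (or :encoding) as a mode
# to File.stream!. File name and contents are made up for illustration.
Mix.install([{:nimble_csv, "~> 1.2"}])

alias NimbleCSV.RFC4180, as: CSV

long = String.duplicate("x", 80)
File.write!("wide.csv", for(i <- 1..100, do: "#{i},#{long}\n"))

# read_ahead: 1 (any size) reportedly avoids the slowdown;
# encoding: :utf8 is the other option mentioned above.
rows =
  "wide.csv"
  |> File.stream!([read_ahead: 1])
  |> CSV.parse_stream(skip_headers: false)
  |> Enum.to_list()

IO.puts(length(rows))
```

The modes list is forwarded to File.open, so `read_ahead: 1` becomes the `{:read_ahead, 1}` open mode, which changes how the underlying file is buffered and read.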