JuliaData / CSV.jl

Utility library for working with CSV and other delimited files in the Julia programming language
https://csv.juliadata.org/

Issue when reading file by chunks #959

Open · rvignolo-julius opened this issue 2 years ago

rvignolo-julius commented 2 years ago

Hi,

I have a CSV file with approximately 6.5M rows. When loading it in chunks, I noticed the following:

using CSV
using DataFrames
using BenchmarkTools
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=        2, limit=1_000_00, header=true, ntasks=1); # 77.412 ms (1661 allocations: 27.48 MiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=1_000_002, limit=1_000_00, header=true, ntasks=1); # 13.741 s (165783320 allocations: 2.56 GiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=2_000_002, limit=1_000_00, header=true, ntasks=1); # 27.617 s (333119407 allocations: 5.11 GiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=3_000_002, limit=1_000_00, header=true, ntasks=1); # 41.784 s (500520198 allocations: 7.66 GiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=4_000_002, limit=1_000_00, header=true, ntasks=1); # 56.294 s (667717786 allocations: 10.22 GiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=5_000_002, limit=1_000_00, header=true, ntasks=1); # 69.886 s (835307531 allocations: 12.77 GiB)
@btime CSV.read("data.csv", DataFrames.DataFrame; skipto=6_000_002, limit=1_000_00, header=true, ntasks=1); # 83.741 s (1002749189 allocations: 15.33 GiB)

The time grows much more than expected: it increases roughly linearly with skipto, as if all the skipped rows were still being parsed on every call. Is there another approach I could take? Are there unwanted allocations?
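Would something like CSV.Chunks be the intended way to do this? A minimal sketch of what I have in mind, assuming sequential processing of the chunks is acceptable and that ntasks=7 is a reasonable guess for getting roughly 1M-row chunks out of a ~6.5M-row file:

using CSV
using DataFrames

# Iterate the file once in pieces instead of re-reading it with skipto.
# CSV.Chunks yields one CSV.File per chunk; ntasks controls how many chunks
# the file is split into (ntasks=7 is only an assumption for ~1M-row chunks).
for (i, chunk) in enumerate(CSV.Chunks("data.csv"; header=true, ntasks=7))
    df = DataFrame(chunk)
    # ... process df here ...
    @show i, nrow(df)
end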

Thank you for the amazing work!

rvignolo-julius commented 2 years ago

Hi, any ideas regarding what could be happening? Thanks!