JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Attempt to improve efficiency of read_parquet by using ChainedVector … #128

Closed · quinnj closed this 3 years ago

quinnj commented 3 years ago

…for multichunk tables

cc: @tanmaykm @xiaodaigh

This proposes a dedicated Parquet.Table type to be returned from read_parquet. It should be non-breaking, since you can "use" the result the same way as before; i.e. DataFrame(read_parquet(file)) still works. This avoids the call to reduce(vcat, ...), which can be quite costly for very large files with many chunks. Instead, chunks are lazily "vcat"ed using the SentinelArrays.ChainedVector array type. This is the same approach used by CSV.jl, Arrow.jl, and the Arrow C++/pyarrow projects (though they call their array type ChunkedArray).
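For illustration, a minimal sketch of the difference between the two approaches (the chunk sizes and variable names here are made up, not taken from the PR):

```julia
using SentinelArrays

# stand-in for the per-row-group column vectors a multichunk file produces
chunks = [rand(10^6) for _ in 1:8]

# eager approach: reduce(vcat, ...) copies everything into one new array
eager = reduce(vcat, chunks)

# lazy approach: ChainedVector wraps the chunks and presents them as a
# single AbstractVector; no copying, and indexing spans chunk boundaries
lazy = ChainedVector(chunks)

@assert length(lazy) == length(eager)
@assert lazy[10^6 + 1] == chunks[2][1]
```

Because the returned table still satisfies the Tables.jl interface, downstream code like DataFrame(read_parquet(file)) keeps working unchanged; the columns simply happen to be ChainedVectors rather than freshly copied arrays.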

This is my first foray into the parquet code, so forgive me if I've assumed something wrong or misstepped. I'm not quite sure how the testing is set up, but I wanted to put up the PR to get review first and see if anything breaks unintentionally.

xiaodaigh commented 3 years ago

Great idea! It hadn't occurred to me that the Julia ecosystem should define its own types like this, but it's the logical approach rather than relying on others.

tanmaykm commented 3 years ago

:+1: LGTM! Will add a commit in a bit to make the tests pass.

tanmaykm commented 3 years ago

https://github.com/JuliaIO/Parquet.jl/pull/130 should make the tests pass

tanmaykm commented 3 years ago

With some changes, it seems possible to load the chunks lazily on demand as well. That may be a good future enhancement.
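One way that could look (purely a hypothetical sketch, not what this PR or #130 implements): a chunk wrapper that defers the actual row-group read until the chunk is first accessed, which could then be chained just like an eager chunk.

```julia
# Hypothetical sketch of on-demand chunk loading; `load` stands in for
# whatever function would read one row group's column from the file.
mutable struct LazyChunk{T} <: AbstractVector{T}
    load::Function                    # reads and returns the chunk's data
    data::Union{Nothing, Vector{T}}  # nothing until first accessed
    len::Int                          # known up front from row-group metadata
end

Base.size(c::LazyChunk) = (c.len,)

function Base.getindex(c::LazyChunk, i::Int)
    c.data === nothing && (c.data = c.load())  # materialize on first access
    return c.data[i]
end

# demo: the closure runs only when an element is first indexed
lc = LazyChunk{Int}(() -> collect(1:5), nothing, 5)
@assert lc[3] == 3
```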

tanmaykm commented 3 years ago

CI passes, will merge this in a bit