Great idea! It didn't occur to me that the Julia ecosystem should define its own types like this, but it's the logical approach rather than relying on others.
:+1: LGTM! Will add a commit in a bit to make the tests pass.
https://github.com/JuliaIO/Parquet.jl/pull/130 should make the tests pass
With some changes, it seems possible to load the chunks lazily on demand as well. That may be a good future enhancement.
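Something along these lines might work (purely a hypothetical sketch, not Parquet.jl's API: the `LazyChunks` type, the zero-argument loader closures, and the equal-length-chunks assumption are all mine); each chunk would be materialized only the first time it's indexed:

```julia
# Hypothetical sketch of on-demand chunk loading; not Parquet.jl's actual API.
struct LazyChunks{T} <: AbstractVector{T}
    loaders::Vector{Function}                  # one zero-arg loader per chunk (e.g. per row group)
    cache::Vector{Union{Nothing, Vector{T}}}   # chunks materialized so far
    chunklen::Int                              # assume equal-length chunks for simplicity
end

LazyChunks{T}(loaders::Vector{Function}, chunklen::Int) where {T} =
    LazyChunks{T}(loaders, Vector{Union{Nothing, Vector{T}}}(nothing, length(loaders)), chunklen)

Base.size(v::LazyChunks) = (v.chunklen * length(v.loaders),)

function Base.getindex(v::LazyChunks, i::Int)
    c, j = divrem(i - 1, v.chunklen) .+ 1      # chunk index and offset within chunk
    if v.cache[c] === nothing
        v.cache[c] = v.loaders[c]()            # read the chunk on first access
    end
    return v.cache[c][j]
end

# Only the second chunk is ever loaded here:
v = LazyChunks{Int}(Function[() -> [1, 2, 3], () -> [4, 5, 6]], 3)
v[5]  # == 5
```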
CI passes, will merge this in a bit
…for multichunk tables
cc: @tanmaykm @xiaodaigh
This proposes a dedicated `Parquet.Table` type to be returned from `read_parquet`. It should be non-breaking, since you can "use" the result the same as you did before; i.e. `DataFrame(read_parquet(file))` still works. This avoids the call to `reduce(vcat, ...)`, which can be quite costly for very large files with many chunks. Instead, chunks are lazily "vcat"ed using the `SentinelArrays.ChainedVector` array type. This is the same approach used by CSV.jl, Arrow.jl, and the general arrow C++/pyarrow projects (though they call their array type `ChunkedArray`).

This is my first foray into the Parquet code, so forgive me if I've assumed something wrongly or misstepped. I'm not quite sure how the testing is set up, but I wanted to put the PR up first to get review and see if anything breaks unintentionally.
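To illustrate the core idea (a minimal sketch with made-up chunk data; in the PR the chunks come from the file's row groups): `ChainedVector` presents several chunk arrays as a single `AbstractVector` without copying them up front, so downstream consumers like `DataFrame` can treat it as an ordinary column:

```julia
using SentinelArrays, DataFrames

# Made-up chunks, standing in for a column's values from two row groups.
chunk1 = [1, 2, 3]
chunk2 = [4, 5, 6]

# Eager approach: copies every element into one new array.
eager = reduce(vcat, [chunk1, chunk2])

# Lazy approach: wraps the chunks; no up-front copy is made.
lazy = ChainedVector([chunk1, chunk2])

@assert length(lazy) == 6 && lazy[5] == eager[5]

# ChainedVector <: AbstractVector, so it works as an ordinary column:
df = DataFrame(a = lazy)
```

Indexing resolves to the owning chunk via the stored chunk lengths, so random access stays cheap while the full copy that `reduce(vcat, ...)` performs is avoided entirely.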