How to read parquet files, partitioned datasets?

mahiki commented 3 years ago

Sorry if this is a ridiculous question, I am very noob and not good at reading the API.

My use case is reading partitioned parquet files. I know from the Apache PyArrow docs that this is supported there with something like:

import pyarrow.parquet as pq
table = pq.read_table('dataset_name')

I tried the very naive:

using Arrow
table = Arrow.Table("path/dataset/col_1=XYZ/part-000.parquet")

ERROR: type Nothing has no field fields
Stacktrace:
 [1] getproperty(x::Nothing, f::Symbol)
   @ Base ./Base.jl:33
 [2] Arrow.Table(bytes::Vector{UInt8}, off::Int64, tlen::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/NxhTc/src/table.jl:300
 [3] Table
   @ ~/.julia/packages/Arrow/NxhTc/src/table.jl:214 [inlined]
 [4] Arrow.Table(str::String, pos::Int64, len::Nothing; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Arrow ~/.julia/packages/Arrow/NxhTc/src/table.jl:209
 [5] Table (repeats 2 times)
   @ ~/.julia/packages/Arrow/NxhTc/src/table.jl:209 [inlined]
 [6] top-level scope
   @ REPL[102]:1

quinnj commented 3 years ago

As opposed to the Python implementation, the Arrow.jl Julia package doesn't require arrow-specific integration with other data formats. So to read a parquet file, you can use the Parquet.jl package (which should support partitioned datasets like this), e.g. tbl = read_parquet(filename). You can then convert the parquet data to the arrow format with Arrow.write("data.arrow", tbl). This data can then be read back in via tbl2 = Arrow.Table("data.arrow").
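The steps above could be sketched roughly as follows. This is a minimal sketch, assuming Parquet.jl exports read_parquet and returns a Tables.jl-compatible table; the helper name parquet_to_arrow and the paths in the example call are hypothetical. Whether partitioned directories and Date columns load correctly depends on the Parquet.jl version.

```julia
using Parquet, Arrow

# Hypothetical helper (not part of either package): read parquet data with
# Parquet.jl, write it out in the Arrow format, and return the Arrow table.
function parquet_to_arrow(parquet_path::AbstractString, arrow_path::AbstractString)
    # Assumption: read_parquet returns a Tables.jl-compatible table.
    tbl = read_parquet(parquet_path)
    # Arrow.write accepts any Tables.jl-compatible source.
    Arrow.write(arrow_path, tbl)
    # Read the Arrow file back in as an Arrow.Table.
    return Arrow.Table(arrow_path)
end

# Example call (paths are placeholders):
# tbl2 = parquet_to_arrow("path/dataset", "data.arrow")
```

Wrapping the three calls in one function keeps the parquet-specific step in Parquet.jl and the arrow-specific steps in Arrow.jl, which matches the division of labor described above.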

We can perhaps look into providing a way to convert non-arrow data tables directly to Arrow.Table, but as I mentioned, the value isn't as great as in other language implementations, where each conversion has to be implemented one by one.

mahiki commented 3 years ago

Thanks, that could be helpful.

Unfortunately, Parquet.jl is failing to recognize partitions, and it doesn't support the Date data type. (Issues are open for both.)

I was looking to Arrow.jl as a potential workaround, but it seems like this is not possible, from what you say.

quinnj commented 3 years ago

Correct; Arrow.jl is for arrow data, not parquet. If you open an issue in the Parquet.jl repository, @tanmaykm has been very responsive in the past about fixing things.

tanmaykm commented 3 years ago

Sorry, I was away for a while and have missed the issues on Parquet.jl. Thanks for the ping, @quinnj!

xiaodaigh commented 3 years ago

As opposed to the python implementation, the Arrow.jl julia package doesn't require arrow-specific integration with other data formats

Same for R. But I think Julia rightly has a more modular mentality. Since packages in Julia are more composable, there is no need for a "big bang" approach that stuffs all the functionality into one big package.

apache/arrow-julia issue #227