JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader
Other
119 stars 32 forks source link

updated thrift compiled files and fixed reader #58

Closed xiaodaigh closed 4 years ago

cpfiffer commented 4 years ago

This doesn't seem to work using the synthetic data file, which I think should be added to the test suite and tested against (either here or downstream in ParquetFiles.jl).

My code:

using Parquet
using ParquetFiles
using DataFrames

path = "synthetic_data.parquet"

pq = DataFrame(load(path))

The error:

ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
 [1] getproperty(::ParquetFiles.RCType256, ::Symbol) at ./Base.jl:33
 [2] macro expansion at /home/cameron/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:48 [inlined]
 [3] iterate(::ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:x_1, :x_2, :x_3, :colour, :class),Tuple{Float64,Float64,Float64,String,String}},ParquetFiles.RCType256}) at /home/cameron/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:40
 [4] iterate at /home/cameron/.julia/packages/Tables/okt7x/src/tofromdatavalues.jl:45 [inlined]
 [5] iterate at ./iterators.jl:139 [inlined]
 [6] iterate at ./iterators.jl:138 [inlined]
 [7] buildcolumns(::Tables.Schema{(:x_1, :x_2, :x_3, :colour, :class),Tuple{Float64,Float64,Float64,String,String}}, ::Tables.IteratorWrapper{ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:x_1, :x_2, :x_3, :colour, :class),Tuple{Float64,Float64,Float64,String,String}},ParquetFiles.RCType256}}) at /home/cameron/.julia/packages/Tables/okt7x/src/fallbacks.jl:126
 [8] columns at /home/cameron/.julia/packages/Tables/okt7x/src/fallbacks.jl:237 [inlined]
 [9] DataFrame(::ParquetFiles.ParquetFile; copycols::Bool) at /home/cameron/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [10] DataFrame(::ParquetFiles.ParquetFile) at /home/cameron/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [11] top-level scope at /home/cameron/.julia/dev/Parquet/tst.jl:8
 [12] include(::String) at ./client.jl:439
 [13] top-level scope at REPL[10]:1
 [14] eval(::Module, ::Any) at ./boot.jl:331
 [15] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [16] run_backend(::REPL.REPLBackend) at /home/cameron/.julia/packages/Revise/MgvIv/src/Revise.jl:1023
 [17] top-level scope at REPL[4]:0
in expression starting at /home/cameron/.julia/dev/Parquet/tst.jl:8
xiaodaigh commented 4 years ago

Hmm, i wonder if it needs to be merged with #51 as well. Let me try it once i get back

xiaodaigh commented 4 years ago

I see, the files can't be read by the ParquetFiles implementation which is relying on some row-wise iteration cursor (?not 100 sure). The codebase will take some time to digest and I am not sure if it's worth it. Because, parquet is a columnar-based format and a row-based iteration to read the data seems to defeat the purpose a little. Perhaps the original author can comment.

Bottom line, I will not spend time and energy fixing the row iteration approach. Instead, I will work to put the column-based approach that works in Diban.jl back into Parquet.jl but also release Diban.jl if the PR to parquet.jl takes too long.

davidanthoff commented 4 years ago

I do think we should only merge PRs here that don't break downstream packages, so keeping the row streaming stuff intact would be good.

xiaodaigh commented 4 years ago

Yeah, I will try not to touch the row iterator thing and try to update the column-wise read.

The thing is, if it's broken for 70% of files I tested, are there many ppl dependent on it and if there are why are they?

tanmaykm commented 4 years ago

Thanks @xiaodaigh . I can confirm the changes look fine.

It will be good to have two clean separate commits - one with the changes to update thrift specs, and the other with the fix to reader.jl. Also, add a test for this condition, maybe using the same synthetic_data.parquet. Will wait a bit, if you would like to make those changes.

xiaodaigh commented 4 years ago

Thanks @xiaodaigh . I can confirm the changes look fine.

It will be good to have two clean separate commits - one with the changes to update thrift specs, and the other with the fix to reader.jl. Also, add a test for this condition, maybe using the same synthetic_data.parquet. Will wait a bit, if you would like to make those changes.

Ok. I will break it into two

tanmaykm commented 4 years ago

:+1: Just some rebase to get two separate commits in this PR would be fine too.

xiaodaigh commented 4 years ago

Done

see #64 #63