JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Out of Bounds Error on a Parquet file containing Int96 #27

Closed Deduction42 closed 4 years ago

Deduction42 commented 5 years ago

In trying to work through your example with one of my own parquet files,

using Parquet
PQ = ParFile("MyFile.parquet")
schema(JuliaConverter(Main), PQ, :T_TREND)
rc = RecCursor(PQ, 1:5, colnames(PQ), JuliaBuilder(PQ, T_TREND))

I am running into this error (which I also run into when I use ParquetFiles.jl)

ERROR: LoadError: BoundsError: attempt to access 78497-element Array{Int128,1} at index [86164]
Stacktrace:
 [1] getindex at .\array.jl:731 [inlined]
 [2] #7 at .\none:0 [inlined]
 [3] iterate at .\generator.jl:47 [inlined]
 [4] collect_to! at .\array.jl:656 [inlined]
 [5] collect_to_with_first!(::Array{Int128,1}, ::Int128, ::Base.Generator{Array{Int32,1},getfield(Parquet, Symbol("##7#8")){Array{Int128,1}}}, ::Int64) at .\array.jl:643
 [6] collect at .\array.jl:624 [inlined]
 [7] map_dict_vals(::Array{Int128,1}, ::Array{Int32,1}) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:160
 [8] macro expansion at .\logging.jl:310 [inlined]
 [9] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:180
 [10] setrow(::ColCursor{Int128}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:144
 [11] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:115
 [12] (::getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64})(::String) at .\none:0
 [13] iterate at .\generator.jl:47 [inlined]
 [14] collect_to!(::Array{ColCursor{Array{UInt8,1}},1}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at .\array.jl:656
 [15] collect_to_with_first!(::Array{ColCursor{Array{UInt8,1}},1}, ::ColCursor{Array{UInt8,1}}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64) at .\array.jl:643
 [16] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}) at .\array.jl:624
 [17] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{T_TREND}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:269 (repeats 2 times)
 [18] top-level scope at none:0
in expression starting at C:\Users\XXX\Desktop\Analysis\Main.jl:15
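The failing frame is `map_dict_vals`, which maps dictionary-page indices back to decoded values. A minimal Python sketch of that kind of mapping (my own illustrative code, not Parquet.jl's implementation) shows how an index larger than the decoded dictionary produces exactly this class of bounds error:

```python
def map_dict_vals(dict_vals, indices):
    """Map dictionary-encoded indices back to their values.

    If an index exceeds the dictionary length (e.g. because the
    reader decoded fewer dictionary entries than the data pages
    reference), the lookup fails with a bounds error like the
    one in the stack trace above.
    """
    n = len(dict_vals)
    out = []
    for i in indices:
        if not 0 <= i < n:
            raise IndexError(
                f"attempt to access {n}-element dictionary at index [{i}]"
            )
        out.append(dict_vals[i])
    return out
```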

Now I believe this might have something to do with the fact that one of the columns is formatted as an Int96. By using schema I found this:

spark_schema {
  optional INT96 time
  optional DOUBLE smh
  optional BYTE_ARRAY name# (from UTF8)
  optional DOUBLE value
  optional BYTE_ARRAY unit# (from UTF8)
  optional BYTE_ARRAY condition# (from UTF8)
  optional BYTE_ARRAY type# (from UTF8)
  optional BYTE_ARRAY source# (from UTF8)
  optional BYTE_ARRAY serialnumber# (from UTF8)
}

In addition, I'm not sure what dataset sizes this tool was tested on. My Parquet file has 20+ million rows.

Deduction42 commented 5 years ago

Just as an update, I tried a smaller file (that was similarly formatted) and I'm getting the same out of bounds error on a different scale:

LoadError: BoundsError: attempt to access 65137-element Array{Int128,1} at index [65281]

My guess is that something lost track of the array size when the Int96 Parquet object was converted to Int128 in Julia.
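For reference, a Parquet INT96 timestamp is 12 little-endian bytes: an 8-byte count of nanoseconds within the day, followed by a 4-byte Julian day number. A hedged Python sketch of the conversion (helper name is my own, not Parquet.jl's API):

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_datetime(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp.

    Layout (little-endian): 8 bytes nanoseconds-of-day,
    then 4 bytes Julian day number.
    """
    nanos, jday = struct.unpack("<qi", raw)
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=jday - JULIAN_EPOCH,
                        microseconds=nanos // 1000))
```

Since the 12-byte value does not fit a 64-bit integer, widening it to Int128 (as the stack trace shows Parquet.jl doing) is a natural intermediate representation.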

tanmaykm commented 5 years ago

Thanks for reporting. It would help if you could share a sample Parquet file to replicate the issue.

Also note: https://issues.apache.org/jira/browse/PARQUET-323

Deduction42 commented 5 years ago

I found another issue with a different parquet file

https://github.com/queryverse/ParquetFiles.jl/issues/9#issue-399120319

Both this file and the previous file worked flawlessly in Python. I can send you this file. How would you like me to send it?

tanmaykm commented 5 years ago

Great. I can maybe download from Google drive or Dropbox, if that works for you?

Deduction42 commented 5 years ago

Try this. It's a parquet file that is generated by Azure Time Series Insights, so you definitely want to make sure those kinds of files work.

https://drive.google.com/open?id=1Z6f9y2ZRR6jBgDJGIOrmbzqq6c7PtTIY

tanmaykm commented 5 years ago

With this file, even schema parsing fails:

julia> using Parquet

julia> PQ = ParFile("TestFile.parquet")
Parquet file: TestFile.parquet
    version: 1
    nrows: 613148
    created by: parquet-cpp version 1.1.1-SNAPSHOT
    cached: 0 column chunks

julia> schema(JuliaConverter(Main), PQ, :T_TREND)
ERROR: Base.Meta.ParseError("extra token \"Vector\" after end of expression")
Stacktrace:
 [1] #parse#1(::Bool, ::Bool, ::Bool, ::Function, ::String, ::Int64) at ./meta.jl:129
 [2] #parse at ./none:0 [inlined]
 [3] #parse#4(::Bool, ::Bool, ::Function, ::String) at ./meta.jl:164
 [4] parse at ./meta.jl:164 [inlined]
 [5] schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at /home/tan/.julia/dev/Parquet/src/schema.jl:230
 [6] schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at /home/tan/.julia/dev/Parquet/src/schema.jl:224
 [7] schema(::JuliaConverter, ::ParFile, ::Symbol) at /home/tan/.julia/dev/Parquet/src/reader.jl:66
 [8] top-level scope at none:0
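One plausible cause of generated-code parse failures like this is a column name that is not a valid identifier in the target language, so the type expression built from the schema fails `Meta.parse`. A generic sanitization step is the usual defense; the sketch below illustrates the general technique in Python (my own helper, not the fix Parquet.jl eventually shipped):

```python
import re

def sanitize_identifier(name: str) -> str:
    """Turn an arbitrary column name into a valid identifier
    for generated code: replace non-word characters with '_'
    and guard against a leading digit or an empty result."""
    clean = re.sub(r"\W", "_", name)
    if clean and clean[0].isdigit():
        clean = "_" + clean
    return clean or "_"
```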

Deduction42 commented 5 years ago

I realized that this is a legitimately difficult type of file to work with. When working with Azure, read_parquet_file(path=x) did not work, but read_parquet_dataset(path=x) did.

The documentation below mentions why they are different.

A Parquet Dataset is different from a Parquet file in that it could be a folder containing a number of Parquet files. It could also have a hierarchical structure that partitions the data by the value of a column. These more complex forms of Parquet data are commonly produced by Spark/HIVE. read_parquet_dataset will read these more complex datasets using pyarrow, which handles complex Parquet layouts well. It will also handle single Parquet files, or folders full of only single Parquet files, though these are better read using read_parquet_file, as it doesn't use pyarrow for reading and should be significantly faster than using pyarrow.

Deduction42 commented 5 years ago

In addition, I have tested this file using multiple toolsets:

a. Julia’s Parquet reader couldn’t handle it
b. Apache Parquet viewer couldn’t handle it (error in reading from files with more than one row group)
c. Azure’s dprep.read_parquet_file(path=Location) couldn’t handle it
d. Azure’s dprep.read_parquet_dataset(path=Location) COULD handle it
e. pandas’ pd.read_parquet('TestFile.parquet') COULD handle it

I'm not sure whether reading this type of Parquet file is out of scope; however, since this is a format a lot of people will probably be using in the future, it is probably a feature you would want to add.

tanmaykm commented 4 years ago

Probably fixed by #42.

xiaodaigh commented 4 years ago

if reading this type of parquet file is out of scope or not;

My 2 cents is that, given Int96 is deprecated in the Parquet format, Parquet.jl is too resource-constrained to support it.

tanmaykm commented 4 years ago

Int96 timestamps seem to be working fine now. Closing. Please reopen if this is still an issue.