Deduction42 closed this issue 4 years ago
Just as an update, I tried a smaller, similarly formatted file and I'm getting the same out-of-bounds error, only with different sizes:
LoadError: BoundsError: attempt to access 65137-element Array{Int128,1} at index [65281]
My guess is that something lost track of the array size when the Int96 Parquet object was converted to Int128 in Julia.
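For context on that guess, a Parquet INT96 timestamp occupies 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number. Below is a minimal Python sketch of decoding such values and of packing one into a single wide integer, as an Int128 would hold it. The function names are hypothetical and this is not Parquet.jl's code; it only illustrates that the element count has to come from the byte length divided by 12.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode one 12-byte INT96 timestamp (hypothetical helper)."""
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    # "<qi": little-endian int64 (nanoseconds of day) + int32 (Julian day)
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime only has microsecond resolution, so nanoseconds are truncated
    return epoch + timedelta(days=days, microseconds=nanos_of_day // 1000)


def int96_as_wide_int(raw: bytes) -> int:
    """Pack the same 12 bytes into one integer, as an Int128 would hold them."""
    return int.from_bytes(raw, "little", signed=True)


def decode_int96_page(buf: bytes) -> list:
    """Decode a buffer of back-to-back INT96 values (12 bytes each)."""
    if len(buf) % 12 != 0:
        raise ValueError("buffer length is not a multiple of 12")
    return [decode_int96_timestamp(buf[i:i + 12]) for i in range(0, len(buf), 12)]
```

If a reader sized its output as if each value were 16 bytes (or 8), the number of decoded elements would no longer match the count recorded in the file, which is one way an index could end up past the end of the array. That is only a possible explanation, not a confirmed diagnosis.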
Thanks for reporting. It would help if you could share a sample Parquet file to replicate the issue.
Also note:
I found another issue with a different parquet file
Both this file and the previous file worked flawlessly in Python. I have the ability to send you this file. How did you want me to send it?
Great. I can maybe download from Google drive or Dropbox, if that works for you?
Try this. It's a parquet file that is generated by Azure Time Series Insights, so you definitely want to make sure those kinds of files work.
With this file, even schema parsing fails:
julia> using Parquet
julia> PQ = ParFile("TestFile.parquet")
Parquet file: TestFile.parquet
version: 1
nrows: 613148
created by: parquet-cpp version 1.1.1-SNAPSHOT
cached: 0 column chunks
julia> schema(JuliaConverter(Main), PQ, :T_TREND)
ERROR: Base.Meta.ParseError("extra token \"Vector\" after end of expression")
[1] #parse#1(::Bool, ::Bool, ::Bool, ::Function, ::String, ::Int64) at ./meta.jl:129
[2] #parse at ./none:0 [inlined]
[3] #parse#4(::Bool, ::Bool, ::Function, ::String) at ./meta.jl:164
[4] parse at ./meta.jl:164 [inlined]
[5] schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at /home/tan/.julia/dev/Parquet/src/schema.jl:230
[6] schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at /home/tan/.julia/dev/Parquet/src/schema.jl:224
[7] schema(::JuliaConverter, ::ParFile, ::Symbol) at /home/tan/.julia/dev/Parquet/src/reader.jl:66
[8] top-level scope at none:0
I realized that this is a legitimately difficult type of file to work with. When working with Azure, read_parquet_file(path=x) did not work, but read_parquet_dataset(path=x) did.
The documentation below mentions why they are different.
A Parquet Dataset is different from a Parquet file in that it could be a folder containing a number of Parquet files. It could also have a hierarchical structure that partitions the data by the value of a column. These more complex forms of Parquet data are commonly produced by Spark/Hive. read_parquet_dataset will read these more complex datasets using pyarrow, which handles complex Parquet layouts well. It will also handle single Parquet files, or folders full of only single Parquet files, though these are better read using read_parquet_file, as it doesn't use pyarrow for reading and should be significantly faster than using pyarrow.
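To make the file/dataset distinction concrete, here is a small standard-library sketch that walks a dataset folder and recovers the partition values from `column=value` directory names, the layout Spark and Hive typically produce. `list_dataset_files` is a hypothetical helper written for this illustration; it is not part of dprep, pyarrow, or Parquet.jl, and it only lists files rather than reading them.

```python
from pathlib import Path


def list_dataset_files(root):
    """Return (path, partitions) pairs for a single file or a dataset folder."""
    root = Path(root)
    if root.is_file():
        # A plain Parquet file: one entry, no partition columns.
        return [(root, {})]

    found = []
    for path in sorted(root.rglob("*.parquet")):
        partitions = {}
        # Every directory between the root and the file may encode a partition.
        for part in path.relative_to(root).parts[:-1]:
            key, sep, value = part.partition("=")
            if sep:  # only directories named like "column=value" count
                partitions[key] = value
        found.append((path, partitions))
    return found
```

For a folder such as `data/year=2019/month=07/part-0.parquet`, this yields the file path together with `{"year": "2019", "month": "07"}`. A reader that only understands single files never sees those partition columns, which is part of why the two functions behave differently.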
In addition, I have tested this file using multiple toolsets:
a. Julia’s parquet reader couldn’t handle it
b. Apache parquet viewer couldn’t handle it (error in reading from files with more than one row group)
c. Azure’s >> dprep.read_parquet_file(path=Location) couldn’t handle it
d. Azure’s >> dprep.read_parquet_dataset(path=Location) COULD handle it
e. PANDAS >> pd.read_parquet('TestFile.parquet') COULD handle it
I'm not sure if reading this type of parquet file is out of scope or not; however, because this is a format that a lot of people will probably be using in the future, this is probably a feature you would want to add.
Probably fixed by #42.
if reading this type of parquet file is out of scope or not;
My 2 cents: given that Int96 support is deprecated in Parquet, Parquet.jl is too resource-constrained to support it.
Int96 timestamps seem to be working fine now. Closing. Please reopen if this is still an issue.
In trying to work through your example with one of my own parquet files,
using Parquet
PQ = ParFile("MyFile.parquet")
schema(JuliaConverter(Main), PQ, :T_TREND)
rc = RecCursor(PQ, 1:5, colnames(PQ), JuliaBuilder(PQ, T_TREND))
I am running into this error (which I also run into when I use ParquetFiles.jl)
ERROR: LoadError: BoundsError: attempt to access 78497-element Array{Int128,1} at index [86164]
Stacktrace:
[1] getindex at .\array.jl:731 [inlined]
[2] #7 at .\none:0 [inlined]
[3] iterate at .\generator.jl:47 [inlined]
[4] collect_to! at .\array.jl:656 [inlined]
[5] collect_to_with_first!(::Array{Int128,1}, ::Int128, ::Base.Generator{Array{Int32,1},getfield(Parquet, Symbol("##7#8")){Array{Int128,1}}}, ::Int64) at .\array.jl:643
[6] collect at .\array.jl:624 [inlined]
[7] map_dict_vals(::Array{Int128,1}, ::Array{Int32,1}) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:160
[8] macro expansion at .\logging.jl:310 [inlined]
[9] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:180
[10] setrow(::ColCursor{Int128}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:144
[11] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:115
[12] (::getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64})(::String) at .\none:0
[13] iterate at .\generator.jl:47 [inlined]
[14] collect_to!(::Array{ColCursor{Array{UInt8,1}},1}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at .\array.jl:656
[15] collect_to_with_first!(::Array{ColCursor{Array{UInt8,1}},1}, ::ColCursor{Array{UInt8,1}}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64) at .\array.jl:643
[16] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}) at .\array.jl:624
[17] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{T_TREND}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:269 (repeats 2 times)
[18] top-level scope at none:0
in expression starting at C:\Users\XXX\Desktop\Analysis\Main.jl:15
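The trace fails inside map_dict_vals, whose signature shows a dictionary of Int128 values being indexed by an array of Int32 indices. In a dictionary-encoded Parquet column, the distinct values are stored once and the data pages hold 0-based indices into that dictionary. The Python sketch below shows that lookup with an explicit bounds check that reports the offending index. It is a hypothetical illustration that borrows the function name from the trace; it is not Parquet.jl's implementation.

```python
def map_dict_vals(dictionary, indices):
    """Expand dictionary-encoded indices into values, checking bounds first."""
    size = len(dictionary)
    for position, idx in enumerate(indices):
        # Parquet dictionary indices are 0-based and must be < dictionary size.
        if not 0 <= idx < size:
            raise IndexError(
                f"index {idx} at position {position} is outside a "
                f"{size}-entry dictionary"
            )
    return [dictionary[idx] for idx in indices]
```

For example, `map_dict_vals([10, 20, 30], [2, 0, 0, 1])` returns `[30, 10, 10, 20]`. An index such as 86164 against a 78497-entry dictionary, as in the error above, suggests either that the dictionary was decoded with too few entries or that the indices were read against the wrong dictionary page.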
Now I believe this might have something to do with the fact that one of the columns is formatted as an Int96. By using schema I found this
In addition, I'm not sure what dataset sizes this tool was tested on. My Parquet file has 20+ million rows.