JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Internal error. Incorrect state 8. Expected: (0, 6, 7) #31

Closed xgdgsc closed 4 years ago

xgdgsc commented 5 years ago

When loading a file:

using DataFrames
using ParquetFiles

df = DataFrame(load("data.parquet"))  # placeholder path; my real file triggers the error

Initially I hit https://github.com/JuliaIO/Parquet.jl/issues/29. After cloning the latest master I see the following error:

Internal error. Incorrect state 8. Expected: (0, 6, 7)

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] chkstate at /home/gsc/.julia/packages/Thrift/hqiAN/src/protocols.jl:223 [inlined]
 [3] readStructBegin(::Thrift.TCompactProtocol) at /home/gsc/.julia/packages/Thrift/hqiAN/src/protocols.jl:395
 [4] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.DictionaryPageHeader) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:172
 [5] read at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [6] read(::Thrift.TCompactProtocol, ::Type{Parquet.PAR2.DictionaryPageHeader}) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:167
 [7] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.PageHeader) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:194
 [8] read at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [9] read_thrift(::IOStream, ::Type{Parquet.PAR2.PageHeader}) at /home/gsc/.julia/dev/Parquet/src/reader.jl:324
 [10] _pagevec(::Parquet.ParFile, ::Parquet.PAR2.ColumnChunk) at /home/gsc/.julia/dev/Parquet/src/reader.jl:124
 [11] #5 at /home/gsc/.julia/dev/Parquet/src/reader.jl:137 [inlined]
 [12] cacheget(::Parquet.PageLRU, ::Parquet.PAR2.ColumnChunk, ::getfield(Parquet, Symbol("##5#6")){Parquet.ParFile}) at /home/gsc/.julia/dev/Parquet/src/reader.jl:26
 [13] pages at /home/gsc/.julia/dev/Parquet/src/reader.jl:137 [inlined]
 [14] values(::Parquet.ParFile, ::Parquet.PAR2.ColumnChunk) at /home/gsc/.julia/dev/Parquet/src/reader.jl:166
 [15] setrow(::Parquet.ColCursor{Float64}, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:144
 [16] Parquet.ColCursor(::Parquet.ParFile, ::UnitRange{Int64}, ::String, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:115
 [17] (::getfield(Parquet, Symbol("##11#12")){Parquet.ParFile,UnitRange{Int64},Int64})(::String) at ./none:0
 [18] iterate at ./generator.jl:47 [inlined]
 [19] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){Parquet.ParFile,UnitRange{Int64},Int64}}) at ./array.jl:606
 [20] Parquet.RecCursor(::Parquet.ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::Parquet.JuliaBuilder{ParquetFiles.RCType363}, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:269 (repeats 2 times)
 [21] getiterator(::ParquetFiles.ParquetFile) at /home/gsc/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:74
 [22] columns at /home/gsc/.julia/packages/Tables/8f4rT/src/fallbacks.jl:153 [inlined]
 [23] DataFrame(::ParquetFiles.ParquetFile) at /home/gsc/.julia/packages/DataFrames/IKMvt/src/other/tables.jl:21
 [24] top-level scope at In[3]:1

I don't know what to do. Please give me some advice.

DrChainsaw commented 5 years ago

I got the same issue using ParquetFiles 0.2.0 and latest master of Parquet.

ParquetFiles:

using ParquetFiles
load(file)
Error showing value of type ParquetFiles.ParquetFile:
ERROR: Internal error. Incorrect state 8. Expected: (0, 6, 7)

Using just Parquet I can load the file:

using Parquet
ParFile(file)
Parquet file: yadada
    version: 1
    nrows: 168000
    created by: parquet-cpp version 1.4.0
    cached: 0 column chunks

Maybe the issue belongs in ParquetFiles? I don't understand enough of the stack trace to tell.

DrChainsaw commented 5 years ago

Trying to create a cursor like this produces the same error (matching frame [20] in the OP's stack trace). The schema call generates the Customer type in Main, which JuliaBuilder then uses:

p = ParFile(file)
schema(JuliaConverter(Main), p, :Customer)
rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Customer))
tanmaykm commented 5 years ago

The reader encountered something unexpected in the file. Figuring out the cause needs digging into the reader to see what it expected to read just before the error.
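
For anyone digging, a rough sketch of how one might narrow down the failing column chunk, using the internal functions visible in the stack trace above (rowgroups and pages are Parquet.jl internals and may change between versions; the path is a placeholder):

using Parquet

p = ParFile("data.parquet")
for rg in Parquet.rowgroups(p)
    for cc in rg.columns          # the raw thrift RowGroup holds its ColumnChunks
        try
            Parquet.pages(p, cc)  # frame [13] above; forces the page header reads
        catch err
            # the failing chunk identifies which column's page header was unexpected
            println(cc.meta_data.path_in_schema, " => ", err)
        end
    end
end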

DrChainsaw commented 5 years ago

Thanks,

I can't share the file I used. I tried a few sample files but could not reproduce the error.

Btw, I'm using Windows 10 and Julia 1.1, so I might be in "not tried-and-true" territory. For instance, I saw an issue in the Thrift repo about Windows support.

I managed to read some of the data using Spark.jl, but I had to catch quite a few NPEs, so maybe there is something strange with my file. AFAIK it reads without errors in Python, though.

In case someone wants to try my workaround:

using Spark, JavaCall
Spark.init()

function read_data(file, colfilter=identity)
    # colfilter is applied to both the column names and the column indices,
    # so it must treat the two collections consistently (e.g. cols -> cols[1:3])
    ds = read_parquet(SparkSession(), file)
    cns = colfilter(colnames(ds))

    # collect all rows on the driver as a Java List and convert value by value
    jrows = jcall(ds.jdf, "collectAsList", Spark.JList, ())
    data = [Vector{Any}() for _ in cns]
    for jrow in Spark.JavaCall.iterator(jrows)
        try
            # enumerate keeps destination indices aligned with the filtered name list
            for (di, i) in enumerate(colfilter(1:length(jrow)))
                push!(data[di], Spark.native_type(Spark.narrow(jrow[i])))
            end
        catch err
            # swallow per-row errors (the NPEs mentioned above) and keep going
            println("Got exception: ", err)
        end
    end
    return NamedTuple{Tuple(cns)}(Tuple(data))
end

function colnames(dataset::Dataset)
    jcolsnames = jcall(dataset.jdf, "columns", Array{JString, 1}, ())
    return map(x -> Symbol(Spark.native_type(x)), jcolsnames)
end
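
For example (the file name and column filter here are made up):

tbl = read_data("data.parquet", cols -> cols[1:3])  # NamedTuple of the first three columns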
xiaodaigh commented 4 years ago

I wonder if the issue still persists after the latest update?

xgdgsc commented 4 years ago

I loaded a file and it works, but I'm not sure it was the same file I was loading when I filed this issue.

xiaodaigh commented 4 years ago

In that case I would recommend closing this, and opening a new issue if one comes up, instead of keeping an issue open indefinitely.

DrChainsaw commented 4 years ago

I think I have the same set of files lying around. I'll give it a spin when I'm at the computer.

DrChainsaw commented 4 years ago

I still get the same error.

FWIW, I think the files which cause the error are somehow corrupt.

I have managed to read other files generated from the same data source without problems, and as I said above, when using Spark.jl I had to go row by row and catch a few null pointers to read this one.

For all I know, pandas could just have the same silent try/catch built in. Perhaps that would be a nice feature to have, though: some kind of ignore_errors flag. Spinning up Spark as in my workaround is a bit of a hassle...

Anyways, I won't complain if you close the issue.

xiaodaigh commented 4 years ago

I think I have the same set of files

Are the files shareable? I wouldn't mind digging into it using the low-level constructs to see if it's broken or corrupted.

DrChainsaw commented 4 years ago

Thanks. Unfortunately I can't share the files.

If there is anything you want me to check with them I can try to do so, but I understand if you feel it might be too inefficient to be worth doing.

I think it was really just an issue of the pipeline having some kinks in it which were later fixed.

xgdgsc commented 4 years ago

I guess we can close for now.