Closed — xgdgsc closed this issue 4 years ago
I got the same issue using ParquetFiles 0.2.0 and latest master of Parquet.
ParquetFiles:
using ParquetFiles
load(file)
Error showing value of type ParquetFiles.ParquetFile:
ERROR: Internal error. Incorrect state 8. Expected: (0, 6, 7)
Using just Parquet I can load the file:
using Parquet
ParFile(file)
Parquet file: yadada
version: 1
nrows: 168000
created by: parquet-cpp version 1.4.0
cached: 0 column chunks
Maybe the issue belongs to ParquetFiles? I don't understand enough of the stack trace to determine.
Trying to create a cursor like this produces the same error (from line [20] in the stack trace of the OP):
p = ParFile(file)
schema(JuliaConverter(Main), p, :Customer)
rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Customer))
The reader encountered something unexpected. It needs digging into the reader to figure out what it expected to read just before the error.
Thanks,
I can't share the file I used. I tried a few sample files but could not reproduce the error.
Btw, I'm using Windows 10 and Julia 1.1, so I might be in "not tried-and-true" territory. For instance, I saw an issue in the Thrift repo about Windows support.
I managed to read some of the data using Spark.jl, but I had to catch quite a few NPEs, so maybe there is something strange with my file. Afaik, it does read without errors in Python though.
In case someone wants to try my workaround:
using Spark, JavaCall
Spark.init()
function read_data(file, colfilter=identity)
    # Read the parquet file through Spark and collect all rows on the driver.
    ds = read_parquet(SparkSession(), file)
    cns = colfilter(colnames(ds))
    jrows = jcall(ds.jdf, "collectAsList", Spark.JList, ())
    data = [Array{Any,1}() for ignore in cns]
    for jrow in Spark.JavaCall.iterator(jrows)
        try
            # Enumerate so data is indexed 1:length(cns) even when
            # colfilter selects a subset of the columns.
            for (di, i) in enumerate(colfilter(1:length(jrow)))
                push!(data[di], Spark.native_type(Spark.narrow(jrow[i])))
            end
        catch err
            # Some rows in my file throw NPEs; just log and skip them.
            println("Got exception: ", err)
        end
    end
    return NamedTuple{Tuple(cns)}(data)
end
function colnames(dataset::Dataset)
    jcolsnames = jcall(dataset.jdf, "columns", Array{JString, 1}, ())
    return map(x -> Symbol(Spark.native_type(x)), jcolsnames)
end
I wonder if the issue still persists after the latest update?
I loaded a file and it works, but I'm not sure it was the same file I loaded when I filed this issue.
In that case I would recommend closing it; if the problem reappears, open a new issue instead of keeping this one open indefinitely.
I think I have the same set of files laying around. I'll give it a spin when I'm at the computer.
I still get the same error.
Fwiw I think the files which cause the error are corrupt somehow.
I have managed to read other files generated from the same data source without problems, and as I said above, when using Spark.jl I had to go row by row and catch a few null pointer exceptions in the process of reading the file.
For all I know, pandas could just have the same silent try/catch built in. Perhaps that would be a nice feature to have though: some kind of ignore_errors flag. Spinning up Spark as in my workaround is a bit of a hassle...
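For illustration, an ignore_errors-style option could look like the sketch below. This is a hypothetical helper, not an existing Parquet.jl or ParquetFiles API; read_rows and parse_row are made-up names:

```julia
# Hypothetical sketch of an `ignore_errors` option: apply `parse_row`
# to each raw row, and either rethrow on failure (the default) or
# silently skip the bad row and count it.
function read_rows(parse_row, rawrows; ignore_errors::Bool=false)
    out = Any[]
    skipped = 0
    for raw in rawrows
        try
            push!(out, parse_row(raw))
        catch
            ignore_errors || rethrow()
            skipped += 1  # swallow the error and move on to the next row
        end
    end
    return out, skipped
end

# Toy usage: "1" and "3" parse as integers, "oops" does not.
rows, skipped = read_rows(r -> parse(Int, r), ["1", "oops", "3"]; ignore_errors=true)
# rows == [1, 3], skipped == 1
```

Something along these lines would make corrupt-row recovery possible without dragging in Spark, at the cost of silently dropping data unless the skipped count is surfaced to the caller.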
Anyways, I won't complain if you close the issue.
> I think I have the same set of files
Are the files shareable? I wouldn't mind digging into it using the low-level constructs to see if it's broken or corrupted.
Thanks. Unfortunately I can't share the files.
If there is anything you want me to check with them, I can try to do so, but I understand if you feel it might be too inefficient to be worth doing.
I think it was really just an issue of the pipeline having some kinks in it which were later fixed.
I guess we can close for now.
When loading a file, I initially saw this issue: https://github.com/JuliaIO/Parquet.jl/issues/29. After cloning latest master I see the following error:
I don't know what to do. Please give me some advice.