Thinking about this more, it sounds like the file cache is not thread-safe.
Attached is a reliable replication of the problem. I am running Julia 1.6.3 and Parquet:
name = "Parquet"
uuid = "626c502c-15b0-58ad-a749-f091afb673ae"
keywords = ["parquet", "julia", "columnar-storage"]
license = "MIT"
desc = "Julia implementation of parquet columnar file format reader and writer"
version = "0.8.3"
With threads (julia -t 10):

    include("parquet-threads.jl")
    tryit(false)   # all is good
    tryit(true)    # things break

Without threads (plain julia):

    include("parquet-threads.jl")
    tryit(true)    # the code is trying to use threads, but there is only one thread, so it works

The only difference here is writing parquet files from multiple threads. Ergo, there is a thread-safety problem in Parquet.write_parquet.
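The attached parquet-threads.jl is not reproduced here; a rough sketch of a test along the lines described above (the data, the file count, and the shape of tryit are placeholders, not the attached script) would be:

using Parquet, DataFrames

function tryit(threaded::Bool)
    dir = mktempdir()
    # Write one small parquet file with placeholder data.
    write_one(i) = write_parquet(joinpath(dir, "data$i.parquet"),
                                 DataFrame(x = rand(1000), y = rand(Int, 1000)))
    if threaded
        Threads.@threads for i in 1:100
            write_one(i)
        end
    else
        for i in 1:100
            write_one(i)
        end
    end
    # Read every file back; a corrupted file fails to parse here.
    for i in 1:100
        DataFrame(read_parquet(joinpath(dir, "data$i.parquet")))
    end
    return dir
end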
Just an update here. Tanmay has isolated the problem to the handling of the Thrift metadata and has produced this minimal reproducible example:
using Thrift
using Parquet
using Base.Threads

# Read the FileMetaData once from the attached meta.data file.
const meta = open("meta.data", "r") do f
    Thrift.read(Thrift.TCompactProtocol(Thrift.TFileTransport(f)), Parquet.PAR2.FileMetaData)
end

const dir = mktempdir()

# Write the same metadata object from many threads at once.
Threads.@threads for i in 1:100
    fname = joinpath(dir, "data$i")
    open(fname, "w") do io
        Thrift.write(Thrift.TCompactProtocol(Thrift.TFileTransport(io)), meta)
    end
end

# Read each copy back to check that it still parses.
for i in 1:100
    fname = joinpath(dir, "data$i")
    open(fname, "r") do io
        Thrift.read(Thrift.TCompactProtocol(Thrift.TFileTransport(io)), Parquet.PAR2.FileMetaData)
    end
end
Here is the metadata file required (compressed because of GitHub attachment limitations): meta.data01.zip
This should be fixed now in Thrift.jl v0.8.3, after https://github.com/tanmaykm/Thrift.jl/pull/68
I confirm that after pulling in the 0.8.3 version of Thrift, my threading test succeeds.
Is there a new release tag coming soon?
Thrift.jl v0.8.3 is already released with the fix, and Parquet.jl's compatibility bound is set to v0.8, so a Pkg.update() to pull in the new version of Thrift is all that is needed. We do not need a new release of Parquet.
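For example (Pkg.status is just to confirm the resolved version):

using Pkg
Pkg.update()            # resolves Thrift.jl up to v0.8.3 within Parquet.jl's existing compat bounds
Pkg.status("Thrift")    # confirm the installed Thrift version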
Ahh... of course!
Well played.
I have code that writes several thousand parquet files. The general pattern is like this:
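Sketched here with placeholder data and paths (write_parquet is Parquet.jl's writer; the real column contents, file count, and output directory differ):

using Parquet, DataFrames

outdir = mktempdir()                                           # placeholder for the real output directory
for i in 1:2000
    df = DataFrame(x = rand(10_000), y = rand(Int, 10_000))    # placeholder data
    write_parquet(joinpath(outdir, "part-$i.parquet"), df)
end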
Written this way, the code never seems to produce ill-formed files. On the other hand, I find a significant fraction of the files to be corrupted if I change the loop to this:
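That is, the same loop run from multiple threads, e.g. wrapped in Threads.@threads (again with placeholder data):

Threads.@threads for i in 1:2000
    df = DataFrame(x = rand(10_000), y = rand(Int, 10_000))    # placeholder data
    write_parquet(joinpath(outdir, "part-$i.parquet"), df)
end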
The symptoms of corruption are widely variable; different corrupted files produce different errors when read back.
Examining the Parquet sources, I find that these errors indicate conditions that should be impossible (assuming a well-formed input file).
Thoughts?