JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Reading missings is twice as slow as reading values #129

Closed cstjean closed 4 years ago

cstjean commented 4 years ago
julia> using DataFrames, Feather, BenchmarkTools

julia> N = 100_000_000;

julia> df1 = DataFrame(x=Union{Float32, Missing}[missing for _ in 1:N]);

julia> df2 = DataFrame(x=Union{Float32, Missing}[1.1 for _ in 1:N]);

julia> Feather.write("test1.feather", df1);

julia> Feather.write("test2.feather", df2);

julia> @btime Feather.materialize("test1.feather");
  1.028 s (436 allocations: 953.69 MiB)

julia> @btime Feather.materialize("test2.feather");
  435.714 ms (421 allocations: 762.96 MiB)
cstjean commented 4 years ago
julia> typeof(Feather.materialize("test2.feather").x)
Array{Float32,1}

I'd forgotten that Feather forgets about missings when there aren't any. That explains it...
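
For readers who need the element type to be the same whether or not a file happened to contain missings, the column can be widened after reading. This is only a minimal sketch using Base; `widen_missing` is a hypothetical helper and not part of Feather.jl.

```julia
# Hypothetical helper (not part of Feather.jl): return a vector whose element
# type is Union{T,Missing}, whether or not the input already allows missings.
widen_missing(x::AbstractVector{T}) where {T} = convert(Vector{Union{T,Missing}}, x)

plain = Float32[1.1, 2.2, 3.3]      # what a column without missings looks like
widened = widen_missing(plain)      # now a Vector{Union{Float32,Missing}}
```

Note that widening copies the data, so it adds its own cost on top of the read.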

Reading a 50% mix is about 40% slower than reading all-missing (1.434 s vs. 1.028 s):

julia> df5 = DataFrame(x=Union{Float32, Missing}[rand()<0.5 ? missing : 1.1 for _ in 1:N]);

julia> Feather.write("test5.feather", df5);

julia> @btime Feather.materialize("test5.feather");
  1.434 s (436 allocations: 953.69 MiB)

However, that could be explained by poor branch prediction, I suppose? Maybe there isn't anything concrete to be done; I know your code is already highly optimized.

ExpandingMan commented 4 years ago

There is a lot more overhead for reading and writing arrays with missings than arrays without. This is just because of how the Arrow format works.
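
To illustrate where the extra work comes from: in the Arrow memory layout, a nullable column is a values buffer plus a validity bitmap (one bit per element, least-significant bit first, 1 meaning "valid"). The sketch below is not Feather.jl's actual code; `decode_nullable` and `decode_plain` are hypothetical names chosen for illustration.

```julia
# Hypothetical sketch (not Feather.jl's implementation) of decoding an
# Arrow-style nullable column from a values buffer and a validity bitmap.
function decode_nullable(vals::Vector{T}, bitmap::Vector{UInt8}) where {T}
    n = length(vals)
    length(bitmap) * 8 >= n || throw(ArgumentError("bitmap too short"))
    out = Vector{Union{T,Missing}}(undef, n)
    for i in 1:n
        byte = bitmap[((i - 1) >> 3) + 1]        # byte that holds the bit for element i
        valid = (byte >> ((i - 1) & 7)) & 0x01   # bits are numbered from the least-significant end
        out[i] = valid == 0x01 ? vals[i] : missing
    end
    return out
end

# A column without nulls has no bitmap to consult: the buffer is copied as-is.
decode_plain(vals::Vector{T}) where {T} = copy(vals)

vals = Float32[1.1, 0.0, 3.3, 0.0, 5.5]
bitmap = UInt8[0b00010101]          # bits 0, 2 and 4 set: elements 1, 3 and 5 are valid
col = decode_nullable(vals, bitmap)
```

The nullable path does a bit extraction and a branch per element, while the plain path is a single bulk copy, which is one plausible reason for the gap in the timings above.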

I wouldn't say anything here is "highly optimized", but I have done lots of basic performance sanity checks (for reading at least). Reading arrays without missings is extremely simple, and is therefore pretty much guaranteed to be maximally efficient. Reading arrays with missings is a lot more complicated, so it's much harder for me to state with any confidence whether it's close to saturating the theoretical upper limit on performance.

I'm not entirely sure why reading all missings is faster, but it may have something to do with the Julia type system (since the eltype here is Union{Missing,T}).
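
As background on the type-system side, here is a small sketch of how Julia treats this element type. It relies on `Base.isbitsunion`, which is internal and unexported, so treat it as an assumption that it is available in your Julia version; it does not by itself explain the timing difference.

```julia
T = Union{Float32, Missing}

# Internal, unexported check: true means arrays of T store their elements
# inline, with an extra type-tag byte per element.
inline = Base.isbitsunion(T)

plain = fill(1.1f0, 1_000)
nullable = Vector{T}(plain)         # same values, wider element type

# Rough memory footprint of each array in bytes, for comparison.
sizes = (Base.summarysize(plain), Base.summarysize(nullable))
```

Writing an element of a `Vector{Union{Float32,Missing}}` therefore touches both the data and its type tag, which is extra work compared with a plain `Vector{Float32}`.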

Of course, I'd always be happy to improve performance if possible; specific suggestions and PRs are welcome. That said, reading arrays with missings will never be as fast as reading those without, so I don't actually see an issue here. Feel free to re-open this if there is a specific performance problem.