JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Corrupt reads with large data files #105

Closed · yalwan-iqvia closed this 4 years ago

yalwan-iqvia commented 4 years ago

I will try to give as much information as possible. Unfortunately I am not able to supply the data file which produces this issue (it is proprietary and confidential), but I will try in the coming days to produce a garbage file which reproduces it.

Julia version = 1.4.1
Parquet.jl version = v0.6.1

I have used a program like this to assist with understanding the issue:

import PyCall
import Parquet
import DataFrames

np = PyCall.pyimport("numpy")
pandas = PyCall.pyimport("pandas")
parquet = PyCall.pyimport("pyarrow.parquet")

function pd_to_df(df_pd)
    # Convert a pandas DataFrame (via PyCall) to a Julia DataFrame, column by column
    df = DataFrames.DataFrame()
    for col in df_pd.columns
        df[!, Symbol(col)] = getproperty(df_pd, col).values
    end
    df
end

function raw_parquet(path::String; logical_map=true)::DataFrames.DataFrame
    p = Parquet.ParFile(path; map_logical_types = logical_map)
    colcursor = Parquet.BatchedColumnsCursor(p)
    # Create a DataFrame per batch and bring the DataFrames together
    df = reduce(vcat, DataFrames.DataFrame.(colcursor))
    return df
end

pqdata = raw_parquet("mydata.parquet")
refdata = pandas.read_parquet("mydata.parquet")
jrefdata = pd_to_df(refdata)

# Type piracy, but convenient here: map nothing (from coalesce) to NaN
Core.Float64(x::Nothing) = NaN

# empties are in the same place
isnan.(Float64.(coalesce.(Matrix(pqdata), nothing))) == isnan.(Float64.(Matrix(jrefdata)))

z = Float64.(coalesce.(Matrix(pqdata), nothing)) - Float64.(Matrix(jrefdata))
z[isnan.(z)] .= 0.0

To summarise, I found that the NaNs/missing values (a pandas artifact) were in the same places, but the actual values started to skew off. I can give statistics on the data (which might help you build an understanding, and will help me later to build a reproducing case):

julia> sum(z .!= 0.0; dims = 1)
1×281 Array{Int64,2}:
 0  0  80072  80023  79068  311913  313816  309042  6590  6240  5987  73731  73474  71966  2620  2615  2590  …  433321  436259  428708  28396  27692  25520  59024  58439  58015  177502  177212  175058  18251  18206  18107
julia> size(z)
(1443915, 281)
julia> sum(sum(z .!= 0.0; dims = 1) .> 0)
279
shell> du -sh mydata.parquet
110M    mydata.parquet
julia> sum(.!isnan.(Float64.(Matrix(jrefdata)))) / prod(size(z))
0.0876967834447427
julia> sum(z .!= 0.0) / prod(size(z))
0.0658725672220012
julia>

I want to note that floating-point precision issues aren't a possible cause here (firstly because the encoded/decoded outputs should match exactly, and secondly because the actual underlying parquet file is all int32s rather than actual floats, so the upcasting and subtractions should yield exact zeros).

Just to demonstrate this:

julia> zz = copy(z);
julia> zz[zz .== 0] .= 1e6; # set a big number to not affect minimums
julia> minimum(abs.(zz); dims=1)
1×281 Array{Float64,2}:
 1.0e6  1.0e6  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  2.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
julia> minimum(minimum(abs.(zz); dims=1))
1.0
julia>

In particular, this also demonstrates that every column after the second has wrong values, and that the mismatches are off by whole integers (at least 1), not by rounding error.

I am not sure what more information I can provide at this point, but I would like to help as much as I can, so if you have any questions, do let me know and I will try to respond quickly.

yalwan-iqvia commented 4 years ago

Actually, I can add one more piece of information: the columns which decoded correctly had no missing values. Also this:

julia> findmax(sum(z .!= 0.0; dims = 2) .> 0)
(true, CartesianIndex(240654, 1))
julia> 

The above shows that for each column the data stays the same up to at least row 240654, but this is with NaNs replaced.

So let's do this:

julia> submat = Float64.(coalesce.(Matrix(pqdata), nothing)) - Float64.(Matrix(jrefdata))
julia> findmax(sum((submat .!= 0.0) .& (.!isnan.(submat)); dims = 2) .> 0)
(true, CartesianIndex(240654, 1))
julia> 

Seems like the same answer. So, regarding the red herring above, where the first two columns (which have no missing data) were all correct and all the data in the other columns was wrong: it seems I did a sloppy analysis there.
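
For anyone digging further, here is a minimal sketch (assuming the pqdata and jrefdata frames from above) which finds, per column, the first row where the two readers disagree, treating missing/NaN on both sides as agreement:

# Sketch: first disagreeing row per column; nothing means the column matches
function first_mismatch(col_pq, col_ref)
    for i in eachindex(col_pq)
        a, b = col_pq[i], col_ref[i]
        a_na = ismissing(a) || (a isa AbstractFloat && isnan(a))
        b_na = ismissing(b) || (b isa AbstractFloat && isnan(b))
        a_na && b_na && continue   # both empty: agreement
        (a_na != b_na || a != b) && return i
    end
    return nothing
end

[first_mismatch(pqdata[!, n], jrefdata[!, n]) for n in names(pqdata)]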

yalwan-iqvia commented 4 years ago

You can actually use the following Python code to generate some data which will reproduce this issue:

import numpy.random
import pandas

# 2M rows x 3 columns of random floats; then set 80% of each column to NaN
rand = numpy.random.randn(2_000_000, 3)
x = pandas.DataFrame(rand)
for col in x.columns:
    x.loc[x.sample(frac=0.8).index, col] = numpy.nan

x.columns = ['a', 'b', 'c']
x.to_parquet('mydata.parquet')
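
For reference, a quick check of the generated file from the Julia side might look like this (a sketch, reusing the raw_parquet, pd_to_df, and Core.Float64(::Nothing) helpers from the first comment):

# Sketch: count cells where Parquet.jl and pandas disagree on the generated file
pqdata   = raw_parquet("mydata.parquet")
jrefdata = pd_to_df(pandas.read_parquet("mydata.parquet"))

z = Float64.(coalesce.(Matrix(pqdata), nothing)) - Float64.(Matrix(jrefdata))
z[isnan.(z)] .= 0.0
sum(z .!= 0.0)   # should be 0 for a correct reader
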
yalwan-iqvia commented 4 years ago

Hello @tanmaykm -- I'm pinging you directly because it looks like you're a maintainer for this project. Could you possibly comment on when you might be able to get around to taking a look at this?

xiaodaigh commented 4 years ago

@yalwan-iqvia are you able to provide some funding, or request funding from your organisation? I wrote the parquet writer and have some familiarity with the parquet code. Please DM me to discuss further if you wish.

tanmaykm commented 4 years ago

@yalwan-iqvia could you try the master branch and see if it works for you?
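
For reference, the development version can be installed from the Pkg REPL (press ] at the julia> prompt):

pkg> add Parquet#master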

tanmaykm commented 4 years ago

@yalwan-iqvia Actually it does seem fine on the master branch.

I created a test parquet file using the code you posted. I then read it from Julia and exported it to a CSV. Similarly, I read the same parquet file using pandas and exported that to a CSV as well. Comparing cell by cell, the two CSV files turned out to be identical.
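
A sketch of that kind of round-trip check (assuming the CSV.jl package on the Julia side; file names are illustrative):

import Parquet, DataFrames, CSV

# Read with Parquet.jl and dump to CSV
p = Parquet.ParFile("mydata.parquet"; map_logical_types = true)
df = reduce(vcat, DataFrames.DataFrame.(Parquet.BatchedColumnsCursor(p)))
CSV.write("julia_read.csv", df)

# From Python, dump the pandas read:
#   pandas.read_parquet("mydata.parquet").to_csv("pandas_read.csv", index=False)
# then compare the two files cell by cell.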

yalwan-iqvia commented 4 years ago

@tanmaykm Thanks for the response!

I just tested locally with some large-ish data and it does indeed look good. I will do some more tests with larger data sets and report back.

Is there a plan for a new version release soon?

yalwan-iqvia commented 4 years ago

Looks good with 5M rows x 95 cols (pushing the boundaries of what I could test locally).

tanmaykm commented 4 years ago

Great! I think we can tag a release now, maybe this week.