Closed — yalwan-iqvia closed this issue 4 years ago
Actually I can add one more piece of information -- the columns which decoded correctly had no missing values. Also this:
julia> findmax(sum(z .!= 0.0; dims = 2) .> 0)
(true, CartesianIndex(240654, 1))
julia>
The above shows that for each column the data stays the same up to at least row 240654, but this is with NaNs replaced.
So let's do this:
julia> submat = Float64.(coalesce.(Matrix(pqdata), nothing)) - Float64.(Matrix(jrefdata))
julia> findmax(sum((submat .!= 0.0) .& (.!isnan.(submat)); dims = 2) .> 0)
(true, CartesianIndex(240654, 1))
julia>
Seems like the same answer -- so, to guard against the red herring above (where the first two columns were all correct in the absence of missing data, and all data for the other columns was wrong), it seems I did a silly analysis there.
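The same NaN-aware "first divergent row" check can be sketched in Python/numpy (the arrays `pq` and `ref` below are small hypothetical stand-ins for the decoded parquet matrix and the reference matrix):

```python
import numpy as np

# Hypothetical stand-ins for the decoded parquet data and the reference data.
pq = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 99.0]])
ref = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])

diff = pq - ref
# A cell counts as a mismatch only if the difference is nonzero AND not NaN
# (NaN - NaN is NaN, so aligned missing values are ignored, mirroring the
# `.!isnan.(submat)` mask in the Julia snippet above).
mismatch_rows = np.any((diff != 0.0) & ~np.isnan(diff), axis=1)
first_bad_row = int(np.argmax(mismatch_rows))  # index of first divergent row
print(first_bad_row)  # 2
```

As in the Julia version, rows where the only differences are aligned NaNs are treated as matching, so the first flagged row is the first genuine value divergence.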
You can actually use the following python code to generate some data which will reproduce this issue:
import numpy.random
import pandas
rand = numpy.random.randn(2_000_000, 3)
x = pandas.DataFrame(rand)
for col in x.columns:
    x.loc[x.sample(frac=0.8).index, col] = numpy.nan
x.columns = ['a', 'b', 'c']
x.to_parquet('mydata.parquet')
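As a quick sanity check on the generated frame (a sketch; run before writing the parquet file), you can confirm that roughly 80% of each column is missing, which is the condition that triggers the bug:

```python
import numpy
import pandas

rand = numpy.random.randn(10_000, 3)  # smaller than the 2M rows above, for speed
x = pandas.DataFrame(rand)
for col in x.columns:
    # sample(frac=0.8) picks a random 80% of rows; those cells become NaN
    x.loc[x.sample(frac=0.8).index, col] = numpy.nan
x.columns = ['a', 'b', 'c']

nan_frac = x.isna().mean()
print(nan_frac)  # each column should be 0.8 (sample(frac=0.8) picks exactly 80%)
```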
Hello @tanmaykm -- I'm pinging you directly because it looks like you're a maintainer for this project. Could you possibly comment on when you might be able to get around to taking a look at this?
@yalwan-iqvia are you able to provide some funding or request funding from your organisation? I wrote the parquet writer and have some familiarity with the parquet code. Please DM me to discuss further if you wish.
@yalwan-iqvia could you try the master branch and see if it works for you?
@yalwan-iqvia Actually it does seem fine on the master branch.
I created a test parquet file using the code posted by you. I could then read it from Julia and exported that into a CSV. Similarly I read the same parquet file using pandas and exported that into a CSV as well. Comparing cell by cell, both CSV files did turn out to be identical.
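The cell-by-cell comparison described above can be sketched in Python (file names and data here are hypothetical; this assumes both CSVs were written with identical formatting, so a plain string comparison suffices):

```python
import csv
import os
import tempfile

def first_csv_difference(path_a, path_b):
    # Compare two CSV files cell by cell; return the first differing
    # (row, column) position, or None if the files are identical.
    with open(path_a, newline='') as fa, open(path_b, newline='') as fb:
        for i, (row_a, row_b) in enumerate(zip(csv.reader(fa), csv.reader(fb))):
            for j, (a, b) in enumerate(zip(row_a, row_b)):
                if a != b:
                    return (i, j)
    return None

# Tiny demonstration with two inline files (made-up data).
with tempfile.TemporaryDirectory() as d:
    p1, p2 = os.path.join(d, 'a.csv'), os.path.join(d, 'b.csv')
    with open(p1, 'w') as f:
        f.write('1,2\n3,4\n')
    with open(p2, 'w') as f:
        f.write('1,2\n3,5\n')
    result = first_csv_difference(p1, p2)
    print(result)  # (1, 1)
```

A string-level comparison sidesteps float-parsing questions entirely: if the two exporters printed the same values, the cells match byte for byte.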
@tanmaykm Thanks for the response!
I just tested locally with some large-ish data and it does indeed look good. I will do some more tests with larger data sets and report back.
Is there a plan for a new version release soon?
Looks good with 5m rows x 95 cols (pushing the boundaries of what I could test locally)
Great! I think we can tag a release now, maybe this week.
I try to give as much information as possible. Unfortunately I am not able to supply the data file which produces this issue (proprietary and confidential), but I will try to produce a garbage file which can reproduce it.
Julia version = 1.4.1 Parquet.jl version = v0.6.1
I have used a program like this to assist with understanding the issue:
To summarise, I found that NaNs/missing data (Pandas' fault) were in the same places, but the actual values started to skew off. I can give statistics on the data (which might help you build an understanding, and will help me later to build a reproducing case).
I want to note that floating-point precision issues aren't a possible cause here (firstly because the encoded/decoded outputs should match exactly, and secondly because the actual underlying pqfile is all int32s, rather than actual floats, so the upcasting and subtractions should yield exact zeros).
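To illustrate that point (a minimal sketch with made-up values, not the original data): every int32 is exactly representable as a 64-bit float, so upcasting matching integers and subtracting yields exact zeros, never tiny residuals:

```python
import numpy as np

# Extreme int32 values plus an ordinary one; b is an identical copy.
a = np.array([2147483647, -2147483648, 12345], dtype=np.int32)
b = a.copy()

diff = a.astype(np.float64) - b.astype(np.float64)
# A float64 has a 53-bit mantissa, more than enough for any 32-bit integer,
# so equal int32 inputs subtract to exactly 0.0 after upcasting.
print(np.all(diff == 0.0))  # True
```

This is why a nonzero entry in the difference matrix can be taken as a genuine decoding error rather than rounding noise.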
Just to demonstrate this:
This also demonstrates in particular that every column encoded after the second is wrong in all its nonzero values.
I am not sure what more information I can provide at this point to help, but I would like to help as much as I can, so if you would have any questions do let me know and I will try to respond quickly.