JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

String columns with missing values are read back as `Int32` #82

Closed xiaodaigh closed 4 years ago

xiaodaigh commented 4 years ago

In Python:

import pandas as pd
import numpy as np

pd.DataFrame({"a":["abc", np.nan, "def"]}).to_parquet("somewhere.parquet")

In Julia, on the master branch:

using Parquet

# open the file written by the Python snippet
pf = ParFile("somewhere.parquet")

# the file is very small so only one rowgroup
col_chunks = columns(pf, 1)

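colnum = 1
col_chunk = col_chunks[colnum]

# `tbl` (the expected table) is not defined in this snippet, so the
# comparison values are left commented out
# correct_vals = tbl[colnum]
# coltype = eltype(correct_vals)
vals_from_file = values(pf, col_chunk)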

You will see that the values in vals_from_file[1] are Int32 instead of Vector{UInt8}.

The same file can be read correctly in R and Python.