dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array #911

Closed tking320 closed 11 months ago

tking320 commented 11 months ago

Describe the issue:

When I use fastparquet to convert a parquet file to a dataframe, an error occurs: "When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array"

Minimal Complete Verifiable Example:

import fastparquet
pf=fastparquet.ParquetFile('/tmp/can_129_1.parquet')
pf.to_pandas()
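
The error text itself is a numpy ValueError raised by ndarray.view when a buffer is reinterpreted as a wider dtype whose itemsize does not evenly divide the byte length of the last axis. A minimal sketch (not from the issue) that reproduces the same numpy message:

import numpy as np

# Seven int32 values occupy 28 bytes; 28 is not divisible by 8, so viewing
# the buffer as int64 raises "When changing to a larger dtype, its size must
# be a divisor of the total size in bytes of the last axis of the array".
a = np.arange(7, dtype="int32")
try:
    a.view("int64")
except ValueError as e:
    print(e)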


Anything else we need to know?:

Environment: linux

martindurant commented 11 months ago

This is not much of a reproducible example, I'm afraid. Can you at least show the schema of the file? How was it made? Which column causes the issue?
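
For reference, the schema and the dtypes fastparquet intends to produce can be printed from the ParquetFile object; a short sketch, assuming the path from the report:

import fastparquet

pf = fastparquet.ParquetFile('/tmp/can_129_1.parquet')
print(pf.schema)   # parquet schema tree
print(pf.dtypes)   # pandas dtypes fastparquet will use
print(pf.columns)  # column names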

tking320 commented 11 months ago

This is the file that caused the error, you can try it.

test.parquet.tar.gz

martindurant commented 11 months ago

This is the first time we have seen DELTA encoding (a parquet V2 feature) in a v1 data page. This should be fixable relatively easily, please stay tuned.
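
For anyone checking their own files, one way to see whether a column chunk uses DELTA encoding is to inspect the column-chunk metadata; a sketch using pyarrow (assumed to be installed, and assuming the attachment extracts to test.parquet):

import pyarrow.parquet as pq

md = pq.ParquetFile('test.parquet').metadata
for rg in range(md.num_row_groups):
    for c in range(md.num_columns):
        col = md.row_group(rg).column(c)
        # encodings is a tuple such as ('PLAIN', 'RLE', 'DELTA_BINARY_PACKED')
        print(col.path_in_schema, col.encodings)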

martindurant commented 11 months ago

It would be useful if you could let us know the expected values of the "ts" column, as a test.

tking320 commented 11 months ago

> It would be useful if you could let us know the expected values of the "ts" column, as a test.

You can use pandas to parse the file and the result will be correct:


from pandas import read_parquet

# f: path to the parquet file
df = read_parquet(f)
df['ts'] = df['ts'].astype(str)
df['upload_ts'] = df['upload_ts'].astype(str)

print(df)
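
A possible check along those lines, sketched here under the assumption that the attachment extracts to test.parquet, is to compare fastparquet's output against the pyarrow-backed pandas reader:

import fastparquet
import pandas as pd

expected = pd.read_parquet('test.parquet', engine='pyarrow')
actual = fastparquet.ParquetFile('test.parquet').to_pandas()
# Compare values only; exact dtypes may legitimately differ between engines.
pd.testing.assert_frame_equal(actual, expected, check_dtype=False)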

martindurant commented 11 months ago

That makes sense :). I was wrong in my initial assertion: we do handle DELTA already, but only for 32-bit types (int, normally), so it needs to be extended to 64-bit.
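
Once the bit-packed deltas of a DELTA_BINARY_PACKED page have been unpacked, reconstruction is just the first value plus a running sum of the deltas, so supporting 64-bit columns mostly means doing that accumulation at 64-bit width. A simplified sketch of that final step (an illustration, not fastparquet's actual decoder):

import numpy as np

def reconstruct(first_value, deltas, dtype=np.int64):
    # deltas: already-unpacked per-value deltas (with min_delta added back).
    # Accumulating in the target width keeps 64-bit timestamps from
    # overflowing a 32-bit intermediate.
    out = np.empty(len(deltas) + 1, dtype=dtype)
    out[0] = first_value
    np.cumsum(deltas, dtype=dtype, out=out[1:])
    out[1:] += first_value
    return out

print(reconstruct(100, [1, 2, 3]))  # -> [100 101 103 106]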