dask / fastparquet

Python implementation of the parquet columnar file format.
Apache License 2.0

When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array #911

Closed tking320 closed 9 months ago

tking320 commented 9 months ago

Describe the issue:

When I use fastparquet to convert a parquet file to a dataframe, the following error occurs: "When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array"

Minimal Complete Verifiable Example:

import fastparquet
pf = fastparquet.ParquetFile('/tmp/can_129_1.parquet')
pf.to_pandas()

(screenshot of the error traceback)
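
For context, that ValueError comes from numpy when a buffer is viewed as a wider dtype whose itemsize does not divide the buffer's byte length; the same message can be reproduced outside parquet entirely (purely illustrative):

import numpy as np

# 3 int16 values occupy 6 bytes; 8-byte int64 does not divide 6,
# so numpy raises the ValueError quoted in the issue title
np.arange(3, dtype=np.int16).view(np.int64)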

Anything else we need to know?:

Environment: Linux

martindurant commented 9 months ago

This is not much of a reproducible example, I'm afraid. Can you at least show the schema of the file? How was it made? Which column causes the issue?
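
For anyone reading along, one way to pull that information out of the file is to inspect the ParquetFile object directly; a minimal sketch, assuming fastparquet's usual columns/dtypes/schema attributes:

import fastparquet

pf = fastparquet.ParquetFile('test.parquet')
print(pf.columns)      # column names
print(pf.dtypes)       # pandas dtypes fastparquet will produce
print(pf.schema.text)  # indented parquet schema tree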

tking320 commented 9 months ago

This is the file that caused the error; you can try it.

test.parquet.tar.gz

martindurant commented 9 months ago

This is the first time we have seen DELTA encoding (a parquet V2 feature) in a V1 data page. This should be fixable relatively easily; please stay tuned.
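
For completeness, the encodings a file declares are recorded in the parquet footer; a rough sketch of how to list them from the thrift metadata fastparquet keeps on the ParquetFile (the fmd attribute layout here is assumed from the parquet-format definitions, so treat it as illustrative):

import fastparquet

pf = fastparquet.ParquetFile('test.parquet')
for rg in pf.fmd.row_groups:
    for col in rg.columns:
        md = col.meta_data
        # md.encodings lists the parquet Encoding enum values used for
        # this column chunk (DELTA_BINARY_PACKED is 5 in parquet-format)
        print(md.path_in_schema, md.type, md.encodings)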

martindurant commented 9 months ago

It would be useful if you could let us know the expected values of the "ts" column, as a test.

tking320 commented 9 months ago

> It would be useful if you could let us know the expected values of the "ts" column, as a test.

You can use pandas to parse the file, and the result will be correct:

from pandas import read_parquet

# f is the path to the attached parquet file
df = read_parquet(f)
df['ts'] = df['ts'].astype(str)
df['upload_ts'] = df['upload_ts'].astype(str)

print(df)

martindurant commented 9 months ago

That makes sense :). I was wrong in my initial assertion: we do handle DELTA already, but only for 32-bit types (int, normally), so we need to extend it to 64-bit.
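
To illustrate the general idea (this is not fastparquet's actual decoder): DELTA_BINARY_PACKED stores a first value plus bit-packed deltas, and rebuilding the column is essentially a cumulative sum, so the fix amounts to doing that accumulation with a 64-bit output instead of a 32-bit one. A minimal numpy sketch under those assumptions:

import numpy as np

def reconstruct_from_deltas(first_value, deltas, dtype=np.int64):
    # first_value: the initial value from the page header
    # deltas: the unpacked per-value deltas (min_delta already added back)
    # A 64-bit dtype avoids truncating INT64 columns such as "ts"
    out = np.empty(len(deltas) + 1, dtype=dtype)
    out[0] = first_value
    out[1:] = first_value + np.cumsum(deltas, dtype=dtype)
    return out

# example: three timestamps spaced 1000 apart
print(reconstruct_from_deltas(1_700_000_000_000, [1000, 1000, 1000]))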