dask / fastparquet

Python implementation of the parquet columnar file format.
Apache License 2.0

Fix dt regression in empty() #898

Closed martindurant closed 11 months ago

martindurant commented 11 months ago

Fixes #897

martindurant commented 11 months ago

@jrbourbeau, I'll merge this when it passes, and that should be enough to make dask CI happy.

jrbourbeau commented 11 months ago

Thanks for fixing so quickly @martindurant!

Will there be a release out with this patch soon? We use releases in most CI builds (one build uses main for fastparquet). If not, I'll just add some skip logic.
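
For illustration only, here is a minimal sketch of what such skip logic could look like. The version constant and both helper names are hypothetical; the thread does not say which fastparquet release will carry the fix, so the value below is a placeholder, not a real release number.

```python
import re

# Hypothetical placeholder: the first fastparquet release assumed to contain
# the fix. The real version number is not stated in this thread.
ASSUMED_FIXED_VERSION = "9999.0.0"


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple.

    Parsing stops at the first dot-separated piece that does not start with
    a digit, so local/dev suffixes such as "+3.gabc" are ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def should_skip(installed, fixed=ASSUMED_FIXED_VERSION):
    """Return True when `installed` predates the assumed fixed release."""
    a, b = parse_version(installed), parse_version(fixed)
    # Pad with zeros so that "1.0" and "1.0.0" compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b
```

In a test module this could feed `pytest.mark.skipif(should_skip(fastparquet.__version__), reason="...")`. Because dev suffixes are ignored, a main-branch build compares the same as the release it is based on, which may or may not be what is wanted.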

martindurant commented 11 months ago

> Will there be a release out with this patch soon?

Yes, since the windows-py3.12 wheel failed to build in the last round anyway.

martindurant commented 11 months ago

@jrbourbeau, would you mind running your main-branch CI somewhere to see if the failures go away?

jrbourbeau commented 11 months ago

Locally I'm getting the same error:

____________________________________________________________________________________________________________________________________ test_timestamp96 _____________________________________________________________________________________________________________________________________

tmpdir = local('/private/var/folders/h0/_w6tz8jd3b9bk6w7d_xpg9t40000gn/T/pytest-of-james/pytest-21/test_timestamp960')

    @FASTPARQUET_MARK
    def test_timestamp96(tmpdir):
        fn = str(tmpdir)
        df = pd.DataFrame({"a": [pd.to_datetime("now", utc=True)]})
        ddf = dd.from_pandas(df, 1)
        ddf.to_parquet(fn, engine="fastparquet", write_index=False, times="int96")
        pf = fastparquet.ParquetFile(fn)
        assert pf._schema[1].type == fastparquet.parquet_thrift.Type.INT96
>       out = dd.read_parquet(fn, engine="fastparquet", index=False).compute()

dask/dataframe/io/tests/test_parquet.py:1883:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/base.py:342: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:628: in compute
    results = schedule(dsk, keys, **kwargs)
dask/dataframe/io/parquet/core.py:96: in __call__
    return read_parquet_part(
dask/dataframe/io/parquet/core.py:654: in read_parquet_part
    dfs = [
dask/dataframe/io/parquet/core.py:655: in <listcomp>
    func(
dask/dataframe/io/parquet/fastparquet.py:1075: in read_partition
    return cls.pf_to_pandas(
dask/dataframe/io/parquet/fastparquet.py:1115: in pf_to_pandas
    df, views = pf.pre_allocate(size, columns, categories, index)
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/api.py:797: in pre_allocate
    df, arrs = _pre_allocate(size, columns, categories, index, cats,
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/api.py:1051: in _pre_allocate
    df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
../../../mambaforge/envs/dask-py310/lib/python3.10/site-packages/fastparquet/dataframe.py:202: in empty
    values = type(bvalues)._from_sequence(values, copy=False, dtype=bvalues.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas/_libs/tslibs/tzconversion.pyx:187: ValueError

Note that the line changed in this PR looks similar, but not identical, to the line where the error is raised. Maybe both lines need the same sort of update.

martindurant commented 11 months ago

What's your pandas version?

jrbourbeau commented 11 months ago

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.5.3'

martindurant commented 11 months ago

OK, then I think all the pandas versions I have locally and in the tests are too new... Hold on.