dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Some `fastparquet`-related tests are failing on Python 3.10 #896

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

I've seen

FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_timestamp96 - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
FAILED dask/dataframe/io/tests/test_parquet.py::test_with_tz[fastparquet] - ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

with tracebacks like this

_______________________________ test_timestamp96 _______________________________
[gw1] linux -- Python 3.10.12 /usr/share/miniconda3/envs/test-environment/bin/python3.10

tmpdir = local('/tmp/pytest-of-runner/pytest-0/popen-gw1/test_timestamp960')

    @FASTPARQUET_MARK
    def test_timestamp96(tmpdir):
        fn = str(tmpdir)
        df = pd.DataFrame({"a": [pd.to_datetime("now", utc=True)]})
        ddf = dd.from_pandas(df, 1)
        ddf.to_parquet(fn, engine="fastparquet", write_index=False, times="int96")
        pf = fastparquet.ParquetFile(fn)
        assert pf._schema[1].type == fastparquet.parquet_thrift.Type.INT96
>       out = dd.read_parquet(fn, engine="fastparquet", index=False).compute()

dask/dataframe/io/tests/test_parquet.py:1883: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/base.py:342: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:628: in compute
    results = schedule(dsk, keys, **kwargs)
dask/dataframe/io/parquet/core.py:96: in __call__
    return read_parquet_part(
dask/dataframe/io/parquet/core.py:654: in read_parquet_part
    dfs = [
dask/dataframe/io/parquet/core.py:655: in <listcomp>
    func(
dask/dataframe/io/parquet/fastparquet.py:1075: in read_partition
    return cls.pf_to_pandas(
dask/dataframe/io/parquet/fastparquet.py:1115: in pf_to_pandas
    df, views = pf.pre_allocate(size, columns, categories, index)
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:797: in pre_allocate
    df, arrs = _pre_allocate(size, columns, categories, index, cats,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/api.py:1051: in _pre_allocate
    df, views = dataframe.empty(dtypes, size, cols=cols, index_names=index,
/usr/share/miniconda3/envs/test-environment/lib/python3.10/site-packages/fastparquet/dataframe.py:202: in empty
    values = type(bvalues)._from_sequence(values, copy=False, dtype=bvalues.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas/_libs/tslibs/tzconversion.pyx:187: ValueError

showing up this morning on multiple PRs. See this CI build for full details.

Note all the errors involve fastparquet, which had a release yesterday. @martindurant any idea what might be happening here?

martindurant commented 1 year ago

Transferring to fastparquet, but will keep you in the loop @jrbourbeau

martindurant commented 1 year ago

(actually, I can't transfer, will duplicate)

jrbourbeau commented 1 year ago

Just transferred over

martindurant commented 1 year ago

Regression due to https://github.com/dask/fastparquet/pull/893 @jbrockmendel

martindurant commented 1 year ago

Note that the same tests did pass in fastparquet's CI, e.g. https://github.com/dask/fastparquet/actions/runs/6615631492/job/17968182303#step:6:83. Maybe we have different versions of pandas?

jbrockmendel commented 1 year ago

This surfaces a bug upstream that I'll work on. Fortunately it's easy to work around here: in #893, instead of passing dt64 values, pass int64 values to _from_sequence. That will also be more performant.

martindurant commented 1 year ago

values = type(bvalues)._from_sequence(values.view("int64"), copy=False, dtype=bvalues.dtype)

?

I am puzzled why only this invocation of the same method would need this, but if you say so...

jbrockmendel commented 1 year ago

> I am puzzled why only this invocation of the same method would need this, but if you say so...

You are not alone in this. The API design question from ages ago was: "when passing dt64 values and a pd.DatetimeTZDtype to DatetimeIndex (which has the same behavior as _from_sequence here), do we interpret them as wall-times or UTC times?" We eventually landed on wall-times, while i8 values get interpreted as UTC times. Wall-times need to go through a cython function that converts them to UTC times. It is that cython function that is raising.
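To make the wall-time versus UTC distinction concrete, here is a minimal standard-library sketch. It is not pandas' implementation. The helper names `as_utc_instant` and `as_wall_time` and the fixed-offset zone `EASTERN` are hypothetical, chosen only for illustration. It shows how one integer nanosecond count gives two different instants depending on which interpretation is applied.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zone (UTC-05:00), so the sketch needs no tz database.
EASTERN = timezone(timedelta(hours=-5), "EST")
EPOCH = datetime(1970, 1, 1)


def as_utc_instant(ns: int, tz: timezone) -> datetime:
    """Treat ns as nanoseconds since the epoch in UTC, then express it in tz."""
    utc = (EPOCH + timedelta(microseconds=ns // 1000)).replace(tzinfo=timezone.utc)
    return utc.astimezone(tz)


def as_wall_time(ns: int, tz: timezone) -> datetime:
    """Treat ns as a naive local clock reading and attach tz to it unchanged."""
    return (EPOCH + timedelta(microseconds=ns // 1000)).replace(tzinfo=tz)


# 2023-01-01 12:00:00 expressed as nanoseconds since 1970-01-01.
ns = int((datetime(2023, 1, 1, 12) - EPOCH).total_seconds()) * 1_000_000_000

utc_reading = as_utc_instant(ns, EASTERN)   # the "i8 values" interpretation
wall_reading = as_wall_time(ns, EASTERN)    # the "dt64 values" interpretation

print(utc_reading)    # 2023-01-01 07:00:00-05:00
print(wall_reading)   # 2023-01-01 12:00:00-05:00
```

The two readings differ by the zone's UTC offset. In pandas terms, only the wall-time path needs the extra conversion to UTC, which is the step reported above as raising. Passing int64 values skips that step.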

mrocklin commented 1 year ago

Dask CI continues to fail during this period. Should we xfail these tests in the meantime?

jrbourbeau commented 1 year ago

I believe a new fastparquet release is imminent after https://github.com/dask/fastparquet/pull/899 is merged (though I don't object to xfail either)