apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Accessing parquet files with parquet.read_table in Google Cloud Storage fails, but works with dataset; works in 16.1.0, fails in 17.0.0 #43574

Open brokenjacobs opened 2 months ago

brokenjacobs commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow 17.0.0

When accessing a parquet file using parquet.read_table, an incompatible-types exception is thrown:

>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

But accessing via dataset works:

>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
>>>

When I revert to pyarrow 16.1.0 both methods work:

>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]

I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine; I have no idea what is wrong here:

>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>,  (and so on) ]

If I download the file locally and open it, it works. This same error also occurs in pandas > 2.0.0 with pandas.read_parquet().
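
For reference, a minimal sketch of the pandas call that hits the same error (same redacted GCS path as above; pandas' pyarrow engine goes through pyarrow.parquet.read_table under the hood):

>>> import pandas as pd
>>> # fails with the same ArrowTypeError when pyarrow 17.0.0 is installed
>>> df = pd.read_parquet('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet', engine='pyarrow')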

Component(s)

Python

amoeba commented 1 month ago

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

amoeba commented 1 month ago

I suspect your Parquet file has a "source_id" column with type string, which conflicts with the dictionary type that hive partitioning infers for source_id from the "source_id=9319" directory name. See:

import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
    pd.DataFrame(
        {"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
    )
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")

# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
#     dataset = ParquetDataset(
#               ^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
#     self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
#     return _filesystem_dataset(source, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
#     return factory.finish(schema)
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
#   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

brokenjacobs commented 1 month ago

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.


source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
voltage: float
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1502


I've also confirmed this bug on the local filesystem as well as via cloud storage. A good workaround is to pass `partitioning=None` to the read_table call.
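
A minimal sketch of that workaround, reusing the local repro path from the example above:

>>> import pyarrow.parquet as pq
>>> # partitioning=None stops read_table from inferring hive partitioning from the
>>> # "source_id=9319" directory name, so only the file's own string column is used
>>> pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", partitioning=None)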

brokenjacobs commented 1 month ago

FWIW we have other files with alphanumerics in that field as well.

amoeba commented 1 month ago

Thanks. Some thoughts:

You have a few workarounds:

  1. Remove the source_id field from your Parquet files. This is what I would do (see the sketch after this list).
  2. Manually specify a schema:
    schm = pa.schema([pa.field("source_id", pa.string())])
    pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", schema=schm)
  3. Manually specify partitioning=None
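
A minimal sketch of option (1), assuming the file layout from the repro above; dropping the column and rewriting the file in place is just one way to do it:

import pyarrow.parquet as pq

path = "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet"

# Read without inferring the hive partition, then rewrite the file without the
# redundant source_id column so read_table's partition inference no longer conflicts
tbl = pq.read_table(path, partitioning=None)
pq.write_table(tbl.drop_columns(["source_id"]), path)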

Is there a reason why (1) might not work for you?