apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Accessing parquet files with parquet.read_table in Google Cloud Storage fails, but works with dataset; works in 16.1.0, fails in 17.0.0 #43574

Open brokenjacobs opened 2 months ago

brokenjacobs commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

In pyarrow 17.0.0

When accessing a parquet file using parquet.read_table, an incompatible-types exception is thrown:

>>> pa.parquet.read_table('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjacobs/tmp/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

But accessing via dataset works:

>>> import pyarrow.dataset as ds
>>> df = ds.dataset('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet').to_table().to_pandas()
>>> df
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]
>>>

When I revert to pyarrow 16.1.0 both methods work:

>>> t = pa.parquet.read_table('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')
>>> t.to_pandas()
      source_id site_id              readout_time   voltage
0          9319    SJER 2023-01-02 00:00:00+00:00  0.000159
1          9319    SJER 2023-01-02 00:00:01+00:00  0.000159
2          9319    SJER 2023-01-02 00:00:02+00:00  0.000160
3          9319    SJER 2023-01-02 00:00:03+00:00  0.000159
4          9319    SJER 2023-01-02 00:00:04+00:00  0.000157
...         ...     ...                       ...       ...
86395      9319    SJER 2023-01-02 23:59:55+00:00  0.000049
86396      9319    SJER 2023-01-02 23:59:56+00:00  0.000048
86397      9319    SJER 2023-01-02 23:59:57+00:00  0.000049
86398      9319    SJER 2023-01-02 23:59:58+00:00  0.000048
86399      9319    SJER 2023-01-02 23:59:59+00:00  0.000048

[86400 rows x 4 columns]

I've tried using the fs implementation to list the bucket in 17.0.0, and that works fine; I have no idea what is wrong here:

>>> from pyarrow import fs
>>> gcs = fs.GcsFileSystem()
>>> file_list = gcs.get_file_info(fs.FileSelector('***t/v1/li191r/ms=2023-01/source_id=9319/', recursive=False))
>>> file_list
[<FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-01.parquet': type=FileType.File, size=418556>, <FileInfo for '***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet': type=FileType.File, size=401198>,  (and so on) ]

If I download the file locally and open it, it works. This same error also occurs in pandas > 2.0.0 with pandas.read_parquet().
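
For reference, a minimal sketch of the pandas call that hits the same error (same redacted GCS path as above; pandas' pyarrow engine goes through pyarrow.parquet.read_table under the hood):

>>> import pandas as pd
>>> # fails with the same ArrowTypeError when pyarrow 17.0.0 is installed
>>> df = pd.read_parquet('gs://***/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet', engine='pyarrow')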

Component(s)

Python

amoeba commented 1 month ago

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.

amoeba commented 1 month ago

I suspect your Parquet file has a "source_id" column with type string, which conflicts with the dictionary type that hive partitioning infers for source_id from the "source_id=9319" directory name. See:

import os

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
    pd.DataFrame(
        {"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
    )
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")

# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
#     dataset = ParquetDataset(
#               ^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
#     self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
#                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
#     return _filesystem_dataset(source, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
#     return factory.finish(schema)
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
#   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>

brokenjacobs commented 1 month ago

Can you share the schema of the file here? pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet') should be enough.


source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
voltage: float
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1502


I've also confirmed this bug on the local filesystem as well as via cloud storage. A good workaround is to pass `partitioning=None` to the read_table call.
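
A minimal sketch of that workaround, reusing the local repro path from the example above:

>>> import pyarrow.parquet as pq
>>> # partitioning=None stops read_table from inferring hive partitioning from the
>>> # "source_id=9319" directory name, so only the file's own string column is used
>>> pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", partitioning=None)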

brokenjacobs commented 1 month ago

FWIW we have other files with alphanumerics in that field as well.

amoeba commented 1 month ago

Thanks. Some thoughts:

You have a few workarounds:

  1. Remove the source_id field from your Parquet files. This is what I would do (see the sketch after this list).
  2. Manually specify a schema:
    schm = pa.schema([pa.field("source_id", pa.string())])
    pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", schema=schm)
  3. Manually specify partitioning=None
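
A minimal sketch of option (1), assuming the file layout from the repro above; dropping the column and rewriting the file in place is just one way to do it:

import pyarrow.parquet as pq

path = "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet"

# Read without inferring the hive partition, then rewrite the file without the
# redundant source_id column so read_table's partition inference no longer conflicts
tbl = pq.read_table(path, partitioning=None)
pq.write_table(tbl.drop_columns(["source_id"]), path)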

Is there a reason why (1) might not work for you?