Can you share the schema of the file here?

pa.parquet.read_schema('gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet')

should be enough.
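For reference, the same check against a local copy would be just this (the dataset_root path is the one created in the reproduction below):

import pyarrow.parquet as pq

# read_schema inspects the file footer without loading any data
print(pq.read_schema("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet"))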
I suspect your Parquet file has a "source_id" column with type string; see the reproduction below:
import os
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Setup
os.mkdir("dataset_root")
os.mkdir("dataset_root/source_id=9319")
tbl = pa.table(
    pd.DataFrame(
        {"source_id": ["9319", "9319", "9319"], "x": np.random.randint(0, 10, 3)}
    )
)
pq.write_table(tbl, "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# This reproduces the issue
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet")
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
# dataset = ParquetDataset(
# ^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
# self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
# return _filesystem_dataset(source, **kwargs)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/Users/bryce/Work/GH-43574/venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
# return factory.finish(schema)
# ^^^^^^^^^^^^^^^^^^^^^^
# File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
# File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
# File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
# pyarrow.lib.ArrowTypeError: Unable to merge: Field source_id has incompatible types: string vs dictionary<values=int32, indices=int32, ordered=0>
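For contrast, a minimal sketch of the ds.dataset path that the report says still works, under the default (non-Hive) partitioning:

import pyarrow.dataset as ds

# No Hive partitioning flavor is requested, so the "source_id=9319"
# directory name is never parsed and the file's own string column is used.
tbl = ds.dataset("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet").to_table()
print(tbl.schema)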
Here's the schema:

source_id: string
site_id: string
readout_time: timestamp[ms, tz=UTC]
voltage: float
kafka_key: string
kakfa_ts_type: uint8
kafka_ts: timestamp[ms]
kafka_partition: uint8
kafka_offset: uint64
kafka_topic: string
ds: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1502
I've also confirmed this bug on the local filesystem as well as via cloud storage. A good workaround is to pass `partitioning=None` to the read_table call.
FWIW we have other files with alphanumerics in that field as well.
Thanks. Some thoughts:

`read_table` errors in your original code where `ds.dataset` does not because (1) `read_table` defaults to Hive partitioning and `ds.dataset` doesn't, and (2) your file contains a `source_id` field and its file path also includes `source_id=X` as a component of the path. With partitioned datasets, partition fields are usually omitted from the files themselves, and I'm not sure what the behavior should be if the user leaves them in. The current behavior seems to be that the reader ignores the field in the file and trusts the partition field value in the file path.
`ds.dataset` succeeds because it's defaulting to Directory partitioning, so it's totally ignoring the Hive partition scheme in your file path. You can make the `ds.dataset` call fail if you specify Hive partitioning (though with a slightly different error), as in the sketch below.
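A sketch of that failure mode on the local reproduction (the exact message may vary by version):

import pyarrow.dataset as ds

# Requesting Hive partitioning makes the factory parse "source_id=9319"
# from the path; the inferred partition type then conflicts with the
# file's string column, so finishing the dataset raises a merge error.
ds.dataset("dataset_root", partitioning="hive")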
You have a few workarounds:

1. Drop the `source_id` field from your Parquet files. This is what I would do.
2. Pass an explicit schema:

   schm = pa.schema([pa.field("source_id", pa.string())])
   pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet", schema=schm)

3. Pass `partitioning=None` to `read_table`, as sketched below.
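A sketch of workaround (3), again on the local reproduction:

import pyarrow.parquet as pq

# partitioning=None turns off Hive-style path inference, so only the
# columns physically present in the file are read.
tbl = pq.read_table(
    "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
    partitioning=None,
)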
Is there a reason why (1) might not work for you?
Describe the bug, including details regarding any error messages, version, and platform.
In pyarrow 17.0.0, accessing a Parquet file with parquet.read_table throws an incompatible-types exception, but accessing it via ds.dataset works. When I revert to pyarrow 16.1.0, both methods work. I've tried using the fs implementation to list the bucket in 17.0.0 and that works fine, so I have no idea what is wrong here. If I download the file locally and open it, it works. This same error also occurs in pandas > 2.0.0 with pandas.read_parquet().
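For completeness, a sketch of that pandas call (the gs:// path is masked as in the report, and a configured GCS filesystem such as gcsfs is assumed):

import pandas as pd

# pandas delegates to pyarrow here, so the same ArrowTypeError surfaces
# through read_parquet as well.
df = pd.read_parquet(
    "gs://****/v1/li191r/ms=2023-01/source_id=9319/li191r_9319_2023-01-02.parquet",
    engine="pyarrow",
)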
Component(s)
Python