Closed: kcullimore closed this issue 1 week ago
I think this is the same as: https://github.com/apache/arrow/issues/43908
Thanks @mapleFU,
I do see a similar issue being discussed further down in the comments there. I've switched to using JSON for my immediate needs.
Closing this since it's a duplicate of the evolving #43908 issue and the python-bigquery #2008 issue.
Describe the enhancement requested
I'm not sure if the behavior described below is expected and I'm just missing something, or if it's a bug.
When uploading a Parquet file created with PyArrow to Google BigQuery, columns containing simple lists (e.g., List[str], List[int], List[float]) are interpreted by BigQuery as RECORD types with REPEATED mode instead of the expected primitive types (STRING, INTEGER, FLOAT) with REPEATED mode.
The example input schema is:
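A plausible reconstruction of that schema, built from the column names and types mentioned later in this report (the original schema listing is not preserved here, so the exact details may differ):

```python
import pyarrow as pa

# Assumed reconstruction: three columns holding simple lists of primitives.
schema = pa.schema([
    ("int_column", pa.list_(pa.int64())),
    ("str_column", pa.list_(pa.string())),
    ("float_column", pa.list_(pa.float64())),
])
print(schema)
# int_column: list<item: int64>
# str_column: list<item: string>
# float_column: list<item: double>
```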
After uploading the Parquet file to a BigQuery table, then querying it back and converting the result to an Arrow table, it returns the following schema:
I've tried explicitly defining the schema in BigQuery and ensuring that the Parquet file's schema matches, but the behavior persists.
I have an alternative workaround in mind (via JSON) but would prefer to continue using PyArrow and parquet.
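For reference, a minimal sketch of such a JSON-based workaround, assuming a newline-delimited JSON load with an explicit schema of REPEATED primitive fields; the table name and file path below are placeholders, not details from the original report:

```python
import json
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.list_repro_json"  # placeholder

# Write the rows as newline-delimited JSON; plain JSON arrays map to REPEATED fields.
rows = [
    {"int_column": [1, 2, 3], "str_column": ["a", "b"], "float_column": [1.0, 2.5]},
    {"int_column": [4], "str_column": ["c"], "float_column": [3.5]},
]
with open("lists.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Load with an explicit schema so the list columns land as REPEATED primitives.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("int_column", "INTEGER", mode="REPEATED"),
        bigquery.SchemaField("str_column", "STRING", mode="REPEATED"),
        bigquery.SchemaField("float_column", "FLOAT", mode="REPEATED"),
    ],
)
with open("lists.jsonl", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()
```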
Example Code
To reproduce, create a Parquet file using PyArrow that includes columns with lists of integers, strings, and floats. Upload this Parquet file to BigQuery via a bucket and inspect the table schema and field values.
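A minimal sketch of those steps, assuming a GCS bucket and BigQuery dataset you control; the bucket, dataset, table, and file names below are placeholders, not values from the original report:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery, storage

# Build a table with simple list columns and write it to Parquet.
table = pa.table({
    "int_column": pa.array([[1, 2, 3], [4]], type=pa.list_(pa.int64())),
    "str_column": pa.array([["a", "b"], ["c"]], type=pa.list_(pa.string())),
    "float_column": pa.array([[1.0, 2.5], [3.5]], type=pa.list_(pa.float64())),
})
pq.write_table(table, "lists.parquet")

# Upload the Parquet file to a bucket.
storage.Client().bucket("my-bucket").blob("lists.parquet").upload_from_filename(
    "lists.parquet"
)

# Load it into BigQuery from the bucket and inspect the resulting schema.
bq_client = bigquery.Client()
table_id = "my-project.my_dataset.list_repro"
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
bq_client.load_table_from_uri(
    "gs://my-bucket/lists.parquet", table_id, job_config=job_config
).result()

for field in bq_client.get_table(table_id).schema:
    # Expected: INTEGER / STRING / FLOAT with REPEATED mode;
    # observed: RECORD with REPEATED mode.
    print(field.name, field.field_type, field.mode)
```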
I would expect BigQuery to recognize `int_column`, `str_column`, and `float_column` as arrays of integers, strings, and floats respectively (with REPEATED mode). However, it interprets these columns as RECORD types with REPEATED mode, which complicates the data handling.

Environment:
• Python 3.11.10
• Ubuntu 22.04.5
• pyarrow==18.0.0
• google-cloud-bigquery==3.26.0
• google-cloud-storage==2.18.2
Component(s)
Python