asfimport opened this issue 2 years ago
This seems to be related to partition_cols: if you comment out that line in the write_to_dataset call, the error is suppressed. I cannot find an example of writing metadata for a partitioned dataset; is there one?
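For reference, this is the pattern from the pyarrow docs example on writing the _metadata / _common_metadata files, without partition_cols (paths and column names are illustrative); this variant completes without error:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3, 4], "part": ["a", "a", "b", "b"]})
root_path = "dataset_no_partitions"   # illustrative path

metadata_collector = []
# no partition_cols here; with that argument commented out, the error is suppressed
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# _common_metadata: just the schema, without row group statistics
pq.write_metadata(table.schema, root_path + "/_common_metadata")

# _metadata: the schema plus the row group statistics of all written files
pq.write_metadata(table.schema, root_path + "/_metadata",
                  metadata_collector=metadata_collector)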
I have the same problem, for datasets in which the schemas of the parquet files are identical except for the ordering of the columns.
That means that, currently, I have to rewrite all parquet files with one unified schema (same column ordering). I wonder whether it is really necessary for the column ordering to be identical.
@legout Can you show the error you get and the code you're using with the dataset writer? It seems that when writing the same file the schema should be the same, but I don't fully understand how you hit this when using the dataset API.
Sorry for my confusing comment. Here are some more details.
The parquet files of the dataset are exports from an Oracle database, written with another tool (KNIME). Unfortunately, this leads to the parquet files having different column orderings, although the data types of the columns are identical.
This means I am able to read the dataset (parquet files) using pyarrow.dataset or pyarrow.parquet.read_table. However, when trying to create the metadata and common metadata files according to https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files, I get this error:
RuntimeError: AppendRowGroups requires equal schemas.
I do understand that the data types have to be identical, but I wonder why the column ordering matters here.
I am currently on my mobile. I'll provide some sample code later.
Create a toy dataset with parquet files that have identical column types but different column ordering:
import os
import tempfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds
t1 = pa.Table.from_pydict({"A": [1, 2, 3], "B": ["a", "b", "c"]})
t2 = pa.Table.from_pydict({"B": ["a", "b", "c"], "A": [1, 2, 3]})
temp_path = tempfile.mkdtemp()
pq.write_table(t1, os.path.join(temp_path, "t1.parquet"))
pq.write_table(t2, os.path.join(temp_path, "t2.parquet"))
ds = pds.dataset(temp_path)
print(ds.to_table())
pyarrow.Table
A: int64
B: string
----
A: [[1,2,3],[1,2,3]]
B: [["a","b","c"],["a","b","c"]]
Collect the metadata of the individual files and create the (global) metadata file:
metadata_collector = [frag.metadata for frag in ds.get_fragments()]
metadata = metadata_collector[0]
metadata.append_row_groups(metadata_collector[1])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[193], line 2
1 metadata = metadata_collector[0]
----> 2 metadata.append_row_groups(metadata_collector[1])
File ~/mambaforge/envs/pydala-dev/lib/python3.11/site-packages/pyarrow/_parquet.pyx:793, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
name: A,
path: A,
physical_type: INT64,
converted_type: NONE,
logical_type: None,
max_definition_level: 1,
max_repetition_level: 0,
}
column descriptor = {
name: B,
path: B,
physical_type: BYTE_ARRAY,
converted_type: UTF8,
logical_type: String,
max_definition_level: 1,
max_repetition_level: 0,
}
>>> metadata_collector[0].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3cee80>
required group field_id=-1 schema {
optional int64 field_id=-1 A;
optional binary field_id=-1 B (String);
}
>>> metadata_collector[1].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3ceec0>
required group field_id=-1 schema {
optional binary field_id=-1 B (String);
optional int64 field_id=-1 A;
}
Oh, I see. This is not allowed, though it looks like it should be. Because the Parquet schema lives in the FileMetaData struct (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1024), all row groups in a file must share the same schema.
This means there is no other solution than rewriting the data with a unified column ordering?
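If so, a rough sketch of what such a rewrite could look like on my side (directory names are illustrative): read each file, select its columns in one fixed order, and write the result back.

import os
import pyarrow.parquet as pq

src_dir = "old_dataset"         # illustrative paths
dst_dir = "reordered_dataset"
os.makedirs(dst_dir, exist_ok=True)

column_order = None
for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".parquet"):
        continue
    table = pq.read_table(os.path.join(src_dir, name))
    if column_order is None:
        # use the first file's column order as the reference
        column_order = table.column_names
    # Table.select returns the columns in the requested order
    pq.write_table(table.select(column_order), os.path.join(dst_dir, name))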
I have a similar issue.
pyarrow 14.0.2
parquet.write_metadata(
File ".../lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3589, in write_metadata
metadata.append_row_groups(m)
File "pyarrow/_parquet.pyx", line 807, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
name: session_num,
path: session_num,
physical_type: INT64,
converted_type: UINT_64,
logical_type: Int(bitWidth=64, isSigned=false),
max_definition_level: 0,
max_repetition_level: 0,
}
column descriptor = {
name: session_num,
path: session_num,
physical_type: INT64,
converted_type: UINT_64,
logical_type: Int(bitWidth=64, isSigned=false),
max_definition_level: 1,
max_repetition_level: 0,
}
All partitions have equal schemas. The example is taken from https://arrow.apache.org/docs/14.0/python/parquet.html#writing-metadata-and-common-metadata-files
When all fields in the schema are nullable, this error does not occur. I think it is related to https://github.com/apache/arrow/issues/31957.
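A rough sketch of that nullable-schema workaround, assuming the data passes through pyarrow Tables before writing (the all_nullable helper and the session_num column are just for illustration):

import pyarrow as pa

def all_nullable(schema: pa.Schema) -> pa.Schema:
    # rebuild the schema with every field marked as nullable
    return pa.schema([f.with_nullable(True) for f in schema],
                     metadata=schema.metadata)

# a table whose field is declared non-nullable
# (this shows up as max_definition_level: 0 in the Parquet column descriptor)
schema = pa.schema([pa.field("session_num", pa.uint64(), nullable=False)])
table = pa.Table.from_arrays([pa.array([1, 2, 3], type=pa.uint64())],
                             schema=schema)

# cast to an all-nullable schema before writing, so every file/partition ends
# up with the same definition levels and append_row_groups no longer complains
table = table.cast(all_nullable(table.schema))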
I'm trying to follow the example here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:
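(A sketch of that reproduction, following the linked docs example with a hypothetical part partition column; column names and paths are illustrative.)

import pyarrow as pa
import pyarrow.parquet as pq

# toy table with a hypothetical "part" partition column
table = pa.table({"a": list(range(8)), "part": ["x", "y"] * 4})

root_path = "partitioned_dataset"   # illustrative path
metadata_collector = []

pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["part"],
    metadata_collector=metadata_collector,
)

# writing the _metadata file is the step that fails with
# "RuntimeError: AppendRowGroups requires equal schemas."
pq.write_metadata(
    table.schema,
    root_path + "/_metadata",
    metadata_collector=metadata_collector,
)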
This raises the AppendRowGroups error, but all schemas in the metadata_collector list seem to be the same.
Environment: macOS. Python 3.8.10. pyarrow 7.0.0, pandas 1.4.2, numpy 1.22.3.
Reporter: Kyle Barron
Note: This issue was originally created as ARROW-16287. Please see the migration documentation for further details.