apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file #31678

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I'm trying to follow the example here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:


from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)
metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row group statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row group statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)

This raises the following error:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas. 

But all schemas in the metadata_collector list seem to be the same:


all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True 
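
The collected schemas do match each other. The mismatch is more likely between the collected schemas and the schema passed to write_metadata, since write_to_dataset strips the partition column from the data files it writes. A quick diagnostic (a sketch reusing the variables from the snippet above):

# The metadata collected by write_to_dataset likely describes files
# without the partition column, while table.schema still contains it.
print(metadata_collector[0].schema.to_arrow_schema())
print(table.schema)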

Environment: macOS, Python 3.8.10, pyarrow 7.0.0, pandas 1.4.2, numpy 1.22.3. Reporter: Kyle Barron

Note: This issue was originally created as ARROW-16287. Please see the migration documentation for further details.

david-waterworth commented 1 year ago

This seems to be related to partition_cols: if you remove that argument from the write_to_dataset call, the error goes away. I cannot find an example of writing metadata for a partitioned dataset.
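
If partition_cols is the culprit, one possible workaround (a sketch only, reusing the variables from the original example; not verified across pyarrow versions) is to hand write_metadata a schema with the partition column removed, matching what write_to_dataset actually writes to the data files:

# Drop the partition column from the schema before writing _metadata,
# since the data files produced with partition_cols do not contain it.
schema_without_partition = table.schema.remove(
    table.schema.get_field_index("partition_col")
)
pq.write_metadata(
    schema_without_partition,
    root_path / "_metadata",
    metadata_collector=metadata_collector,
)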

legout commented 1 year ago

I have the same problem with datasets in which the schemas of the parquet files are identical except for the ordering of the columns.

That means I currently have to rewrite all parquet files with one unified schema (same column ordering). I wonder whether it is really necessary for the column ordering to be identical.
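
For reference, a minimal sketch of such a rewrite (assuming a hypothetical file_paths list holding the dataset's parquet files, all sharing the same column names):

import pyarrow.parquet as pq

# Use the column order of the first file as the reference ordering
# and rewrite every file so that all schemas agree.
column_order = pq.read_schema(file_paths[0]).names
for path in file_paths:
    pq.write_table(pq.read_table(path).select(column_order), path)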

mapleFU commented 1 year ago

@legout Can you show the error you get and the code you're using with the dataset writer? It seems that within a single file the schema has to be the same, but I don't fully understand how you run into this when using the dataset API.

legout commented 1 year ago

Sorry for my confusing comment. Here are some more details.

The parquet files of the dataset are exports from an Oracle database, written with another piece of software (KNIME). Unfortunately, this results in the parquet files having different column orderings, although the data types of the columns are identical.

This means I am able to read the dataset (the parquet files) using pyarrow.dataset or pyarrow.parquet.read_table. However, when trying to create the metadata and common metadata files according to https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files, I get this error:

RuntimeError: AppendRowGroups requires equal schemas.

I do understand that the data types have to be identical, but I wonder why the column ordering matters here.

I am currently on my mobile. I'll provide some sample code later.

legout commented 1 year ago

Create a toy dataset with parquet files having identical column types but different column orderings.

import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

t1 = pa.Table.from_pydict({"A": [1, 2, 3], "B": ["a", "b", "c"]})
t2 = pa.Table.from_pydict({"B": ["a", "b", "c"], "A": [1, 2, 3]})

temp_path = tempfile.mkdtemp()

pq.write_table(t1, os.path.join(temp_path, "t1.parquet"))
pq.write_table(t2, os.path.join(temp_path, "t2.parquet"))

ds = pds.dataset(temp_path)
print(ds.to_table())
pyarrow.Table
A: int64
B: string
----
A: [[1,2,3],[1,2,3]]
B: [["a","b","c"],["a","b","c"]]

Collect metadata of the individual files and create the (global) metadata file.

metadata_collector = [frag.metadata for frag in ds.get_fragments()]

metadata = metadata_collector[0]
metadata.append_row_groups(metadata_collector[1])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[193], line 2
      1 metadata = metadata_collector[0]
----> 2 metadata.append_row_groups(metadata_collector[1])

File ~/mambaforge/envs/pydala-dev/lib/python3.11/site-packages/pyarrow/_parquet.pyx:793, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: A,
  path: A,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
column descriptor = {
  name: B,
  path: B,
  physical_type: BYTE_ARRAY,
  converted_type: UTF8,
  logical_type: String,
  max_definition_level: 1,
  max_repetition_level: 0,
}
mapleFU commented 1 year ago

>>> metadata_collector[0].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3cee80>
required group field_id=-1 schema {
  optional int64 field_id=-1 A;
  optional binary field_id=-1 B (String);
}

>>> metadata_collector[1].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3ceec0>
required group field_id=-1 schema {
  optional binary field_id=-1 B (String);
  optional int64 field_id=-1 A;
}

Oh, I got it. This is not allowed, though it looks like it should be allowed.

It's because the Parquet schema is stored at the file level, in "FileMetaData" (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1024), so different row groups within one file must have the same schema.
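
Given that constraint, reordering the columns of one table before writing makes the toy example above work, e.g. (a sketch continuing the earlier snippet):

# Rewrite t2 with the same column order as t1 so that both files
# share one Parquet schema; append_row_groups then succeeds.
pq.write_table(t2.select(t1.column_names), os.path.join(temp_path, "t2.parquet"))

ds = pds.dataset(temp_path)
metadata_collector = [frag.metadata for frag in ds.get_fragments()]
metadata = metadata_collector[0]
metadata.append_row_groups(metadata_collector[1])  # no longer raises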

legout commented 1 year ago

Does this mean there is no solution other than rewriting the data with a consistent column ordering?

KernelA commented 8 months ago

I have a similar issue.

pyarrow 14.0.2

  parquet.write_metadata(
  File ".../lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3589, in write_metadata
    metadata.append_row_groups(m)
  File "pyarrow/_parquet.pyx", line 807, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 0,
  max_repetition_level: 0,
}
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 1,
  max_repetition_level: 0,
}

All partitions have equal schemas. The example is taken from https://arrow.apache.org/docs/14.0/python/parquet.html#writing-metadata-and-common-metadata-files

KernelA commented 7 months ago

When all fields in the schema are nullable, this error does not occur. I think it is related to https://github.com/apache/arrow/issues/31957
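
The differing max_definition_level values (0 vs. 1) in the descriptors above correspond to a non-nullable vs. a nullable field. A sketch of normalizing nullability before writing (assuming a table variable about to be written to the dataset):

import pyarrow as pa

# Cast every field to nullable so that all written files produce
# identical column descriptors (max_definition_level = 1 throughout).
nullable_schema = pa.schema([f.with_nullable(True) for f in table.schema])
table = table.cast(nullable_schema)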