apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] pyarrow.concat_tables raises error about different Schema if columns have different order #35424

Open einsone opened 1 year ago

einsone commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

Why does a different column order result in a different schema?

The following code raises:

pyarrow.lib.ArrowInvalid: Schema at index 1 was different:

import pandas as pd
import pyarrow as pa

pa.show_info()  # prints the version info itself and returns None

df1 = pd.DataFrame({
    "col1": [1,2,3,4,5],
    "col2": ["A", "B", "C", "D", "E"],
})

df2 = pd.DataFrame({
    "col2": ["A", "B", "C", "D", "E"],
    "col1": [1,2,3,4,5],
})

tbl1 = pa.Table.from_pandas(df1, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Component(s)

C++, Python

einsone commented 1 year ago

pyarrow version info

Package kind              : python-wheel-manylinux2014
Arrow C++ library version : 12.0.0
Arrow C++ compiler        : GNU 10.2.1
Arrow C++ compiler flags  : -fdiagnostics-color=always
Arrow C++ git revision    :
Arrow C++ git description :
Arrow C++ build type      : release

Platform:
  OS / Arch           : Linux x86_64
  SIMD Level          : avx2
  Detected SIMD Level : avx2

Memory:
  Default backend     : jemalloc
  Bytes allocated     : 0 bytes
  Max memory          : 0 bytes
  Supported Backends  : jemalloc, mimalloc, system

Optional modules:
  csv                 : Enabled
  cuda                : -
  dataset             : Enabled
  feather             : Enabled
  flight              : Enabled
  fs                  : Enabled
  gandiva             : -
  json                : Enabled
  orc                 : Enabled
  parquet             : Enabled

Filesystems:
  GcsFileSystem       : Enabled
  HadoopFileSystem    : Enabled
  S3FileSystem        : Enabled

Compression Codecs:
  brotli              : Enabled
  bz2                 : Enabled
  gzip                : Enabled
  lz4_frame           : Enabled
  lz4                 : Enabled
  snappy              : Enabled
  zstd                : Enabled

westonpace commented 1 year ago

AFAIK, there is no way for Arrow to consistently determine the correct order. In Arrow, columns are allowed to have duplicate names so something like this would be allowed:

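# A Python dict literal keeps only the last value for a repeated key,
# so the duplicate names are passed explicitly alongside the arrays.
tab1 = pa.Table.from_arrays(
    [pa.array([1, 2, 3, 4, 5]), pa.array([6, 7, 8, 9, 10])],
    names=["col", "col"],
)

tab2 = pa.Table.from_arrays(
    [pa.array([6, 7, 8, 9, 10]), pa.array([1, 2, 3, 4, 5])],
    names=["col", "col"],
)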

Two tables with different schemas can't be combined. You will need to normalize the schema in your code (or perhaps pandas) before providing it to Arrow.