Describe the bug
ORDER BY is ignored when COPYing from a pyarrow table to a csv file
This happens both for tables created with pyarrow.Table.from_pydict and from ORC files.
To Reproduce
with datafusion 35.0.0 installed from PyPI:
from pathlib import Path
import datafusion
import pyarrow.csv
import pyarrow.dataset
ctx = datafusion.SessionContext()
ctx.from_arrow_table(pyarrow.Table.from_pydict({'value': [2, 1, 3]}), "content")
output_path = Path("/tmp/output.csv")
query = f"""
COPY (SELECT value FROM content ORDER BY value DESC)
TO '{output_path}' (
FORMAT CSV,
)
"""
df = ctx.sql(query)
columns = df.schema().names
assert columns == ["value"], columns
df.count() # force the query to run
print(output_path.read_text())
use std::sync::Arc;
use datafusion::arrow::array::PrimitiveArray;
use datafusion::arrow::datatypes::{DataType, Field, Schema, Int64Type};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::prelude::*;
use datafusion::datasource::MemTable;
#[tokio::main]
async fn main() {
let ctx = SessionContext::new();
let schema = Arc::new(Schema::new(vec![Field::new("value", DataType::Int64, false)]));
let column: PrimitiveArray<Int64Type> = vec![2, 1, 3].into();
let partition = RecordBatch::try_new(schema.clone(), vec![Arc::new(column)]).unwrap();
let table = MemTable::try_new(schema, vec![vec![partition]]).unwrap();
ctx.register_table("content", Arc::new(table)).unwrap();
let df = ctx.sql("COPY (SELECT value FROM content ORDER BY value) TO '/tmp/output.csv'").await.unwrap();
df.collect().await.unwrap();
}
Describe the bug ORDER BY is ignored when COPYing from a pyarrow table to a csv file
This happens both for tables created with
pyarrow.Table.from_pydict
and from ORC files.To Reproduce
with
datafusion 35.0.0
installed from PyPI:Expected behavior Should print
Additional context
I tried to reproduce it directly in Rust (https://github.com/apache/arrow-datafusion/issues/9463), but this code does produce a sorted output as expected: