duckdb / duckdb-rs

Ergonomic bindings to duckdb for Rust
MIT License

Setting produce_arrow_string_view has no effect, still returns Utf8/LargeUtf8 Arrow types #396

Open jhorstmann opened 1 week ago

jhorstmann commented 1 week ago

What happens?

I am using the duckdb-rs crate to query an embedded DuckDB database and return the results as Arrow record batches. I would like string columns to be returned as Arrow StringView (Utf8View) arrays for more efficient memory usage, but the produce_arrow_string_view setting seems to have no effect on the returned data type.
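For readers unfamiliar with the two layouts, here is a rough, std-only sketch of how their buffer sizes are computed under the Arrow columnar format. It is not duckdb-rs or arrow-rs code: the function names `utf8_bytes` and `utf8_view_bytes` are made up for illustration, and validity bitmaps and buffer padding are ignored. It assumes the layout described in the Arrow specification: Utf8 stores n + 1 32-bit offsets plus the concatenated bytes, while Utf8View stores one 16-byte view per value and keeps strings of up to 12 bytes inline in the view.

```rust
/// Approximate bytes used by an Arrow Utf8 column:
/// (n + 1) i32 offsets plus the concatenated string bytes.
fn utf8_bytes(values: &[&str]) -> usize {
    let offsets = (values.len() + 1) * std::mem::size_of::<i32>();
    let data: usize = values.iter().map(|s| s.len()).sum();
    offsets + data
}

/// Approximate bytes used by an Arrow Utf8View column:
/// one 16-byte view per value; only strings longer than 12 bytes
/// also occupy space in a separate data buffer.
fn utf8_view_bytes(values: &[&str]) -> usize {
    const VIEW_SIZE: usize = 16;
    const MAX_INLINE: usize = 12;
    let views = values.len() * VIEW_SIZE;
    let out_of_line: usize = values
        .iter()
        .map(|s| s.len())
        .filter(|&len| len > MAX_INLINE)
        .sum();
    views + out_of_line
}

fn main() {
    // The same five values the reproduction query returns.
    let values = ["0.0", "10.0", "200.0", "3000.0", "40000.0"];
    println!("Utf8:     {} bytes", utf8_bytes(&values));
    println!("Utf8View: {} bytes", utf8_view_bytes(&values));
}
```

For very short strings like the ones in this reproduction, the view layout is not smaller in raw bytes. Its advantage is that views can reference existing data buffers, so operations such as slicing or filtering can avoid copying the string data.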

To Reproduce

The issue can be reproduced with the following Rust program. I did not see any type conversions in the Rust wrapper itself, so I assume the issue is in the DuckDB core.

Cargo.toml:

[package]
name = "duckdb-arrow-test"
version = "0.1.0"
edition = "2021"

[dependencies]
duckdb = { version = "1.0.0", features = ["bundled"] }

src/main.rs:

use duckdb::Connection;

fn main() {
    let conn = Connection::open_in_memory().unwrap();

    let setup_script = r"
        SET arrow_output_list_view = true;
        SET produce_arrow_string_view = true;
        ";

    conn.execute_batch(&setup_script).unwrap();

    let mut query = conn
        .prepare("SELECT (i*10^i)::varchar AS str FROM range(5) tbl(i)")
        .unwrap();

    let arrow = query.query_arrow([]).unwrap();

    for batch in arrow {
        dbg!(batch.schema().field(0).data_type());
        dbg!(batch.column(0));
    }
}

Output:

$ cargo run
     Running `target/debug/duckdb-arrow-test`
[src/main.rs:20:9] batch.schema().field(0).data_type() = Utf8
[src/main.rs:21:9] batch.column(0) = StringArray
[
  "0.0",
  "10.0",
  "200.0",
  "3000.0",
  "40000.0",
]

By contrast, the arrow_large_buffer_size setting does take effect: it changes the returned data type from Utf8 to LargeUtf8 as expected.

OS:

x86_64 linux ubuntu

DuckDB Version:

1.1.1

DuckDB Client:

rust (duckdb-rs)

Hardware:

No response

Full Name:

Jörn Horstmann

Affiliation:

SAP SE

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?