apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Concise API to create DataFrame from collection #12574

Open comphead opened 1 week ago

comphead commented 1 week ago

I feel we need a way to create a DataFrame from rows, in addition to creating one from data files.

Currently, DataFrames are created from logical plans or by reading files. Having an API to create a DataFrame from collections would make it easier to play with test data and to add examples and documentation.

An example could be:

let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Utf8, false),
    Field::new("b", DataType::Int32, false),
]));

let data: Vec<ArrayRef> = 
DataFrame::from(schema, data)

Under the hood, the method can call ctx.read_batch(record_batch). The batch can be created with RecordBatch::try_from_iter or RecordBatch::try_new.

A very good starting point is dataframe_in_memory.rs, which shows how much code is currently needed just to create a DataFrame from a schema and data. The idea is to offer a more concise API.

Originally posted by @comphead in https://github.com/apache/datafusion/issues/12564#issuecomment-2365265416

timsaucer commented 1 week ago

This is a great idea. We have some work in datafusion-python we might be able to reuse.

timsaucer commented 1 week ago

I did a quick proof of concept. Does this match what you're looking for?

#[tokio::main]
async fn main() -> Result<()> {
    let batch = create_batch!(
        ("a", Int32, vec![1, 2, 3]),
        ("b", Float64, vec![Some(4.0), None, Some(5.0)]),
        ("c", Utf8, vec!["alpha", "beta", "gamma"])
    )?;

    let ctx = SessionContext::new();
    ctx.read_batch(batch)?.show().await
}

The output is:

+---+-----+-------+
| a | b   | c     |
+---+-----+-------+
| 1 | 4.0 | alpha |
| 2 |     | beta  |
| 3 | 5.0 | gamma |
+---+-----+-------+
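
The body of create_batch! is not shown in the comment above. As a rough illustration of how a macro with that call shape could be written, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: Column, Batch, IntoCell and the helper macro column! are hypothetical stand-ins, not Arrow or DataFusion types, and a real version would presumably build Arrow arrays and a RecordBatch instead.

```rust
/// Stand-in for an Arrow array (hypothetical, not the Arrow type).
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int32(Vec<Option<i32>>),
    Float64(Vec<Option<f64>>),
    Utf8(Vec<Option<String>>),
}

impl Column {
    pub fn len(&self) -> usize {
        match self {
            Column::Int32(v) => v.len(),
            Column::Float64(v) => v.len(),
            Column::Utf8(v) => v.len(),
        }
    }
}

/// Stand-in for a record batch: named columns of equal length.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<(String, Column)>,
}

impl Batch {
    /// Rejects columns whose lengths differ from the first column.
    pub fn try_new(columns: Vec<(String, Column)>) -> Result<Self, String> {
        let rows = columns.first().map_or(0, |(_, c)| c.len());
        for (name, col) in &columns {
            if col.len() != rows {
                return Err(format!("column {} has {} rows, expected {}", name, col.len(), rows));
            }
        }
        Ok(Batch { columns })
    }

    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, c)| c.len())
    }
}

/// Lets a column accept both plain values and `Option`s.
pub trait IntoCell<T> {
    fn into_cell(self) -> Option<T>;
}
impl IntoCell<i32> for i32 { fn into_cell(self) -> Option<i32> { Some(self) } }
impl IntoCell<i32> for Option<i32> { fn into_cell(self) -> Option<i32> { self } }
impl IntoCell<f64> for f64 { fn into_cell(self) -> Option<f64> { Some(self) } }
impl IntoCell<f64> for Option<f64> { fn into_cell(self) -> Option<f64> { self } }
impl IntoCell<String> for &str { fn into_cell(self) -> Option<String> { Some(self.to_string()) } }
impl IntoCell<String> for Option<&str> { fn into_cell(self) -> Option<String> { self.map(str::to_string) } }

/// Builds one column from a type tag and a collection of values.
macro_rules! column {
    (Int32, $v:expr) => { Column::Int32($v.into_iter().map(|x| IntoCell::<i32>::into_cell(x)).collect()) };
    (Float64, $v:expr) => { Column::Float64($v.into_iter().map(|x| IntoCell::<f64>::into_cell(x)).collect()) };
    (Utf8, $v:expr) => { Column::Utf8($v.into_iter().map(|x| IntoCell::<String>::into_cell(x)).collect()) };
}

/// Accepts `(name, type, values)` tuples and returns `Result<Batch, String>`.
macro_rules! create_batch {
    ($(($name:expr, $ty:ident, $values:expr)),+ $(,)?) => {
        Batch::try_new(vec![$(($name.to_string(), column!($ty, $values))),+])
    };
}
```

With these stand-ins, the invocation from the proof of concept (the same three tuples for "a", "b" and "c") expands to a Batch::try_new call over three named columns, and a column of plain values is stored the same way as a column of Options. The type tag is matched as a literal token by column!, so an unsupported tag fails at compile time rather than at run time.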