apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Could not create or read partition table #862

Closed smallzhongfeng closed 10 months ago

smallzhongfeng commented 1 year ago

Describe the bug: After the partitioned table is created, it cannot be read.

To Reproduce

mkdir -p tmp/year=2022 tmp/year=2021
echo "1,2" > tmp/year=2022/data.csv
echo "3,4" > tmp/year=2021/data.csv

Run in ballista-cli:

❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV PARTITIONED BY (year) LOCATION 'tmp';
ArrowError(SchemaError("Unable to get field named \"year\". Valid fields: [\"a\", \"b\"]"))

I deployed it in standalone mode.
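
For context, the layout above follows the Hive-style convention, where the `year` value lives only in the directory name and not inside `data.csv`. A minimal sketch of how such values can be recovered from a file path, using only the Rust standard library; `parse_partitions` is a hypothetical helper for illustration and is not a Ballista or DataFusion API:

```rust
use std::path::Path;

/// Collect `key=value` directory segments from a path (Hive-style partitioning).
/// Hypothetical helper for illustration; not part of Ballista or DataFusion.
fn parse_partitions(path: &str) -> Vec<(String, String)> {
    Path::new(path)
        .components()
        // Keep only path segments that are valid UTF-8.
        .filter_map(|c| c.as_os_str().to_str())
        // Keep only segments of the form `key=value`; `data.csv` is skipped.
        .filter_map(|segment| segment.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    for file in ["tmp/year=2022/data.csv", "tmp/year=2021/data.csv"] {
        println!("{} -> {:?}", file, parse_partitions(file));
    }
}
```

Under this convention, a reader is expected to append `year` to every row it loads from the matching directory, since the CSV rows themselves only carry `a` and `b`.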

smallzhongfeng commented 1 year ago

I deployed the latest online version, and the client is also the latest version, 0.11.0.

smallzhongfeng commented 1 year ago

@thinkharderdev @yahoNanJing @Dandandan Have you ever encountered a similar problem? Could you give me some advice?

smallzhongfeng commented 1 year ago

A similar issue: https://github.com/apache/arrow-ballista/issues/747

smallzhongfeng commented 1 year ago

use ballista::prelude::{BallistaConfig, BallistaContext, Result};
use datafusion::arrow::datatypes::DataType;
use datafusion::datasource::file_format::parquet::DEFAULT_PARQUET_EXTENSION;
use datafusion::prelude::ParquetReadOptions;

#[tokio::main]
async fn main() -> Result<()> {
    let config = BallistaConfig::builder()
        .set("ballista.shuffle.partitions", "1")
        .build()?;

    let ctx = BallistaContext::standalone(&config, 2).await?;

    let options = ParquetReadOptions {
        file_extension: DEFAULT_PARQUET_EXTENSION,
        table_partition_cols: vec![("date".to_string(), DataType::Utf8)],
        parquet_pruning: Some(false),
        skip_metadata: Some(true),
    };
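    let path = "tmp";

    // Read the partitioned directory; `date` is expected to come from the directory names.
    let df = ctx.read_parquet(path, options).await?;
    println!("{}", df.schema());
    // Selecting the partition column should succeed if `date` is part of the schema.
    df.clone().select_columns(&["String", "date"])?;
    df.show().await?;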
    Ok(())
}

This case also fails. Is creating a partitioned table currently not supported?

yahoNanJing commented 1 year ago

Hi @smallzhongfeng, I'll take a look at this issue this week.

smallzhongfeng commented 1 year ago

Thank you for your reply, @yahoNanJing. My current guess is that the partition field is treated as an ordinary field, which causes an error when the schema is matched.
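
To make that guess concrete, here is a minimal sketch of the suspected mismatch, assuming the lookup for `year` is done against the file's own columns instead of the full table schema. It uses only the Rust standard library; `index_of` is a hypothetical helper, and this is not the actual Ballista or Arrow code path:

```rust
/// Look up a column index by name, returning an error that lists the valid fields.
/// Hypothetical helper for illustration only.
fn index_of(fields: &[&str], name: &str) -> Result<usize, String> {
    fields
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| {
            format!(
                "Unable to get field named {:?}. Valid fields: {:?}",
                name, fields
            )
        })
}

fn main() {
    // Columns physically stored in data.csv.
    let file_fields = ["a", "b"];
    // Columns derived from the directory names (assumed Hive-style layout).
    let partition_cols = ["year"];
    // Table schema: file columns followed by partition columns.
    let table_fields: Vec<&str> = file_fields
        .iter()
        .chain(partition_cols.iter())
        .copied()
        .collect();

    // Resolving `year` against the file columns fails.
    println!("{:?}", index_of(&file_fields, "year"));
    // Resolving it against the full table schema succeeds.
    println!("{:?}", index_of(&table_fields, "year"));
}
```

If the guess is right, the reported error would correspond to the first lookup: the valid fields listed are exactly the file columns `a` and `b`, with the partition column missing.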

smallzhongfeng commented 1 year ago

Any update?

andreclaudino commented 1 year ago

It looks like the partitions are ignored and the files inside them are not loaded. Is there any update on how to deal with that?

bcmcmill commented 10 months ago

Any update?