apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Support Glob Expressions for S3 #7393

Open a-agmon opened 10 months ago

a-agmon commented 10 months ago

Describe the bug

Im trying to register a listing table using a certain glob string: s3://somebucket/somepath/*44fad0765ac6-00001.parquet and receive the error

Error: Custom { kind: Other, error: 
"Object Store error: Object at location
 somepath/%2A44fad0765ac6-00001.parquet not found: 
response error \"No Body\", after 0 retries: HTTP status client error (404 Not Found) 
for url (https://s3.eu-west-1.amazonaws.com/somebucket/somepath/%252A44fad0765ac6-00001.parquet)" }

however, I know the file is there because the action succeeds when I just using the full file name s3://somebucket/somepath/00000-2367-2918fbc9-ea04-4927-a669-44fad0765ac6-00001.parquet

from the error, I have a feeling it's related to how the url and gloc char is being encoded, looks like its being encoded twice for some reason - from * -> %2A -> %252A

To Reproduce

   let glob_str = "*44fad0765ac6-00001.parquet";
   let table_files = String::from("somepath");
   let table_files_path = format!("s3://{bucket_name}/{table_files}/{glob_str}");
   let table_files_url = ListingTableUrl::parse(&table_files_path)?;
   let schema_file_path = format!("s3://{bucket_name}/{table_files}/00000-2367-2918fbc9-ea04-4927-a669-44fad0765ac6-00001.parquet");
   let schema_file_url = ListingTableUrl::parse(&schema_file_path)?;
   println!("table_files_url: {}", table_files_path);

   let file_format = ParquetFormat::default()
.with_enable_pruning(Some(true));
   let listing_options = ListingOptions::new(Arc::new(file_format))
.with_file_extension(".parquet");

    let schema = listing_options.infer_schema(&ctx.state(), &schema_file_url).await?;

    let table_config = 
        ListingTableConfig::new(table_files_url) // the path of the files
        .with_listing_options(listing_options)
        .with_schema(schema);

    let table_provider = Arc::new(ListingTable::try_new(table_config).unwrap());

    ctx.register_table(TableReference::from("users"),table_provider).unwrap();
    let df = ctx.sql("SELECT count(*) FROM users").await?;
    df.show().await?;

Expected behavior

the files that correspond to the glob string will be registered as a table. In the code above if I just switch between the ListingTableUrl with the globe string to that of the schema the code executes successfully (so no permissions issue)/

Additional context

My Cargo.Toml

aws-config = "0.55.3"
aws-sdk-s3 = "0.28.0"
object_store = { version = "0.6.1", features = ["aws"] }
url = "2.4.0"  
async-graphql = "6.0.4"
async-graphql-actix-web = { version = "6.0.4"}
actix-web =  { version = "4.3.1" }
datafusion = "28.0.0"
tokio = { version = "1.32.0", features = ["full"] }
tustvold commented 10 months ago

Glob expressions are currently only supported for local files, the docs can be found here - https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTableUrl.html#paths-without-a-scheme

I can't remember the context on why though, I think it was to avoid needing to escape glob expression... It should be possible to add an API to allow creating a listing table with a prefix and a glob expression

a-agmon commented 10 months ago

Thanks @tustvold , I think its very useful, especially when we want to register a table on a certain partition path - e.g. s3://somebucket/data/source=*/env=prod/*.parquet