Closed mpetri closed 2 years ago
@mpetri thanks for the question.
To confirm, are you referring to reading partitioned files? I haven't had to use that yet on my side, but I believe it should work. I will work on creating a test / example for it.
Sorry for the late reply. Yes, I'm talking about a bucket + prefix potentially containing many Parquet files, e.g.:
s3://my-awesome-data/data/year=2022/month=01/day=11/000001.parquet
s3://my-awesome-data/data/year=2022/month=01/day=11/000002.parquet
s3://my-awesome-data/data/year=2022/month=01/day=12/000001.parquet
s3://my-awesome-data/data/year=2022/month=01/day=12/000002.parquet
so I would want to:
let filename = "s3://my-awesome-data/data/";
let config = ListingTableConfig::new(s3_file_system, filename).infer().await?;
let table = ListingTable::try_new(config)?;
let mut ctx = ExecutionContext::new();
ctx.register_table("tbl", Arc::new(table))?;
let df = ctx.sql("SELECT * FROM tbl").await?;
df.show().await?;
I believe this is already supported by the listing table provider. If you run into any issues, please feel free to file an issue in the upstream DataFusion repo.
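For context on what the listing table provider does with paths like the ones above: it discovers hive-style partition values embedded in the key=value path segments and exposes them as table columns. The snippet below is a minimal, self-contained sketch of that path-parsing step (an illustration only, not DataFusion's actual implementation; the function name is made up for this example):

```rust
// Hypothetical sketch: recover partition column values from a
// hive-style path such as "data/year=2022/month=01/day=11/000001.parquet".
// A listing table provider does something conceptually similar when it
// infers partition columns from object-store keys.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        // Keep only segments of the form "key=value".
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    let path = "data/year=2022/month=01/day=11/000001.parquet";
    for (col, val) in partition_values(path) {
        println!("{col} = {val}");
    }
    // Prints:
    // year = 2022
    // month = 01
    // day = 11
}
```

Every file under the registered prefix contributes its own partition values, so a query like `SELECT * FROM tbl WHERE day = '11'` can prune whole directories before reading any Parquet data.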
I have been following different issues in the main DataFusion repo and this one, and from what I can gather, you want to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?