datafusion-contrib / datafusion-objectstore-s3

S3 as an ObjectStore for DataFusion
Apache License 2.0

Multi parquet s3 example? #49

Closed: mpetri closed this issue 2 years ago

mpetri commented 2 years ago

I have been following different issues in the main DataFusion repo and in this one, and from what I can gather, the goal is to enable processing multiple Parquet files stored on S3. Is this already possible, and if so, is there an example of how it can be done?

matthewmturner commented 2 years ago

@mpetri thanks for the question.

To confirm, are you referring to reading partitioned files? I haven't had to use that yet on my side, but I believe it should work. I will work on creating a test / example for it.

mpetri commented 2 years ago

Sorry for the late reply. Yes, I'm talking about a bucket + prefix that potentially contains many Parquet files, e.g.:

s3://my-awesome-data/data/year=2022/month=01/day=11/000001.parquet
s3://my-awesome-data/data/year=2022/month=01/day=11/000002.parquet
s3://my-awesome-data/data/year=2022/month=01/day=12/000001.parquet
s3://my-awesome-data/data/year=2022/month=01/day=12/000002.parquet

so I would want to:

let filename = "s3://my-awesome-data/data/";
let config = ListingTableConfig::new(s3_file_system, filename).infer().await?;
let table = ListingTable::try_new(config)?;
let mut ctx = ExecutionContext::new();
ctx.register_table("tbl", Arc::new(table))?;
let df = ctx.sql("SELECT * FROM tbl").await?;
df.show().await?;

houqp commented 2 years ago

I believe this is already supported by the listing table provider. If you run into any issues, please feel free to file an issue in the upstream DataFusion repo.
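
For reference, a minimal end-to-end sketch of the pattern discussed above. It assumes the DataFusion API of that era (ExecutionContext, ListingTableConfig::new(object_store, path)) and this crate's S3FileSystem with its default-credential constructor, as shown in the crate's README; exact module paths and constructors may differ between versions, so treat this as an illustration rather than a verified example.

use std::sync::Arc;
use datafusion::datasource::listing::{ListingTable, ListingTableConfig};
use datafusion::prelude::ExecutionContext;
use datafusion_objectstore_s3::object_store::s3::S3FileSystem;

// Build the S3 object store; S3FileSystem::default() is assumed to pick up
// the standard AWS credential chain (as in the crate's README).
let s3_file_system = Arc::new(S3FileSystem::default().await);

// Point the listing table at the prefix. infer() lists the objects under the
// prefix and infers the table schema from them, so all matching Parquet files
// under s3://my-awesome-data/data/ are exposed as a single table.
let config = ListingTableConfig::new(s3_file_system, "s3://my-awesome-data/data/")
    .infer()
    .await?;
let table = ListingTable::try_new(config)?;

let mut ctx = ExecutionContext::new();
ctx.register_table("tbl", Arc::new(table))?;
ctx.sql("SELECT * FROM tbl").await?.show().await?;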