dioptre opened this issue 4 days ago
Hey @dioptre!
This use case is already technically possible today, like so:
```python
import daft

# STEP 1: Run your own code to list and filter the files (keep only those under 100 MB)
files = get_files()  # your own listing logic, returning dicts with "path" and "size"
filtered_filepaths = [f["path"] for f in files if f["size"] < 100_000_000]

# STEP 2: Now use Daft to read only the filtered files
df = daft.read_parquet(filtered_filepaths)
```
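For concreteness, here is a minimal sketch of what a user-side `get_files()` helper might look like, assuming the files live in S3 and `boto3` is installed; the bucket name and prefix are placeholders, not from the original post:

```python
# A minimal sketch of a user-side get_files() helper, assuming S3 + boto3.
# The bucket and prefix below are placeholders.
import boto3

def get_files(bucket: str = "my-bucket", prefix: str = "data/") -> list[dict]:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    files = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            files.append({
                "path": f"s3://{bucket}/{obj['Key']}",
                "size": obj["Size"],  # size in bytes, straight from the S3 listing
            })
    return files
```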
We could optimize this further by having Daft do step 1 as well -- is that currently a bottleneck in your workload, and do you think something like the following would be helpful?
```python
# STEP 1 (proposed API): receive a list of filepaths and perform a metadata fetch
# on each file (e.g. file size, created-at timestamp, etc.)
df = daft.from_file_paths(["s3://...", "s3://...", "s3://..."])
df = df.where(df["size"] < 100_000_000)
filtered_filepaths = df.to_pydict()["filepaths"]

# STEP 2: Now use Daft to read only the filtered files
df = daft.read_parquet(filtered_filepaths)
```
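In the meantime, something close to step 1 can be approximated by doing the listing yourself and pushing the size filter into Daft via `daft.from_pydict`; a rough sketch, assuming `s3fs` is installed and the bucket/prefix are placeholders:

```python
# Rough sketch: approximate the proposed step 1 by listing files with s3fs,
# then letting Daft do the size filter. Bucket/prefix are placeholders.
import daft
import s3fs

fs = s3fs.S3FileSystem()
listing = [
    info for info in fs.ls("my-bucket/data/", detail=True)
    if info["type"] == "file"
]

meta_df = daft.from_pydict({
    "path": ["s3://" + info["name"] for info in listing],
    "size": [info["size"] for info in listing],  # bytes
})
meta_df = meta_df.where(meta_df["size"] < 100_000_000)
filtered_filepaths = meta_df.to_pydict()["path"]

df = daft.read_parquet(filtered_filepaths)
```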
I'd still like a way to check the size of a table before bringing it downstream!
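One way to approximate that check today is to peek at the Parquet footer metadata before reading the data; a sketch assuming `pyarrow` and `s3fs` are available (the path and threshold are placeholders):

```python
# Sketch: inspect Parquet footer metadata to estimate a table's size
# before reading it. Assumes pyarrow + s3fs; path and threshold are placeholders.
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

def estimated_bytes(path: str) -> int:
    with fs.open(path, "rb") as f:
        md = pq.read_metadata(f)  # reads only the footer, not the data pages
    # total_byte_size is the uncompressed size of each row group
    return sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))

if estimated_bytes("s3://my-bucket/data/part-0.parquet") < 100_000_000:
    ...  # small enough to bring downstream
```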
Could we get the list you suggested? https://github.com/Eventual-Inc/Daft/pull/1558