Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Checking file sizes #2851

Open dioptre opened 4 days ago

dioptre commented 4 days ago

Still would like a way to check the size of a table before bringing it down the stream!

Could we get the list you suggested? https://github.com/Eventual-Inc/Daft/pull/1558

jaychia commented 1 day ago

Hey @dioptre!

This use-case is already technically possible today, like so:

# STEP 1: Run your own code to filter the list of files
files = get_files()  # user-supplied; yields records with "path" and "size"
filtered_filepaths = [f["path"] for f in files if f["size"] < 100_000_000]

# STEP 2: Now use Daft to read it
import daft

df = daft.read_parquet(filtered_filepaths)
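`get_files()` above is left to the user. As a concrete stand-in (purely hypothetical, and assuming local files rather than object storage), a minimal version that returns the same shape of records could look like:

```python
import os

def get_files(directory):
    """Hypothetical stand-in for get_files(): list the files in a local
    directory along with their sizes, mirroring the {"path", "size"}
    records assumed by the filtering snippet above."""
    records = []
    for entry in os.scandir(directory):
        if entry.is_file():
            records.append({"path": entry.path, "size": entry.stat().st_size})
    return records
```

For files in S3 you would list objects instead (e.g. with an S3 client's list-objects call), which already returns each object's size, so no extra per-file metadata requests are needed.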

We could technically optimize this by using Daft to do step 1 as well -- is that currently a bottleneck in your workload, and do you think that would be helpful?

# STEP 1: Receive a list of filepaths and perform a metadata fetch on each
# file (e.g. file size, created-at timestamp, etc.)
df = daft.from_file_paths(["s3://...", "s3://...", "s3://..."])
df = df.where(df["size"] < 100_000_000)
filtered_filepaths = df.to_pydict()["filepaths"]

# STEP 2: Now use Daft to read it
df = daft.read_parquet(filtered_filepaths)
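The proposed flow boils down to "fetch metadata, filter on size, collect paths". As a sketch of just that filtering step in plain Python (with the metadata fetch stubbed out as a list of dicts, since `from_file_paths` is only a proposal here):

```python
def filter_paths_by_size(metadata_rows, max_bytes=100_000_000):
    """Keep only the paths of files strictly smaller than max_bytes.

    metadata_rows models the {"path", "size"} records that a per-file
    metadata fetch would return; names and sizes below are illustrative.
    """
    return [row["path"] for row in metadata_rows if row["size"] < max_bytes]

rows = [
    {"path": "s3://bucket/small.parquet", "size": 50_000_000},
    {"path": "s3://bucket/huge.parquet", "size": 5_000_000_000},
]
small_files = filter_paths_by_size(rows)  # keeps only small.parquet
```

Separately, it may be worth checking whether your Daft version already exposes `daft.from_glob_path`, which (if available) returns per-file metadata including a size column and would make step 1 expressible in Daft today.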