Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Checking file sizes #2851

Open dioptre opened 4 days ago

dioptre commented 4 days ago

Still would like a way to check the size of a table before bringing it down the stream!

Could we get the list you suggested? https://github.com/Eventual-Inc/Daft/pull/1558

jaychia commented 1 day ago

Hey @dioptre!

This use-case is already technically possible today, like so:

# STEP 1: Run your own code to filter the list of files
files = get_files()  # user-supplied; yields records with "path" and "size"
filtered_filepaths = [f["path"] for f in files if f["size"] < 100_000_000]

# STEP 2: Now use Daft to read it
import daft

df = daft.read_parquet(filtered_filepaths)
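`get_files()` above is left to the user. As a concrete stand-in (purely hypothetical, and assuming local files rather than object storage), a minimal version that returns the same shape of records could look like:

```python
import os

def get_files(directory):
    """Hypothetical stand-in for get_files(): list the files in a local
    directory along with their sizes, mirroring the {"path", "size"}
    records assumed by the filtering snippet above."""
    records = []
    for entry in os.scandir(directory):
        if entry.is_file():
            records.append({"path": entry.path, "size": entry.stat().st_size})
    return records
```

For files in S3 you would list objects instead (e.g. with an S3 client's list-objects call), which already returns each object's size, so no extra per-file metadata requests are needed.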

We could technically optimize this by using Daft to do step 1 as well -- is that currently a bottleneck in your workload, and do you think that would be helpful?

# STEP 1: Receive a list of filepaths and perform a metadata fetch on each
# file (e.g. file size, created-at timestamp, etc.)
df = daft.from_file_paths(["s3://...", "s3://...", "s3://..."])
df = df.where(df["size"] < 100_000_000)
filtered_filepaths = df.to_pydict()["filepaths"]

# STEP 2: Now use Daft to read it
df = daft.read_parquet(filtered_filepaths)
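The proposed flow boils down to "fetch metadata, filter on size, collect paths". As a sketch of just that filtering step in plain Python (with the metadata fetch stubbed out as a list of dicts, since `from_file_paths` is only a proposal here):

```python
def filter_paths_by_size(metadata_rows, max_bytes=100_000_000):
    """Keep only the paths of files strictly smaller than max_bytes.

    metadata_rows models the {"path", "size"} records that a per-file
    metadata fetch would return; names and sizes below are illustrative.
    """
    return [row["path"] for row in metadata_rows if row["size"] < max_bytes]

rows = [
    {"path": "s3://bucket/small.parquet", "size": 50_000_000},
    {"path": "s3://bucket/huge.parquet", "size": 5_000_000_000},
]
small_files = filter_paths_by_size(rows)  # keeps only small.parquet
```

Separately, it may be worth checking whether your Daft version already exposes `daft.from_glob_path`, which (if available) returns per-file metadata including a size column and would make step 1 expressible in Daft today.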