Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Provide ability to filter out invalid parquets or parquet metadata from glob path #3008

Open MisterKloudy opened 1 week ago

MisterKloudy commented 1 week ago

When malformed Parquet files (e.g. an invalid footer or invalid metadata) are produced by upstream writes that were interrupted or disrupted, all downstream jobs fail to run, and any such problem is completely blocking.

For cases where the success of the remaining jobs is more important than the completeness of the data, this is a big problem: filtering out or deleting the malformed files and then retriggering the tasks is tedious and manual.

It would be good to have the choice to either (1) fail the job on such errors (the status quo), or (2) write the invalid paths out to a separate small log file and continue with the job.

jaychia commented 6 days ago

This is really tricky... We need to balance user experience with not "failing silently"

What if we provided a way to pass in a list of filepaths, and run some kind of Daft expression to validate whether each of those files is valid Parquet? Users could then run these jobs as a mechanism to figure out which Parquet files in a given glob are invalid.

e.g.

import daft

df = daft.from_glob_path("s3://...")
df = df.where(~df["path"].is_parquet_file())  # hypothetical validation expression
df.collect()

Then a user could potentially do cleanups by themselves if they have a job that is failing because some Parquet files are invalid/corrupted.
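In the meantime, a cheap way to pre-screen files is to check the Parquet magic bytes yourself. This is a sketch, not Daft API: the Parquet format requires the 4 bytes `PAR1` at both the start and the end of a valid file, so an interrupted write that truncated the file usually fails this check (it will not catch corruption where the magic survives but the footer metadata is garbage).

```python
import os


def looks_like_parquet(path: str) -> bool:
    """Cheap structural check: a valid Parquet file starts and ends with b"PAR1".

    Returns False for files that are too short, missing either magic, or unreadable.
    """
    try:
        # Minimum plausible size: leading magic (4) + footer length (4) + trailing magic (4).
        if os.path.getsize(path) < 12:
            return False
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
        return head == b"PAR1" and tail == b"PAR1"
    except OSError:
        return False
```

For S3 paths the same idea works with two small ranged GETs (first and last 4 bytes) instead of a local `open`.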

MisterKloudy commented 6 days ago

Yes I think that would be a great way to go about it! I wouldn't mind having to call an additional function.

I actually tend to use `/**` directly in `read_parquet`, though, so I guess this would need to be able to do the filter before the read fails?

Or maybe a daft.validate_parquet() to keep it completely separate?

Also, failing silently wouldn't be the norm if we had to activate the functionality with another function or new parameter. It would be controlled failing with nice fallbacks! I definitely wouldn't want it to be the default behaviour for people who aren't looking for it.
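Option (2) from the original request could look something like this sketch (none of these names are real Daft API; `check` is any per-file validator, e.g. a magic-bytes or footer check): split the glob results into valid and invalid paths, write the invalid ones to a log, and carry on with the valid set.

```python
def partition_parquet_paths(paths, check):
    """Split paths into (valid, invalid) using a per-file validator callable."""
    valid, invalid = [], []
    for p in paths:
        (valid if check(p) else invalid).append(p)
    return valid, invalid


def log_invalid(invalid, log_path="invalid_parquet_paths.log"):
    """Write one invalid path per line, mirroring option (2): log and continue."""
    with open(log_path, "w") as f:
        f.write("\n".join(invalid))
```

The job would then call `read_parquet` on `valid` only, with the log file serving as the "controlled failing with nice fallbacks" record.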