dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Ability to drop a partition (hive partitioned format) #691

Open Paul424 opened 2 years ago

Paul424 commented 2 years ago

In our application we mainly append new data to a parquet dataset, for example when a user uploads content. Sometimes, however, a user wants to close their account and we need to remove all data related to that user.

Luckily we partition by user-id, so we could simply run "rm -rf" on those parquet files. The obvious issue is that these files are still referenced by the _metadata file.

We have been searching but can't find any way to "edit" this top-level metadata.

One option is to use metadata_from_many, but on larger datasets that is not feasible.

Questions we have: 1) What is the impact of simply removing those parquet files? 2) Are there "edit"-like libraries/tools available to modify _metadata?

Tips are appreciated, Paul

yohplala commented 2 years ago

Hi @Paul424, what a coincidence :) Your ticket seems closely related to #676, and I am working on it at this very moment through PR #689.

Applied to your need: once PR #689 is completed, the following code should do the trick:

from fastparquet import ParquetFile
from fastparquet.api import filter_row_groups

# Your 'Parquet database'
pf = ParquetFile('/home/my_data/my_database')

# Identify the row groups to remove, say for customers 'Luc' and 'Georges'.
# This is safe only because the dataset is partitioned on 'customers': whole row groups
# are removed, so without partitioning other customers' data could be deleted as well.
row_groups_to_remove = filter_row_groups(pf, [('customers', 'in', ['Luc', 'Georges'])])
# Remove them (remove_row_groups requires PR #689)
pf.remove_row_groups(row_groups_to_remove)

I will certainly be interested in your feedback once the PR is completed. Bests

martindurant commented 2 years ago

For reference @Paul424 , the essence is to mutate the ParquetFile's .fmd.row_groups list to remove what you don't want, and then use write.write_metadata to update the _metadata file from the altered fmd. You can do this now without the linked PR, but it would take some lines of code.
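To illustrate the selection step only, here is a minimal sketch that does not import fastparquet. It assumes that each row group exposes its data file through columns[0].file_path and that partition directories follow the hive "key=value" naming. The helpers partition_values, split_row_groups and fake_row_group are hypothetical names used for this illustration, and the stand-in objects only mimic the shape of pf.fmd.row_groups.

```python
from types import SimpleNamespace


def partition_values(file_path):
    """Return the hive-style key=value directory segments of a relative path."""
    values = {}
    # Ignore the last segment: it is the data file name, not a partition.
    for segment in file_path.replace("\\", "/").split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


def split_row_groups(row_groups, key, unwanted):
    """Split row groups into (kept, dropped) according to one partition key."""
    unwanted = {str(v) for v in unwanted}
    kept, dropped = [], []
    for rg in row_groups:
        # Assumption: all column chunks of a row group live in the same file,
        # so the first chunk's file_path identifies the partition.
        path = rg.columns[0].file_path or ""
        if partition_values(path).get(key) in unwanted:
            dropped.append(rg)
        else:
            kept.append(rg)
    return kept, dropped


def fake_row_group(path):
    """Hypothetical stand-in for one entry of pf.fmd.row_groups."""
    return SimpleNamespace(columns=[SimpleNamespace(file_path=path)])


row_groups = [
    fake_row_group("user_id=1/part.0.parquet"),
    fake_row_group("user_id=2/part.0.parquet"),
    fake_row_group("user_id=2/part.1.parquet"),
    fake_row_group("user_id=3/part.0.parquet"),
]
kept, dropped = split_row_groups(row_groups, "user_id", [2])
kept_paths = [rg.columns[0].file_path for rg in kept]
dropped_paths = [rg.columns[0].file_path for rg in dropped]
```

With a real dataset, the kept list would be assigned back to pf.fmd.row_groups before rewriting _metadata as described above, and the dropped paths are the files that can then be deleted.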

yohplala commented 2 years ago

@Paul424 If you are interested in PR #689, you can clone the corresponding repo and give it a try. Beware that if you are accessing a remote parquet dataset using fsspec, you will need to set the remove_with and open_with parameters. Here is an example using LocalFileSystem. This may be subject to change, depending on @martindurant's feedback regarding the code.

from fsspec.implementations.local import LocalFileSystem
fs = LocalFileSystem()
remove_with = (fs.rm_file, fs.listdir, fs.rmdir)
open_with = fs.open

...
pf.remove_row_groups(rgs, open_with=open_with, remove_with=remove_with)

Bests