data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Data partition should be deleted automatically based on bucket policy #508

Open rayeesn opened 1 year ago

rayeesn commented 1 year ago

Is your feature request related to a problem? Please describe.

We use partitioned data in S3 buckets that are configured with a lifecycle policy to delete the data after a specific date. When the data is deleted, the corresponding partitions need to be deleted as well, in line with that policy.

Describe the solution you'd like

When creating a dataset, we need an option such as "Want to delete partition based on bucket lifecycle". If the user selects yes, the partitions should be deleted whenever the lifecycle policy deletes the underlying data.

dlpzx commented 1 year ago

Hi @rayeesn,

I see your issue: you need the deletion of Glue partitions to be automated and synchronized with your S3 deletion rules. How are you currently managing this operation?

Right now we are in the middle of refactoring the code (modularization, data.all v2.0.0). We won't be implementing new features until this workstream is complete in August 2023. But we can work on understanding the requirements and designing the feature so that we can consider it for Q3-Q4 2023.

rayeesn commented 1 year ago

Currently we run an Airflow scheduler that deletes the partitions based on the retention policy.
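For illustration, a retention-based cleanup of this kind could look roughly like the sketch below. It is not the poster's actual Airflow job. It assumes partitions are keyed by a single `YYYY-MM-DD` date value and uses a hypothetical 30-day default retention; the function names are made up for this example, and `glue` is expected to be a boto3 Glue client (only its `get_partitions` paginator and `batch_delete_partition` are used).

```python
from datetime import date, datetime, timedelta


def expired_partitions(partition_values, today, retention_days=30):
    """Return the partition value lists whose first value is a date before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for values in partition_values:
        try:
            partition_date = datetime.strptime(values[0], "%Y-%m-%d").date()
        except (ValueError, IndexError):
            continue  # assumption: partitions that are not date-keyed are left alone
        if partition_date < cutoff:
            expired.append(values)
    return expired


def delete_expired_partitions(glue, database, table, today=None, retention_days=30):
    """Delete Glue partitions older than the retention window; return what was deleted."""
    today = today or date.today()
    paginator = glue.get_paginator("get_partitions")
    values = [
        partition["Values"]
        for page in paginator.paginate(DatabaseName=database, TableName=table)
        for partition in page["Partitions"]
    ]
    to_delete = expired_partitions(values, today, retention_days)
    # batch_delete_partition accepts at most 25 partitions per call
    for start in range(0, len(to_delete), 25):
        glue.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=[{"Values": v} for v in to_delete[start:start + 25]],
        )
    return to_delete
```

In a scheduler, `delete_expired_partitions(boto3.client("glue"), "my_db", "my_table")` would be the task body, with the database and table names being placeholders. The retention period has to be kept in step with the bucket's lifecycle rule by hand, which is the duplication this feature request is trying to avoid.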

dlpzx commented 1 year ago

Hi @rayeesn, we are getting back to you now that V2 is ready for release and we are evaluating future development. I will try to describe your feature request in more detail; please confirm whether this is what you are looking for. If there is any mismatch or more details are needed, please feel free to add them.

As a user, I would like to be able to create data.all datasets with a toggle that specifies "Want to delete partition based on bucket lifecycle" = true/false. If true, the Glue partitions of the tables in that dataset are deleted when the corresponding data is deleted from the S3 bucket.
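To make the requested behaviour concrete, here is a minimal sketch of what such a toggle could trigger. It is not data.all code and not a proposed design: `prune_empty_partitions` and `split_s3_location` are hypothetical names, and the approach (checking each partition's S3 location and dropping the partition when nothing is left there) is only one assumed way to follow lifecycle deletions. `glue` and `s3` are expected to be boto3 clients.

```python
from urllib.parse import urlparse


def split_s3_location(location):
    """Split 's3://bucket/some/prefix' into (bucket, 'some/prefix/')."""
    parsed = urlparse(location)
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # avoid matching sibling prefixes such as dt=2023-01-010
    return parsed.netloc, prefix


def prune_empty_partitions(glue, s3, database, table):
    """Delete Glue partitions whose S3 location no longer holds any objects."""
    orphaned = []
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            bucket, prefix = split_s3_location(partition["StorageDescriptor"]["Location"])
            # One key is enough to know the partition still has data
            listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
            if listing.get("KeyCount", 0) == 0:
                orphaned.append({"Values": partition["Values"]})
    # batch_delete_partition accepts at most 25 partitions per call
    for start in range(0, len(orphaned), 25):
        glue.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=orphaned[start:start + 25],
        )
    return [entry["Values"] for entry in orphaned]
```

Checking S3 directly means the retention period does not have to be configured twice, at the cost of one list request per partition on every run.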

Some additional questions:

zsaltys commented 6 months ago

@dlpzx I would argue that data.all should not get into the business of managing S3 lifecycle rules or Glue partition retention. Users should manage that themselves, outside of data.all.

I propose to close this as "won't do".