data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Add ability to re-apply KMS and bucket policy changes #827

Closed zsaltys closed 8 months ago

zsaltys commented 1 year ago

We are contributing a feature that allows data.all to modify bucket policies so that consumer roles can be registered directly on them. Additionally, if a bucket is encrypted with a KMS key, the KMS key policy must also be updated with the consumer roles.

This means that if the S3 bucket or the KMS key was created with CloudFormation, then from the moment the bucket and KMS key are imported into data.all the owners must STOP managing the bucket policy and the KMS key policy; otherwise any changes made by data.all will be overridden by CloudFormation stack updates. We can warn and educate users that this may happen, but mistakes can still be made: if there is a way for something to go wrong, eventually it will.

In such a situation our only recourse would be to manually fix the KMS and bucket policies and add back all the missing consumer roles, which could number in the hundreds. This is why I propose that, once S3 bucket sharing is contributed to data.all, we work on a feature that allows dataset owners to re-apply all policy changes made via the SDK. Owners of buckets and keys could then keep managing their policies through CF, as long as they re-apply data.all permissions via the UI.
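
For illustration only, here is a minimal boto3 sketch of what such a re-apply step could look like for a bucket policy: it merges the approved consumer role ARNs back into a data.all-managed statement after a CloudFormation update has overwritten it. The statement SID, actions, and function name are assumptions, not data.all's actual implementation.

```python
# Hedged sketch, not data.all's actual code: re-add approved consumer role ARNs
# to a data.all-managed bucket policy statement after it was overwritten.
import json
import boto3

s3 = boto3.client("s3")

def reapply_consumer_roles(bucket_name, statement_sid, approved_role_arns):
    policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])
    statement = next((s for s in policy["Statement"] if s.get("Sid") == statement_sid), None)
    if statement is None:
        # The whole statement was wiped out: recreate it (actions are illustrative).
        statement = {
            "Sid": statement_sid,
            "Effect": "Allow",
            "Principal": {"AWS": []},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket_name}", f"arn:aws:s3:::{bucket_name}/*"],
        }
        policy["Statement"].append(statement)
    current = statement["Principal"].get("AWS", [])
    if isinstance(current, str):
        current = [current]
    # Merge instead of replace, so principals added outside data.all are preserved.
    statement["Principal"]["AWS"] = sorted(set(current) | set(approved_role_arns))
    s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```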

zsaltys commented 1 year ago

@noah-paige @dlpzx please let me know what you think of this proposed feature.

dlpzx commented 1 year ago

Hi @zsaltys, thanks for opening an issue. This is a challenge we already identified in the design of the S3 bucket sharing feature, so we clearly see the point of this GitHub issue. We can work together on a feature for V2.X that ensures that, even for imported S3 buckets and KMS keys, overrides from external IaC processes can be resolved by data.all.

To "Fix corrupted shares" we need to implement some code:

  1. code that scans the S3 and KMS policies and ensures that every approved share has its corresponding statement (a verification sketch follows this list)
  2. code that re-applies the shares if needed (it may just be a matter of re-triggering the share; I need to review the S3 bucket sharing again)
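
As a rough illustration of point 1, the check could compare the approved consumer role ARNs against the principals found in the data.all-managed statements of the bucket policy and the KMS key policy. The statement SIDs and function names below are assumptions, not the actual data.all code.

```python
# Hedged sketch of the verification step: return the approved principals that
# are missing from the data.all-managed statements. SIDs are illustrative only.
import json
import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")

def principals_in(policy, sid):
    for stmt in policy.get("Statement", []):
        if stmt.get("Sid") == sid:
            aws = stmt.get("Principal", {}).get("AWS", [])
            return {aws} if isinstance(aws, str) else set(aws)
    return set()

def find_missing_principals(bucket_name, kms_key_id, approved_role_arns,
                            bucket_sid="DataAllConsumers", kms_sid="DataAllConsumers"):
    bucket_policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])
    key_policy = json.loads(kms.get_key_policy(KeyId=kms_key_id, PolicyName="default")["Policy"])
    approved = set(approved_role_arns)
    return {
        "bucket_policy": approved - principals_in(bucket_policy, bucket_sid),
        "kms_key_policy": approved - principals_in(key_policy, kms_sid),
    }
```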

And then we need to decide how we run the code. This is where I would like your opinion, because it is what really affects the user experience.

  1. On-demand vs on-schedule - users trigger the "re-apply" from the UI, or an overnight scheduled "re-apply" task runs on all shares, or a mix of both
  2. In ECS tasks vs in Lambda - depending on the previous decision and the complexity of the code, we will decide on the compute resource that is more adequate

Let me know your thoughts

zsaltys commented 11 months ago

@dlpzx thanks for your comments!

  1. On-demand vs on-schedule. I think it is important to have it on demand. I would put it on the dataset UI behind a separate permission that we would normally grant to environment admin teams. The reason for on-demand is that if shares get corrupted (for example, someone accidentally wiped them out by re-triggering a CF stack), the owners are likely very aware of the issue because data pipelines are failing, and they will want to fix the production outage ASAP. However, it is also possible that shares are corrupted and pipelines are failing but the issue has not yet been reported, for example when it happens late on a Friday evening and pipelines keep failing over the weekend. For such cases we could also run it on-schedule once a night. For now I would say having it on-demand is much more important than on-schedule.

  2. I think if we implement it as on-demand on a dataset, it should always be a quick operation that finishes within 15 minutes, because we are only fixing KMS and bucket policy permissions and nothing else. However, if we want to run it on-schedule, we are potentially re-checking thousands of shares, which may easily take more than 15 minutes and should therefore run in ECS. If we expose the on-demand GraphQL endpoints as Lambdas, then as a future improvement an ECS task could re-check all of them by triggering those GraphQL endpoints one by one (see the sketch below).
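
To make that future improvement concrete, the fan-out could look roughly like the sketch below: an ECS task loops over datasets and calls the same on-demand GraphQL mutation the UI would use. The endpoint URL, mutation name, and auth environment variables are purely hypothetical.

```python
# Hypothetical fan-out sketch: an ECS task re-checks every dataset by calling
# the on-demand GraphQL mutation one dataset at a time. Endpoint, mutation name
# and auth are illustrative assumptions, not data.all's actual API.
import os
import requests

GRAPHQL_URL = os.environ["DATAALL_GRAPHQL_URL"]                # assumed environment variable
AUTH_HEADERS = {"Authorization": os.environ["DATAALL_TOKEN"]}  # assumed token

REAPPLY_MUTATION = """
mutation ReapplyDatasetShares($datasetUri: String!) {
  reapplyDatasetShares(datasetUri: $datasetUri)
}
"""

def reapply_all(dataset_uris):
    for dataset_uri in dataset_uris:
        # One dataset per call keeps each Lambda invocation well under 15 minutes.
        response = requests.post(
            GRAPHQL_URL,
            json={"query": REAPPLY_MUTATION, "variables": {"datasetUri": dataset_uri}},
            headers=AUTH_HEADERS,
            timeout=60,
        )
        response.raise_for_status()
```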

I was also thinking a little more about this. For on-demand runs, I think we should have it in two places:

1) on the dataset, to fix the entire dataset in case a dataset owner breaks the bucket policies or KMS policies
2) on the share itself, where both the requester and the approver could re-run a share. This covers cases where the owner or requester breaks Lake Formation sharing, for example by accidentally deleting a share.

Both of these should be GraphQL endpoints that run asynchronously in a Lambda and report back via the notifications UI that the share or dataset has been fixed.
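
A handler for the dataset-level endpoint could be shaped roughly like the sketch below. All helper names (get_approved_shares, reapply_share, notify) are placeholders for illustration, not data.all's actual modules; they are stubbed out here only to show the asynchronous fix-then-notify flow.

```python
# Illustrative shape only: an asynchronous handler for a hypothetical
# "reapply dataset shares" mutation that fixes a dataset's shares and then
# records a notification for the UI. All helpers below are placeholders.

def get_approved_shares(dataset_uri):
    """Placeholder: would query RDS for the approved share URIs of this dataset."""
    raise NotImplementedError

def reapply_share(share_uri):
    """Placeholder: would re-apply the S3 bucket, KMS key and Lake Formation grants."""
    raise NotImplementedError

def notify(dataset_uri, message):
    """Placeholder: would create a notification record shown in the notifications UI."""
    raise NotImplementedError

def reapply_dataset_shares_handler(event, context):
    dataset_uri = event["datasetUri"]
    failures = []
    for share_uri in get_approved_shares(dataset_uri):
        try:
            reapply_share(share_uri)
        except Exception as exc:
            failures.append((share_uri, str(exc)))
    message = "All shares re-applied" if not failures else f"{len(failures)} share(s) still failing"
    notify(dataset_uri, message)
    return {"datasetUri": dataset_uri, "failures": failures}
```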

dlpzx commented 11 months ago

Hi @zsaltys, following up on this feature. I created an issue to review the recent release of S3 Access Grants (#900). We need to assess to what extent this new S3 feature can replace the current S3 sharing mechanisms offered by data.all.

For the case in which we stay with S3 / S3 Access Points / KMS resource policies + IAM role policies sharing, we will still need to design a solution for resource-policy overwrites. Here is a proposal of the different sub-features to add (in order of priority).

This design does not include the config.json changes needed. For the moment we assume it is a feature that all customers get when deploying data.all. We will need to evaluate if we want to make it configurable.

1. on-demand Share requester "verify-share-request" & "reapply-share-request"

Use case: a data consumer observes a failure in one of their consumption use cases and wants to verify and re-apply the share to the KMS and S3 resource policies.

Backend

Frontend

2. on-demand Dataset owner "verify-dataset-share-requests" & "reapply-dataset-share-requests"

Use-case: the dataset owner has made changes to their dataset and wants to validate and, if needed, remediate broken shares.

Backend

Depending on the execution time of the task, we may or may not need to offload it to ECS. Imagine the scenario of X shares for the same dataset. In case ECS is needed:

Frontend

3. scheduled "verify-all" and "re-apply-all"

Use case: as a guardrail, a scheduled task verifies and re-applies all shares to the KMS and S3 resource policies.

Backend

Example 1 - Using share-level functions: loop over all share requests and invoke the "verify-share" function, which updates the RDS table for each share. Then query for all corrupted share requests, loop over them, and call "reapply-share-request" for each of the failed shares.

Example 2 - Using dataset-level functions: the same, but looping over datasets instead of share requests.

I am personally more inclined towards option 2, as it handles a dataset at once and ensures there are no conflicts between shares on the same dataset.
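
Sketched in Python, the dataset-level scheduled task (Example 2) could look like the following; the helper names are assumptions, and the real task would run in ECS as discussed above.

```python
# Hedged sketch of the dataset-level scheduled task (Example 2). Helper names
# are placeholders, not data.all's actual modules.

def list_datasets_with_active_shares():
    """Placeholder: would query RDS for datasets that have approved shares."""
    raise NotImplementedError

def verify_dataset_shares(dataset_uri):
    """Placeholder: would check S3, KMS and Lake Formation state and return corrupted share URIs."""
    raise NotImplementedError

def reapply_dataset_shares(dataset_uri, share_uris):
    """Placeholder: would re-apply the grants for the given shares only."""
    raise NotImplementedError

def nightly_verify_and_reapply():
    # Process one dataset at a time so that concurrent fixes never touch the
    # same bucket policy or KMS key policy, which is the motivation for Example 2.
    for dataset_uri in list_datasets_with_active_shares():
        corrupted = verify_dataset_shares(dataset_uri)
        if corrupted:
            reapply_dataset_shares(dataset_uri, corrupted)
```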

Frontend

Nothing, it runs in the background without user interaction.

This is an initial design, subject to changes and modifications as we advance in the implementation.

zsaltys commented 10 months ago

@noah-paige @dlpzx There are some new insights into this problem.

There is a new scenario we discovered where people delete the table in Glue and expect the shares to keep working, which is obviously not possible because the Lake Formation permissions get dropped. However, the share will still show as "SHARED".

One way to deal with this would be to automatically test shares regularly and try to fix them automatically. But as we discussed, that may not always be desirable: others might expect that dropping a table and recreating it with the same name does not mean it should be re-shared automatically.

What might be better is to regularly test that shares are working and, if they start failing, send a notification to the approver and the requester that the share is failing. The notification could be sent every time the test fails, for example once a day, so there is a constant reminder to fix the issue, whatever it was. At that point someone can go to the share UI and click a button to ask data.all to re-apply all of the permissions in an attempt to fix the share, or perhaps to remove the share.

I think only the APPROVER should be able to click the button to re-apply permissions, because they are the ones who need to decide whether the permissions should be re-applied.

So in short the suggestion would be:

1) automatically and periodically test that shares are working and send notifications to requesters and approvers if they stop working (one possible check is sketched below)
1.1) show on the share UI if a share is failing
2) have a button on a share to fix it manually, which only an approver can click
3) have a button on a dataset to fix all of its shares in one swoop, in case someone broke many shares by wiping out a bucket policy or KMS policy
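
As one concrete example of such a periodic test, the sketch below checks that a shared Glue table still exists before trusting its "SHARED" status and notifies rather than auto-fixing; the notification helper is a placeholder, not data.all's actual notifications module.

```python
# Hedged sketch of one periodic health check: detect the dropped-Glue-table
# scenario and notify instead of auto-fixing. The notification helper is a
# placeholder, not data.all's actual code.
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")

def glue_table_exists(database, table):
    try:
        glue.get_table(DatabaseName=database, Name=table)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "EntityNotFoundException":
            return False
        raise

def notify_share_failure(share_uri, reason):
    """Placeholder: would notify the requester and the approver in the UI."""
    raise NotImplementedError

def check_table_share(share_uri, database, table):
    if not glue_table_exists(database, table):
        # Dropping and recreating the table removes the Lake Formation grants,
        # so the share is broken even though its status still reads SHARED.
        notify_share_failure(share_uri, f"Glue table {database}.{table} no longer exists")
```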

noah-paige commented 9 months ago

Share Verifier

Re-Apply Share

Configuration

noah-paige commented 9 months ago

Making updates to the above options / considerations after internal alignment on this issue's design:

Share Verifier

[Two screenshots attached (2024-02-08)]

Re-Apply Share

Additional Considerations

Going Further

noah-paige commented 9 months ago

Share Item Health Columns Lifecycle