data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Add ability to re-apply KMS and bucket policy changes #827

Closed zsaltys closed 8 months ago

zsaltys commented 1 year ago

We are contributing a feature that allows data.all to modify bucket policies so that consumer roles can be registered directly on them. Additionally, if a bucket is encrypted with a KMS key, the KMS key policy must also be updated with the consumer roles.

This means that if the S3 bucket or the KMS key was created with CloudFormation, then from the moment the bucket and KMS key are imported into data.all the owners must STOP managing the bucket policy and the KMS key policy; otherwise any changes made by data.all will be overridden by CloudFormation stack updates. We can warn and educate users that this may happen, but mistakes can still be made: if there is a way for something to go wrong, eventually it will.

In such a situation our only recourse would be to manually fix the KMS and bucket policies and add back all the missing consumer roles, which could number in the hundreds. This is why I propose that, once S3 bucket sharing is contributed to data.all, we work on a feature that allows dataset owners to re-apply all policy changes made via the SDK. Owners of buckets and keys could then keep managing their policies through CF, as long as they re-apply data.all permissions via the UI.
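
For illustration only, here is a minimal boto3 sketch of what such a re-apply step could look like for a bucket policy: it merges the approved consumer role ARNs back into a data.all-managed statement after a CloudFormation update has overwritten it. The statement SID, actions, and function name are assumptions, not data.all's actual implementation.

```python
# Hedged sketch, not data.all's actual code: re-add approved consumer role ARNs
# to a data.all-managed bucket policy statement after it was overwritten.
import json
import boto3

s3 = boto3.client("s3")

def reapply_consumer_roles(bucket_name, statement_sid, approved_role_arns):
    policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])
    statement = next((s for s in policy["Statement"] if s.get("Sid") == statement_sid), None)
    if statement is None:
        # The whole statement was wiped out: recreate it (actions are illustrative).
        statement = {
            "Sid": statement_sid,
            "Effect": "Allow",
            "Principal": {"AWS": []},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket_name}", f"arn:aws:s3:::{bucket_name}/*"],
        }
        policy["Statement"].append(statement)
    current = statement["Principal"].get("AWS", [])
    if isinstance(current, str):
        current = [current]
    # Merge instead of replace, so principals added outside data.all are preserved.
    statement["Principal"]["AWS"] = sorted(set(current) | set(approved_role_arns))
    s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```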

zsaltys commented 1 year ago

@noah-paige @dlpzx please let me know what you think of this proposed feature.

dlpzx commented 1 year ago

Hi @zsaltys, thanks for opening an issue. This is a challenge we already identified in the design of the S3 bucket sharing feature, so we clearly see the point of this GitHub issue. We can work together on a feature for V2.X that ensures that, even for imported S3 buckets and KMS keys, overrides from external IaC processes can be resolved by data.all.

To "Fix corrupted shares" we need to implement some code:

  1. code that scans the S3 and KMS policies and ensures that every approved share has its corresponding statement (a verification sketch follows this list)
  2. code that re-applies the shares if needed (it may just be a matter of re-triggering the share; I need to review the S3 bucket sharing again)
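
As a rough illustration of point 1, the check could compare the approved consumer role ARNs against the principals found in the data.all-managed statements of the bucket policy and the KMS key policy. The statement SIDs and function names below are assumptions, not the actual data.all code.

```python
# Hedged sketch of the verification step: return the approved principals that
# are missing from the data.all-managed statements. SIDs are illustrative only.
import json
import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")

def principals_in(policy, sid):
    for stmt in policy.get("Statement", []):
        if stmt.get("Sid") == sid:
            aws = stmt.get("Principal", {}).get("AWS", [])
            return {aws} if isinstance(aws, str) else set(aws)
    return set()

def find_missing_principals(bucket_name, kms_key_id, approved_role_arns,
                            bucket_sid="DataAllConsumers", kms_sid="DataAllConsumers"):
    bucket_policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])
    key_policy = json.loads(kms.get_key_policy(KeyId=kms_key_id, PolicyName="default")["Policy"])
    approved = set(approved_role_arns)
    return {
        "bucket_policy": approved - principals_in(bucket_policy, bucket_sid),
        "kms_key_policy": approved - principals_in(key_policy, kms_sid),
    }
```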

And then we need to decide how we run the code. This is where I would like your opinion, because it is what really affects the user experience.

  1. On-demand vs on-schedule - users trigger the "re-apply" from the UI, or an overnight scheduled "re-apply" task runs on all shares, or a mix of both
  2. In ECS tasks vs in Lambda - depending on the previous decision and the complexity of the code, we will decide on the compute resource that is more adequate

Let me know your thoughts

zsaltys commented 11 months ago

@dlpzx thanks for your comments!

  1. On-demand vs on-schedule. I think it is important to have it on demand. I would put it on the dataset UI behind a separate permission that we would normally grant to environment admin teams. The reason for on-demand is that if shares get corrupted (for example, someone accidentally wiped them out by re-triggering a CF stack), the owners are likely very aware of the issue because data pipelines are failing, and they will want to fix the production outage ASAP. However, it is also possible that shares are corrupted and pipelines are failing but the issue has not yet been reported, for example when it happens late on a Friday evening and pipelines keep failing over the weekend. For such cases we could also run it on-schedule once a night. For now I would say having it on-demand is much more important than on-schedule.

  2. I think if we implement it as on-demand on a dataset, it should always be a quick operation that finishes within 15 minutes, because we are only fixing KMS and bucket policy permissions and nothing else. However, if we want to run it on-schedule, we are potentially re-checking thousands of shares, which may easily take more than 15 minutes and should therefore run in ECS. If we expose the on-demand GraphQL endpoints as Lambdas, then as a future improvement an ECS task could re-check all of them by triggering those GraphQL endpoints one by one (see the sketch below).
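
To make that future improvement concrete, the fan-out could look roughly like the sketch below: an ECS task loops over datasets and calls the same on-demand GraphQL mutation the UI would use. The endpoint URL, mutation name, and auth environment variables are purely hypothetical.

```python
# Hypothetical fan-out sketch: an ECS task re-checks every dataset by calling
# the on-demand GraphQL mutation one dataset at a time. Endpoint, mutation name
# and auth are illustrative assumptions, not data.all's actual API.
import os
import requests

GRAPHQL_URL = os.environ["DATAALL_GRAPHQL_URL"]                # assumed environment variable
AUTH_HEADERS = {"Authorization": os.environ["DATAALL_TOKEN"]}  # assumed token

REAPPLY_MUTATION = """
mutation ReapplyDatasetShares($datasetUri: String!) {
  reapplyDatasetShares(datasetUri: $datasetUri)
}
"""

def reapply_all(dataset_uris):
    for dataset_uri in dataset_uris:
        # One dataset per call keeps each Lambda invocation well under 15 minutes.
        response = requests.post(
            GRAPHQL_URL,
            json={"query": REAPPLY_MUTATION, "variables": {"datasetUri": dataset_uri}},
            headers=AUTH_HEADERS,
            timeout=60,
        )
        response.raise_for_status()
```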

I was also thinking a little more about this. For on-demand runs, I think we should have it in two places:

1) on the dataset, to fix the entire dataset in case a dataset owner breaks the bucket policies or KMS policies
2) on the share itself, where both the requester and the approver could re-run a share. This covers cases where the owner or requester breaks Lake Formation sharing, for example by accidentally deleting a share.

Both of these should be GraphQL endpoints that run asynchronously in a Lambda and report back via the notifications UI that the share or dataset has been fixed.
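
A handler for the dataset-level endpoint could be shaped roughly like the sketch below. All helper names (get_approved_shares, reapply_share, notify) are placeholders for illustration, not data.all's actual modules; they are stubbed out here only to show the asynchronous fix-then-notify flow.

```python
# Illustrative shape only: an asynchronous handler for a hypothetical
# "reapply dataset shares" mutation that fixes a dataset's shares and then
# records a notification for the UI. All helpers below are placeholders.

def get_approved_shares(dataset_uri):
    """Placeholder: would query RDS for the approved share URIs of this dataset."""
    raise NotImplementedError

def reapply_share(share_uri):
    """Placeholder: would re-apply the S3 bucket, KMS key and Lake Formation grants."""
    raise NotImplementedError

def notify(dataset_uri, message):
    """Placeholder: would create a notification record shown in the notifications UI."""
    raise NotImplementedError

def reapply_dataset_shares_handler(event, context):
    dataset_uri = event["datasetUri"]
    failures = []
    for share_uri in get_approved_shares(dataset_uri):
        try:
            reapply_share(share_uri)
        except Exception as exc:
            failures.append((share_uri, str(exc)))
    message = "All shares re-applied" if not failures else f"{len(failures)} share(s) still failing"
    notify(dataset_uri, message)
    return {"datasetUri": dataset_uri, "failures": failures}
```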

dlpzx commented 11 months ago

Hi @zsaltys, following up on this feature. I created an issue to review the recent release of S3 Access Grants (#900). We need to assess to what extent this new S3 feature can replace the current S3 sharing mechanisms offered by data.all.

For the case in which we stay with S3 / S3 Access Points / KMS resource policies + IAM role policies sharing, we will still need to design a solution for resource-policy overwrites. Here is a proposal of the different sub-features to add (in order of priority).

This design does not include the config.json changes needed. For the moment we assume it is a feature that all customers get when deploying data.all. We will need to evaluate if we want to make it configurable.

1. on-demand Share requester "verify-share-request" & "reapply-share-request"

Use case: a data consumer observes a failure in one of their consumption use cases and wants to verify and re-apply the share to the KMS and S3 resource policies.

Backend

Frontend

2. on-demand Dataset owner "verify-dataset-share-requests" & "reapply-dataset-share-requests"

Use-case: the dataset owner has made changes to their dataset and wants to validate and, if needed, remediate broken shares.

Backend

Depending on the execution time of the task, we may or may not need to offload it to ECS. Imagine the scenario of X shares for the same dataset. In case ECS is needed:

Frontend

3. scheduled "verify-all" and "re-apply-all"

Use case: as a guardrail, a scheduled task verifies and re-applies all shares to the KMS and S3 resource policies.

Backend

Example 1 - Using share-level functions: loop over all share requests and invoke the "verify-share" function, which updates the RDS table for each share. Then query for all corrupted share requests, loop over them, and call "reapply-share-request" for each of the failed shares.

Example 2 - Using dataset-level functions: the same, but looping over datasets instead of share requests.

I am personally more inclined towards option 2, as it handles a dataset at once and ensures there are no conflicts between shares on the same dataset.
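
Sketched in Python, the dataset-level scheduled task (Example 2) could look like the following; the helper names are assumptions, and the real task would run in ECS as discussed above.

```python
# Hedged sketch of the dataset-level scheduled task (Example 2). Helper names
# are placeholders, not data.all's actual modules.

def list_datasets_with_active_shares():
    """Placeholder: would query RDS for datasets that have approved shares."""
    raise NotImplementedError

def verify_dataset_shares(dataset_uri):
    """Placeholder: would check S3, KMS and Lake Formation state and return corrupted share URIs."""
    raise NotImplementedError

def reapply_dataset_shares(dataset_uri, share_uris):
    """Placeholder: would re-apply the grants for the given shares only."""
    raise NotImplementedError

def nightly_verify_and_reapply():
    # Process one dataset at a time so that concurrent fixes never touch the
    # same bucket policy or KMS key policy, which is the motivation for Example 2.
    for dataset_uri in list_datasets_with_active_shares():
        corrupted = verify_dataset_shares(dataset_uri)
        if corrupted:
            reapply_dataset_shares(dataset_uri, corrupted)
```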

Frontend

Nothing, it runs in the background without user interaction.

This is an initial design, subject to changes and modifications as we advance in the implementation.

zsaltys commented 10 months ago

@noah-paige @dlpzx There are some new insights into this problem.

There is a new scenario we discovered where people delete the table in Glue and expect the shares to keep working, which is obviously not possible because the Lake Formation permissions get dropped. However, the share will still show as "SHARED".

One way to deal with this would be to automatically test shares regularly and try to fix them automatically. But as we discussed, that may not always be desirable: others might expect that dropping a table and recreating it with the same name does not mean it should be re-shared automatically.

What might be better is to regularly test that shares are working and, if they start failing, send a notification to the approver and the requester that the share is failing. The notification could be sent every time the test fails, for example once a day, so there is a constant reminder to fix the issue, whatever it was. At that point someone can go to the share UI and click a button to ask data.all to re-apply all of the permissions in an attempt to fix the share, or perhaps to remove the share.

I think only the APPROVER should be able to click the button to re-apply permissions, because they are the ones who need to decide whether the permissions should be re-applied.

So in short the suggestion would be:

1) automatically and periodically test that shares are working and send notifications to requesters and approvers if they stop working (one possible check is sketched below)
1.1) show on the share UI if a share is failing
2) have a button on a share to fix it manually, which only an approver can click
3) have a button on a dataset to fix all of its shares in one swoop, in case someone broke many shares by wiping out a bucket policy or KMS policy
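
As one concrete example of such a periodic test, the sketch below checks that a shared Glue table still exists before trusting its "SHARED" status and notifies rather than auto-fixing; the notification helper is a placeholder, not data.all's actual notifications module.

```python
# Hedged sketch of one periodic health check: detect the dropped-Glue-table
# scenario and notify instead of auto-fixing. The notification helper is a
# placeholder, not data.all's actual code.
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")

def glue_table_exists(database, table):
    try:
        glue.get_table(DatabaseName=database, Name=table)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "EntityNotFoundException":
            return False
        raise

def notify_share_failure(share_uri, reason):
    """Placeholder: would notify the requester and the approver in the UI."""
    raise NotImplementedError

def check_table_share(share_uri, database, table):
    if not glue_table_exists(database, table):
        # Dropping and recreating the table removes the Lake Formation grants,
        # so the share is broken even though its status still reads SHARED.
        notify_share_failure(share_uri, f"Glue table {database}.{table} no longer exists")
```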

noah-paige commented 9 months ago

Share Verifier

Re-Apply Share

Configuration

noah-paige commented 9 months ago

Making updates to the above options / considerations after internal alignment on this issue's design:

Share Verifier

[Two screenshots attached (2024-02-08)]

Re-Apply Share

Additional Considerations

Going Further

noah-paige commented 9 months ago

Share Item Health Columns Lifecycle