@noah-paige @dlpzx please let me know what you think of this proposed feature.
Hi @zsaltys, thanks for opening an issue. This is a challenge that we already identified in the design of the S3 bucket sharing feature, so we clearly see the point of this GitHub issue. We can work together on a feature for V2.X that ensures that, even for imported S3 buckets and KMS keys, overrides from external IaC processes can be resolved by data.all.
To "Fix corrupted shares" we need to implement some code:
And then we need to decide how we run the code. Here is where I would like your opinion because it what really affects the user experience.
Let me know your thoughts
@dlpzx thanks for your comments!
On-demand vs on-schedule. I think it is important to have it on demand. I would have it on the dataset UI as a separate permission which we would normally grant to environment admin teams. The reason for on-demand is that if shares get corrupted (someone accidentally wiped them out, for example by re-triggering the CF stack), then the owners are likely very aware of the issue because data pipelines are failing, so they will want to fix the production outage asap. However, it is also possible that shares are corrupted and pipelines are failing but the issue is not yet reported, for example the issue happens late Friday evening and pipelines keep failing through the weekend. For such cases we could also run it on-schedule once a night. For now I would say having it on-demand is much more important than on-schedule.
I think if we implement it as on-demand on a dataset then it should always be a very quick operation, able to finish in 15 minutes, because we're just fixing KMS and bucket policy permissions and nothing else. However, if you want to run it on-schedule then you're potentially re-checking thousands of shares, which may easily run for more than 15 minutes and should run in ECS. If we implement the on-demand GraphQL endpoints as Lambdas, then, as a future improvement, an ECS task could re-check all shares by triggering those GraphQL endpoints one by one.
I was also thinking a little more on this. I think for on-demand running we should have it in two places:
1) on the dataset, to fix the entire dataset if an owner of a dataset breaks bucket policies or KMS policies
2) on the share itself, where both the requester and approver could re-run a share. This is more for the case where the owner or requester breaks Lake Formation sharing, for example someone accidentally deleting a share.
Both of these should be graphql endpoints which run asynchronously in a lambda and notify back on the notifications UI that a share has been fixed or dataset has been fixed.
Hi @zsaltys, following up on this feature. I created an issue to review the recent release of S3 Access Grants #900. We need to assess to what extent this new S3 feature can replace the current S3 mechanisms offered by data.all.
For the case in which we stay with S3/S3AccessPoints/KMS resource policies + IAM role policies sharing we will still need to design a solution for resource-policy overwrites. Here is a proposal of the different sub-features added (in order of priority)
This design does not include the `config.json` changes needed. For the moment we assume it is a feature that all customers get when deploying data.all. We will need to evaluate if we want to make it configurable.
**Sub-feature 1**
Use case: data consumer observes a failure in one of its consumption use-cases and wants to verify and re-apply the share to KMS and S3 resource-policies.
- New RDS table `share_object_aws_status`. This table has as foreign key the `shareUri` from `share_object` and has the columns `S3PolicyStatus` and `KMSPolicyStatus`. These columns can be booleans or strings if we want to add more options. An additional column `lastVerificationTime` is a datetime that records the last time the verification was done (a rough model sketch follows this list).
- `verify-share-request` --> checks if the requester roles are in the KMS and S3 policies and writes the information back to the RDS `share_object_aws_status`.
- `reapply-share-request` --> adds the requester role to the KMS and/or S3 policies based on the info in `share_object_aws_status`.
- `get-share` API call to return `share_object_aws_status` fields.
- `verify-share-request` and `reapply-share-request` API calls in the Shares/services.
- `VerifyShareButton` added to the ShareView --> this button calls `verify-share-request`.
- `ReApplyShareButton` that is "enabled" depending on the value of the `share_object_aws_status` fields returned by `get-share`.
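To make the table proposal more concrete, here is a minimal sketch of what `share_object_aws_status` could look like as a SQLAlchemy model. The column names follow the list above; the base class, column types and defaults are assumptions for illustration, not the final implementation.

```python
# Illustrative sketch only: a minimal SQLAlchemy model for the proposed table.
# Column names follow the proposal above; base class and types are assumptions.
from datetime import datetime

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ShareObjectAwsStatus(Base):
    __tablename__ = 'share_object_aws_status'

    # Foreign key to the existing share_object table
    shareUri = Column(String, ForeignKey('share_object.shareUri'), primary_key=True)

    # Booleans for now; could become strings if more states are needed
    S3PolicyStatus = Column(Boolean, nullable=False, default=False)
    KMSPolicyStatus = Column(Boolean, nullable=False, default=False)

    # Last time the verification task checked the AWS resource policies
    lastVerificationTime = Column(DateTime, default=datetime.utcnow)
```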
**Sub-feature 2**
Use case: dataset owner has made changes in their dataset and wants to validate and, if needed, remediate broken shares.
- `share_object_aws_status` table defined in sub-feature 1.
- `verify-dataset-share-requests` --> checks if ALL the requesters' roles are in the KMS and S3 policies and writes the information back to the RDS `share_object_aws_status` for each of the dataset share requests (a rough sketch follows this list).
- `reapply-dataset-share-requests` --> adds ALL the requesters' roles to the KMS and/or S3 policies based on the info in `share_object_aws_status` for each of the dataset share requests.
- `get-dataset` API call to return `share_object_aws_status` fields or some stats about the dataset share requests.
- Depending on the execution time of the task, we will need to offload (or not) the task to ECS. Imagine the scenario of X number of shares for the same dataset. In case ECS is needed: task defined in the `deploy` directory.
- `verify-dataset-share-requests` and `reapply-dataset-share-requests` API calls in the Shares/services.
- `VerifyDatasetSharesButton` added to the DatasetView / Shares tab --> this button calls `verify-dataset-share-requests`.
- `ReApplyDatasetSharesButton` that is "enabled" depending on the value of the `share_object_aws_status` fields returned by `get-dataset`.
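A rough sketch of what `verify-dataset-share-requests` could boil down to, assuming boto3 and two hypothetical helpers (`list_dataset_shares`, `save_aws_status`) for reading the dataset's shares and persisting the result; the actual data.all implementation may differ.

```python
# Hedged sketch of a dataset-level verification pass (not data.all's actual code).
# list_dataset_shares() and save_aws_status() are hypothetical helpers.
import json

import boto3


def verify_dataset_share_requests(dataset):
    s3 = boto3.client('s3')
    kms = boto3.client('kms')

    bucket_policy = json.loads(
        s3.get_bucket_policy(Bucket=dataset.bucket_name)['Policy']
    )
    key_policy = json.loads(
        kms.get_key_policy(KeyId=dataset.kms_key_id, PolicyName='default')['Policy']
    )

    for share in list_dataset_shares(dataset):  # hypothetical helper
        role_arn = share.requester_role_arn
        s3_ok = _principal_in_policy(bucket_policy, role_arn)
        kms_ok = _principal_in_policy(key_policy, role_arn)
        # Persist S3PolicyStatus / KMSPolicyStatus back to share_object_aws_status
        save_aws_status(share.shareUri, s3_ok, kms_ok)  # hypothetical helper


def _principal_in_policy(policy: dict, role_arn: str) -> bool:
    """Return True if the role ARN appears as an Allow principal in any statement."""
    for statement in policy.get('Statement', []):
        if statement.get('Effect') != 'Allow':
            continue
        principals = statement.get('Principal', {}).get('AWS', [])
        if isinstance(principals, str):
            principals = [principals]
        if role_arn in principals:
            return True
    return False
```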
**Sub-feature 3**
Use case: as a guardrail, a scheduled task verifies and re-applies all shares to KMS and S3 resource-policies.
- Scheduled task defined in the `deploy` directory.
- Example 1 - Using share-level functions: loop over all share requests, invoke the "verify-share" function that updates the RDS table for each share. Query for all corrupted share requests. Loop over the corrupted share requests and call "reapply-share-request" for each of the failed shares.
- Example 2 - Using dataset-level functions: same, but looping on the datasets instead of on the share requests.

I am personally more inclined towards option 2, as it manages a dataset at once and ensures there are no conflicts between shares on the same dataset (a rough sketch of this option follows below).

From the user's perspective: nothing, it runs in the background without user interaction.
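For the scheduled guardrail with option 2, the ECS task could be little more than the loop below. `list_all_datasets`, `list_dataset_shares`, `verify_dataset_share_requests` and `reapply_dataset_share_requests` are hypothetical stand-ins for the pieces described in the sub-features above.

```python
# Sketch of the scheduled guardrail task (option 2: loop per dataset).
# All called functions are hypothetical stand-ins for the pieces described above.
import logging

log = logging.getLogger(__name__)


def scheduled_share_verifier_task():
    for dataset in list_all_datasets():  # hypothetical helper
        # Refresh share_object_aws_status for every share of the dataset
        verify_dataset_share_requests(dataset)

        # Re-apply only the shares that were found corrupted
        corrupted = [
            s for s in list_dataset_shares(dataset)  # hypothetical helper
            if not (s.aws_status.S3PolicyStatus and s.aws_status.KMSPolicyStatus)
        ]
        if corrupted:
            log.info('Re-applying %d corrupted shares for dataset %s',
                     len(corrupted), dataset.datasetUri)
            reapply_dataset_share_requests(dataset, corrupted)
```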
This is an initial design, subject to changes and modifications as we advance in the implementation.
@noah-paige @dlpzx There are some new insights into this problem.
There is a new scenario we discovered where people delete a table in Glue and expect the shares to keep working, which is obviously not possible because the Lake Formation permissions get dropped. However, the share will still show as "SHARED".
One way to deal with this would be to automatically test shares regularly and try and fix them automatically. But as we discussed that may not always be desirable. Others might expect that dropping a table and recreating it with the same name does not mean that it should be re-shared automatically.
What might be better is to regularly test that shares are working and, if they start failing, send a notification to the approver and the requester that the share is failing. The notification could be sent every time the tester fails, for example once a day. This way there is a constant reminder to fix the issue, whatever it was. This is the point where someone can go to the share UI and click a button to ask data.all to re-apply all of the permissions to attempt to fix the share, or perhaps to remove the share.
I think only the APPROVER should be able to click the button to re-apply permissions because they are the ones who need to make the decision if the permissions need to be re-applied.
So in short the suggestion would be:
1) automatically and periodically test that shares are working and send notifications to requesters and approvers if they stop working (a rough sketch of this check is below the list)
1.1) show on the share UI if a share is failing
2) have a button on a share to fix it manually, which only an approver can click
3) have a button on a dataset to fix all of its shares in one swoop, in case someone broke many shares by wiping out a bucket policy or KMS policy
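For point 1, a rough notify-only sketch (no automatic re-apply), where `notify` stands in for whatever notification mechanism data.all exposes and the other helpers are the hypothetical ones from the earlier design sketches:

```python
# Sketch of the notify-only periodic check from point 1 above.
# verify_dataset_share_requests(), list_all_datasets(), list_dataset_shares()
# and notify() are hypothetical stand-ins for existing data.all pieces.
def daily_share_health_check():
    for dataset in list_all_datasets():
        verify_dataset_share_requests(dataset)
        for share in list_dataset_shares(dataset):
            status = share.aws_status
            if status.S3PolicyStatus and status.KMSPolicyStatus:
                continue
            # Do NOT re-apply automatically; keep reminding until someone fixes it
            message = (
                f'Share {share.shareUri} on dataset {dataset.name} '
                f'is failing verification'
            )
            notify(share.requester_group, message)
            notify(share.approver_group, message)
```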
- `share_object_aws_status` with …
- Configuration: `config.json` to enable/disable the share verifier / re-apply feature

Making updates to the above options / considerations after internal alignment on this issue's design:
- `share_object_item` table updated with findings
- How to Trigger Share Verifier
- How to ReApply Shares
We are contributing a feature that allows data.all to modify bucket policies so that consumer roles can be registered directly on bucket policies. Additionally, if a bucket is encrypted with a KMS key, then the KMS key policy must also be updated with the consumer roles.
This means that if the S3 bucket or the KMS key were created with CloudFormation, then from the moment the bucket and KMS key are imported into data.all the owners must STOP managing the bucket policy and the KMS key policy, because if they don't, any changes made by data.all will be overridden by CloudFormation stack updates. We can warn and educate users that this may happen, but mistakes can still be made - if there is a way for something to happen, then eventually it will happen.
In such a situation our only recourse would be to manually fix the KMS and bucket policies and add all the missing consumer roles, which could number in the hundreds. This is why I propose that, once S3 bucket sharing is contributed to data.all, we work on a feature which allows dataset owners to re-apply all policy changes made by the SDK. Then owners of buckets and keys could in fact continue managing the policies through CF, as long as they re-apply the data.all permissions via the UI.
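To illustrate what "re-apply all policy changes" could mean in practice, here is a hedged boto3 sketch that re-adds missing consumer role ARNs to a bucket policy statement; the statement Sid, actions and overall policy layout are assumptions for illustration, not data.all's actual policy format.

```python
# Illustrative sketch only: re-add missing consumer roles to a bucket policy.
# The statement Sid, actions and policy layout are assumptions, not data.all's format.
import json

import boto3


def reapply_consumer_roles_to_bucket_policy(bucket_name: str, consumer_role_arns: list):
    s3 = boto3.client('s3')
    policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)['Policy'])

    # Find (or create) the statement that grants read access to consumer roles
    statement = next(
        (s for s in policy['Statement'] if s.get('Sid') == 'DataAllConsumerRoles'),  # assumed Sid
        None,
    )
    if statement is None:
        statement = {
            'Sid': 'DataAllConsumerRoles',
            'Effect': 'Allow',
            'Principal': {'AWS': []},
            'Action': ['s3:GetObject', 's3:ListBucket'],
            'Resource': [f'arn:aws:s3:::{bucket_name}', f'arn:aws:s3:::{bucket_name}/*'],
        }
        policy['Statement'].append(statement)

    principals = statement['Principal']['AWS']
    if isinstance(principals, str):
        principals = [principals]

    # Re-add any consumer role that a CloudFormation stack update wiped out
    missing = [arn for arn in consumer_role_arns if arn not in principals]
    if missing:
        statement['Principal']['AWS'] = principals + missing
        s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```

The same approach would apply to the KMS key policy via `kms.get_key_policy` / `kms.put_key_policy`, re-adding any consumer roles that the external IaC process removed.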