data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0
232 stars 82 forks source link

share revoke should not fail when a role is deleted #1141

Closed zsaltys closed 6 months ago

zsaltys commented 7 months ago

We've ran into multiple situations where our users delete a role which is actively used in data.all. We can detect this (with custom scripts) and data.all will let you revoke the S3 share via UI but it will fail trying to revoke a LakeFormation share with an error like:

Failed to get role ROLE_NAME due to: An error occurred (NoSuchEntity) when calling the GetRole operation: The role with name ROLE_NAME cannot be found.

This can only be fixed by manually updating the share to FAILED in RDS and then the share can be deleted via UI. This is very inconvenient because only data.all engineering team can fix such issues rather than making it a responsibility of the share owner.

I suspect that this issue also affects the "share validator" released in 2.3 and that it fails to fix a share when a role is deleted. I propose the following enhancement for this problem:

When attempting to revoke a share detect that a role has been deleted/missing and simply put the share into FAILED mode so that the share can be deleted via UI. This should also happen when share fixer is trying to fix the share either via manual request or when it runs in the background.

This fix is related to this issue https://github.com/data-dot-all/dataall/issues/1029. The above error was observed on 2.2 release.

anushka-singh commented 7 months ago

Confirming re-apply shares wont work if role was deleted for a successful share:

I had a successful share, then I deleted the consumption role. Verify shares, marked share as unhealthy with following errors: Screenshot 2024-04-03 at 9 37 09 AM When I hit re-apply shares, it does work and fails with error:

Screenshot 2024-04-03 at 9 41 17 AM
dlpzx commented 7 months ago

Thanks for opening an issue and being so clear. I think there are some methods in the consumption roles code that we could re-use to implement sharing guardrails. Do you have the bandwidth to take this item?

SofiaSazonova commented 7 months ago

I think, the most logical way to handle this is:

  1. Check the existence of the role during the healthcheck. Make it the first important statement in the result.
  2. If during the healthcheck the role is missing, dataall must revoke all resource permissions and delete dataall managed policies, the share must become 'failed'
  3. Revoke/Reapply share must lead to status 'Share failed', if there is no consumption role. Additionaly dataall must revoke all resource permissions and delete dataall managed policies

@zsaltys @dlpzx @anushka-singh what do you think?

SofiaSazonova commented 7 months ago

Also, @petrkalos suggested the scheme similar to proposed here .

  1. Create an EventBridge rule to listen to IAM role deletion/chane
  2. Handle this events in dataall central account

Above I described the reaction to the problem, but this approach will help us to detect and prevent it earlier.

dlpzx commented 6 months ago

Completed in #1161