data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0
228 stars 82 forks source link

Limit Pivot Role S3 permissions #580

Closed dlpzx closed 10 months ago

dlpzx commented 1 year ago

πŸ†• [UPDATED WITH THE FEEDBACK FROM COMMENTS] Is your idea related to a problem? Please describe. Currently the data.all pivotRole requires permission to all S3 Buckets and KMS keys in the AWS account

Describe the solution you'd like A solution in which access to the data on S3 buckets is restricted to specific roles only. We would like to prevent any data access from other accounts, especially if the pivot role gets compromised.


Analysis of roles in data.all

In data.all central account

role used in
graphql-role Graphql Lambda function
worker-role Worker Lambda function
esproxy-role OpenSearch proxy Lambda function
ecs-role ECS tasks

In the environment accounts

In addition there is also the cdk execution role used for CDK deployments, but it is not relevant for this issue.

role creation permissions trust policies (assumable by)
Pivot role If AUTO-CREATE = True it is created as part of the environment stack, otherwise it is manually created Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and by graphql-role, worker-role and ecs-role from the central account
Environment teams' Roles As part of the environment stack, one for each team that is added to the environment Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and tags. by several module services + by the pivot-Role in the account
Dataset Role As part of the dataset stack, for both imported and created datasets Read and write permissions to the dataset bucket and to the Glue database and tables of the dataset + it can write profiling results in its corresponding folder in the environment bucket by Glue + by the pivot-Role in the account

In this diagram we can see all roles and the SDK calls that they need to perform. It includes the changes that we want to implement.

IAMRoles drawio


Implementation

To be able to remove the S3 and KMS permissions from the pivotRole policies, we need to remove the need of these permissions and then modify the pivot role permissions. At the moment the pivot role is used to access data in the S3 Buckets in the following features: 1) As registration role in Lake Formation for the datasets 2) In the Glue crawlers and profiling jobs as execution role 3) In the upload functionality 4) In the creation of Folders 5) (for a particular customer) it is used to manage S3 bucket policies 6) others that we are not aware of

Because imported dataset buckets can have any name, the S3 permissions granted apply to ALL buckets. We need to restrict the S3 permissions of the pivot role and the resources for those permissions.

1) As registration role in Lake Formation for the datasets

Because in V1.6 the dataset role was modified, we just need to:

Est. time ~ 2 days

2) In the Glue crawlers and profiling jobs as execution role

3) 4) In the upload and the create folder functionalities (see diagram above)

Option 1: Pivot role in all SDK calls

Option 2: Dataset role in SDK calls with access to S3

At first I was more inclined to go for option 1 as the pivot role has restricted access to managed buckets only needed to implement point 5). But after having a look at the code implementing option 2 is quite simple and avoids 'PutObject' permissions for the pivotRole

Est. time ~ 2 days

5) Manage bucket policies

In this case, it is the pivot role the one that needs to execute the updateBucketPolicy api calls. The task is to:

Est. time ---> not included in the initial estimations

More that we are not aware of

There might be other functionalities that access data from the backend of data.all that in theory seem unaffected, but as this is a big change could break. E.g. preview data in data.all

Est. time ~ 2 days

In total adding the changes to the pivotRole policies [~2h] and additional testing time, the resulting estimated time to implement this enhancement will take ~ 8 days πŸ†• + manage bucket policies

zsaltys commented 1 year ago

@dlpzx my concerns:

1) file upload (we wont use this feature anyway() if we use datasetAdmin doesn't sound great because it means data.all would be allowed to assume this role directly or indirectly so that's a concern..

2) can you remind me how glue crawlers are being fixed in 1.6.0?

Can you document as well:

1) how will datasetAdmin role will be created. What will create it and how. Are we granting any extra permissions to anything to create this role?

2) what will be allowed to assume datasetAdmin? Where it will be assumed exactly, where in code etc..

dlpzx commented 1 year ago

Hi @zsaltys I added the missing info in the issue description except for the last point that I need some time. I suspect that you also want to remove the pivot role form the trust policy of the dataset role right?

manjulaK commented 1 year ago

@dlpzx one thing that needs to be addressed is that pivot role should not have putbucketpolicy permission as it can potentially lead to other permissions. One way to address this would be to allow preregistration of buckets to be managed by data.all in this way not all buckets will be controlled by data.all. let me know your thoughts?

dlpzx commented 1 year ago

Hi @manjulaK, thanks for the comment! I see your point, let's see. Our target state would be a pivot role with a policy allowing access or any S3 sensitive action to the data.all created buckets + imported buckets.

What do you think about option a)? Can you confirm that you will be using auto-created pivot roles?

manjulaK commented 1 year ago

hi @dlpzx thank you very much for looking into this. I think your option a) looks good. can you kindly confirm the following assumptions: 1) cdkexec role will be used to import bucket at the time of bucket import the pivot role policy will be tweaked to add the bucket level permissions to the pivot role. 2) pivot role will only be used mainly during sharing folders,buckets ot accesspoint through API and will have s3* sort of access only to the buckets imported into data.all . Addn conditions : a)what happens if imported bucket is deleted outside of data.all b) what happens if imported bucket is deleted through data.all

dlpzx commented 1 year ago
  1. cdkexec role will be used to import b
  1. The CDK exec role is assumed by CloudFormation to create the Environment and the Dataset stack resources. When a Dataset is imported, data.all will update the auto-created pivotRole policy (part of the Environment stack). So yes, the CDK exec role grants permissions via cloudformation to the pivot role
  2. Yes, the pivot role is a data access manager (putBucketPolicy, createAccessPoint...). But s3* is something that we will work on limiting. For example, read data should not be necessary (getObject, putObject)
  3. If an imported bucket is deleted outside of data.all users should FIRST delete the imported Dataset from data.all. That will update the pivotRole policies in the environment stack
zsaltys commented 10 months ago

@dlpzx are you also updating this section in the pivotRole to to make sure that data.all only has access to imported KMS keys?

- Sid: KMS
            Action:
              - 'kms:Decrypt'
              - 'kms:Encrypt'
              - 'kms:GenerateDataKey*'
              - 'kms:PutKeyPolicy'
              - 'kms:ReEncrypt*'
              - 'kms:TagResource'
              - 'kms:UntagResource'
            Effect: Allow
            Resource:
              - '*'
dlpzx commented 10 months ago

830 Also needed for this feature

dlpzx commented 10 months ago

Merged and released with v2.1.0 πŸš€