dlpzx commented 1 year ago

🆕 [UPDATED WITH THE FEEDBACK FROM COMMENTS] Is your idea related to a problem? Please describe. Currently the data.all pivotRole requires permission to all S3 Buckets and KMS keys in the AWS account

Describe the solution you'd like A solution in which access to the data on S3 buckets is restricted to specific roles only. We would like to prevent any data access from other accounts, especially if the pivot role gets compromised.

Analysis of roles in data.all

In data.all central account

role	used in
graphql-role	Graphql Lambda function
worker-role	Worker Lambda function
esproxy-role	OpenSearch proxy Lambda function
ecs-role	ECS tasks

In the environment accounts

In addition there is also the cdk execution role used for CDK deployments, but it is not relevant for this issue.

role	creation	permissions	trust policies (assumable by)
Pivot role	If AUTO-CREATE = True it is created as part of the environment stack, otherwise it is manually created	Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and	by `graphql-role`, `worker-role` and `ecs-role` from the central account
Environment teams' Roles	As part of the environment stack, one for each team that is added to the environment	Read and write permissions to the TEAM-OWNED dataset buckets and to the Glue databases and tables + permissions to work with the enabled modules, but only to the TEAM-OWNED resources based on resource policies and tags.	by several module services + by the `pivot-Role` in the account
Dataset Role	As part of the dataset stack, for both imported and created datasets	Read and write permissions to the dataset bucket and to the Glue database and tables of the dataset + it can write profiling results in its corresponding folder in the environment bucket	by Glue + by the `pivot-Role` in the account

In this diagram we can see all roles and the SDK calls that they need to perform. It includes the changes that we want to implement.

IAMRoles drawio

Implementation

To be able to remove the S3 and KMS permissions from the pivotRole policies, we need to remove the need of these permissions and then modify the pivot role permissions. At the moment the pivot role is used to access data in the S3 Buckets in the following features: 1) As registration role in Lake Formation for the datasets 2) In the Glue crawlers and profiling jobs as execution role 3) In the upload functionality 4) In the creation of Folders 5) (for a particular customer) it is used to manage S3 bucket policies 6) others that we are not aware of

Because imported dataset buckets can have any name, the S3 permissions granted apply to ALL buckets. We need to restrict the S3 permissions of the pivot role and the resources for those permissions.

1) As registration role in Lake Formation for the datasets

Because in V1.6 the dataset role was modified, we just need to:

[X] - Modify the Dataset stack and the role used in the registration [1h]
[X] - Modify the Dataset stack and grant permissions to the imported KMS key and S3 bucket to the dataset role [2h]
[ ] - Test how previously created Datasets are registered and can be updated. [8h]

Est. time ~ 2 days

2) In the Glue crawlers and profiling jobs as execution role

[X] Done in V1.6.0 🚀 - The execution role for the dataset crawler and profiling job is the dataset role. The pivot role calls the "startCrawler" API, Glue assumes the dataset role to crawl the data.

3) 4) In the upload and the create folder functionalities (see diagram above)

Option 1: Pivot role in all SDK calls

PROS: easy to implement. No additional trusts from the central account.
CONS: access to data. putObject permissions

Option 2: Dataset role in SDK calls with access to S3

PROS: no putObject permissions needed for pivotRole
CONS: pivotrole can still assume the dataset role, leading to the same situation.

At first I was more inclined to go for option 1 as the pivot role has restricted access to managed buckets only needed to implement point 5). But after having a look at the code implementing option 2 is quite simple and avoids 'PutObject' permissions for the pivotRole

Est. time ~ 2 days

5) Manage bucket policies

In this case, it is the pivot role the one that needs to execute the updateBucketPolicy api calls. The task is to:

[X] limit the necessary permission to data.all managed buckets only. This is done by updating the pivot role if it is auto-created when a new dataset is created/imported. We fetch the environment datasets and grant permissions to those buckets only
[ ] Ensure that the pivot role update is triggered with a dataset creation/import
[ ] Test the impact in pre-existing datasets (backwards compatibility)

Est. time ---> not included in the initial estimations

More that we are not aware of

There might be other functionalities that access data from the backend of data.all that in theory seem unaffected, but as this is a big change could break. E.g. preview data in data.all

Est. time ~ 2 days

In total adding the changes to the pivotRole policies [~2h] and additional testing time, the resulting estimated time to implement this enhancement will take ~ 8 days 🆕 + manage bucket policies

zsaltys commented 1 year ago

@dlpzx my concerns:

1) file upload (we wont use this feature anyway() if we use datasetAdmin doesn't sound great because it means data.all would be allowed to assume this role directly or indirectly so that's a concern..

2) can you remind me how glue crawlers are being fixed in 1.6.0?

Can you document as well:

1) how will datasetAdmin role will be created. What will create it and how. Are we granting any extra permissions to anything to create this role?

2) what will be allowed to assume datasetAdmin? Where it will be assumed exactly, where in code etc..

dlpzx commented 1 year ago

Hi @zsaltys I added the missing info in the issue description except for the last point that I need some time. I suspect that you also want to remove the pivot role form the trust policy of the dataset role right?

manjulaK commented 1 year ago

@dlpzx one thing that needs to be addressed is that pivot role should not have putbucketpolicy permission as it can potentially lead to other permissions. One way to address this would be to allow preregistration of buckets to be managed by data.all in this way not all buckets will be controlled by data.all. let me know your thoughts?

dlpzx commented 1 year ago

Hi @manjulaK, thanks for the comment! I see your point, let's see. Our target state would be a pivot role with a policy allowing access or any S3 sensitive action to the data.all created buckets + imported buckets.

That is the goal, but the list of S3 imported buckets can be obtained
- a) from the data.all backend directly = if a bucket is imported we add it to the pivot role policy. Similar to what we do with a env-team role.
- b) from an RDS table, DynamoDB, any other place. I am less inclined to this option as it adds more complexity
I am assuming we use auto-created pivotRoles that can be updated by data.all backend.
We need to verify if we will face size limitations in the IAM policies for the pivot role. With modularization the policies for the disabled modules are not added to the role, but still if there are hundreds of S3 imported buckets in an environment it might be a bit too much for IAM.

What do you think about option a)? Can you confirm that you will be using auto-created pivot roles?

manjulaK commented 1 year ago

hi @dlpzx thank you very much for looking into this. I think your option a) looks good. can you kindly confirm the following assumptions: 1) cdkexec role will be used to import bucket at the time of bucket import the pivot role policy will be tweaked to add the bucket level permissions to the pivot role. 2) pivot role will only be used mainly during sharing folders,buckets ot accesspoint through API and will have s3* sort of access only to the buckets imported into data.all . Addn conditions : a)what happens if imported bucket is deleted outside of data.all b) what happens if imported bucket is deleted through data.all

dlpzx commented 1 year ago

cdkexec role will be used to import b

The CDK exec role is assumed by CloudFormation to create the Environment and the Dataset stack resources. When a Dataset is imported, data.all will update the auto-created pivotRole policy (part of the Environment stack). So yes, the CDK exec role grants permissions via cloudformation to the pivot role
Yes, the pivot role is a data access manager (putBucketPolicy, createAccessPoint...). But s3* is something that we will work on limiting. For example, read data should not be necessary (getObject, putObject)
If an imported bucket is deleted outside of data.all users should FIRST delete the imported Dataset from data.all. That will update the pivotRole policies in the environment stack

zsaltys commented 10 months ago

@dlpzx are you also updating this section in the pivotRole to to make sure that data.all only has access to imported KMS keys?

- Sid: KMS
            Action:
              - 'kms:Decrypt'
              - 'kms:Encrypt'
              - 'kms:GenerateDataKey*'
              - 'kms:PutKeyPolicy'
              - 'kms:ReEncrypt*'
              - 'kms:TagResource'
              - 'kms:UntagResource'
            Effect: Allow
            Resource:
              - '*'

dlpzx commented 10 months ago

830 Also needed for this feature

dlpzx commented 10 months ago

Merged and released with v2.1.0 🚀

data-dot-all / dataall

Limit Pivot Role S3 permissions #580

Analysis of roles in data.all

In data.all central account

In the environment accounts

Implementation

1) As registration role in Lake Formation for the datasets

2) In the Glue crawlers and profiling jobs as execution role

3) 4) In the upload and the create folder functionalities (see diagram above)

5) Manage bucket policies

More that we are not aware of

830 Also needed for this feature