data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (business users, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Upgrade from V1.5.6 to V1.6.2 issues: AccessDenied for pivot role in ECS tasks + CodeArtifact vpc endpoint networking timeout in CodeBuild migration stage #826

Closed: idpdevops closed this issue 4 months ago

idpdevops commented 9 months ago

Upgrade from V1.5.6 to V1.6.2 fails if using baseline_codebuild_role.without_policy_updates

This is technically not a bug, as the data.all code was used outside its intended purpose.

The fix https://github.com/awslabs/aws-dataall/pull/774 was implemented only in the main branch, so it didn't help with our upgrade from V1.5.6 to V1.6.2. I therefore merged the pipeline.py changes into the V1.6.2 code (code as per "How to reproduce") and ran the pipeline. This initially worked well: it made it past the quality gate stage, the ecr-stage and also the dev-backend-stage. However, the DB migration stage failed (see logs in "Additional context").

The command that failed (error 255 hints at a permissions issue) is

aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store

Interestingly, this command executed perfectly fine in the QualityGate ValidateDBMigrations stage (see logs).

I tried to execute this manually, assuming the relevant role that is used in the pipeline and that also worked fine!?

Since the pipeline run, I have also received tons of emails with data.all alarms for various accounts relating to the ecr-stage (I think), although they stopped after about 2 days.


You are receiving this email because your DATAALL platdev environment in the eu-west-2 region has entered the ALARM state, because it failed to synchronize Dataset AAAAAAAAAAAA-risk-and-control tables from AWS Glue to the Search Catalog.

Alarm Details:

The emails stopped after 2 days, data.all may have given up on whatever it was trying to do.

Here are the Cloudtrail logs:

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "*****:149dfc83a8ff464abfd0ffd63d62deaf",
    "arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/*-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf",
    "accountId": "XXXXXXXXXXXX",
    "accessKeyId": "****",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "****",
        "arn": "arn:aws:iam::XXXXXXXXXXXX:role/-platdev-ecs-tasks-role",
        "accountId": "XXXXXXXXXXXX",
        "userName": "-platdev-ecs-tasks-role"
      },
      "webIdFederationData": {},
      "attributes": {
        "creationDate": "2023-10-13T14:09:31Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2023-10-13T14:10:01Z",
  "eventSource": "sts.amazonaws.com",
  "eventName": "AssumeRole",
  "awsRegion": "eu-west-2",
  "sourceIPAddress": "10.0.32.173",
  "userAgent": "Boto3/1.24.85 Python/3.8.16 Linux/5.10.192-183.736.amzn2.x86_64 exec-env/AWS_ECS_FARGATE Botocore/1.27.85 data.all/0.5.0",
  "errorCode": "AccessDenied",
  "errorMessage": "User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/***-platdev-ecs-tasks-role/149dfc83a8ff464abfd0ffd63d62deaf is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::WWWWWWWWWWWW:role/dataallPivotRole-cdk",
  "requestParameters": null,
  "responseElements": null,
  …
}

So some aspects are a bit of a mystery, maybe you can figure out what exactly has gone wrong.

DB Migration log:

…
[Container] 2023/10/16 14:52:25 Entering phase BUILD
[Container] 2023/10/16 14:52:25 Running command mkdir ~/.aws/ && touch ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "[profile buildprofile]" > ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "role_arn = arn:aws:iam::XXXXXXXXXXXX:role/***-platdev-cb-dbmigration-role" >> ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command echo "credential_source = EcsContainer" >> ~/.aws/config

[Container] 2023/10/16 14:52:25 Running command aws sts get-caller-identity --profile buildprofile
{
  "UserId": "****:botocore-session-1697467958",
  "Account": "XXXXXXXXXXXX",
  "Arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/***-platdev-cb-dbmigration-role/botocore-session-1697467958"
}

[Container] 2023/10/16 14:52:39 Running command aws codebuild start-build --project-name ***-platdev-dbmigration --profile buildprofile --region eu-west-2 > codebuild-id.json

[Container] 2023/10/16 14:52:39 Running command aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json

[Container] 2023/10/16 14:52:40 Running command while [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "SUCCEEDED" ] && [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "FAILED" ]; do echo "running migration"; aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region eu-west-2 > codebuild-output.json; echo "$(jq -r .builds[0].buildStatus codebuild-output.json)"; sleep 5; done
running migration
IN_PROGRESS
running migration
…
IN_PROGRESS
running migration
FAILED

[Container] 2023/10/16 15:08:44 Running command if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi
Failed
{
  "builds": [
    {
      "id": "-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
      "arn": "arn:aws:codebuild:eu-west-2:XXXXXXXXXXXX:build/-platdev-dbmigration:556525ab-375d-481c-8eab-20358cfb3ec8",
      "buildNumber": 12,
      "startTime": 1697467959.665,
      "endTime": 1697468915.942,
      "currentPhase": "COMPLETED",
      "buildStatus": "FAILED",
      "projectName": "-platdev-dbmigration",
      "phases": [
        …
        { "phaseType": "BUILD", "phaseStatus": "FAILED", "startTime": 1697467990.638, "endTime": 1697468915.594, "durationInSeconds": 924, "contexts": [ { "statusCode": "COMMAND_EXECUTION_ERROR", "message": "Error while executing command: aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store. Reason: exit status 255" } ] },
        { "phaseType": "POST_BUILD", "phaseStatus": "SUCCEEDED", "startTime": 1697468915.594, "endTime": 1697468915.63, "durationInSeconds": 0, "contexts": [ { "statusCode": "", "message": "" } ] },
        { "phaseType": "UPLOAD_ARTIFACTS", "phaseStatus": "SUCCEEDED", "startTime": 1697468915.63, "endTime": 1697468915.708, "durationInSeconds": 0, "contexts": [ { "statusCode": "", "message": "" } ] },
        { "phaseType": "FINALIZING", "phaseStatus": "SUCCEEDED", "startTime": 1697468915.708, "endTime": 1697468915.942, "durationInSeconds": 0, "contexts": [ { "statusCode": "", "message": "RequestError: send request failed\ncaused by: Post \"https://logs.eu-west-2.amazonaws.com/\": dial tcp 10.82.2.164:443: i/o timeout" } ] },
        { "phaseType": "COMPLETED", "startTime": 1697468915.942 }
      ],
      "source": { "type": "NO_SOURCE", "buildspec": "{\n \"version\": \"0.2\",\n \"phases\": {\n \"build\": {\n \"commands\": [\n \"aws s3api get-object --bucket -master-code-YYYYYYYYYYYY-eu-west-2 --key source_build.zip source_build.zip\",\n \"unzip source_build.zip\",\n \"python -m venv env\",\n \". env/bin/activate\",\n \"aws codeartifact login --tool pip --domain -domain-master --domain-owner YYYYYYYYYYYY --repository -pypi-store\",\n \"pip install -r backend/requirements.txt\",\n \"pip install alembic\",\n \"export PYTHONPATH=backend\",\n \"export envname=platdev\",\n \"alembic -c backend/alembic.ini upgrade head\"\n ]\n }\n }\n}", "insecureSsl": false },
      "secondarySources": [],
      "secondarySourceVersions": [],
      "artifacts": { "location": "" },
      "cache": { "type": "NO_CACHE" },
      "environment": { "type": "LINUX_CONTAINER", "image": "aws/codebuild/amazonlinux2-x86_64-standard:3.0", "computeType": "BUILD_GENERAL1_SMALL", "environmentVariables": [], "privilegedMode": false, "imagePullCredentialsType": "CODEBUILD" },
      "serviceRole": "arn:aws:iam::XXXXXXXXXXXX:role/-platdev-cb-dbmigration-role",
      "logs": { "groupName": "/aws/codebuild/-platdev-dbmigration", "streamName": "556525ab-375d-481c-8eab-20358cfb3ec8", "deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252F***********-platdev-dbmigration/log-events/556525ab-375d-481c-8eab-20358cfb3ec8", "cloudWatchLogsArn": "arn:aws:logs:eu-west-2:XXXXXXXXXXXX:log-group:/aws/codebuild/-platdev-dbmigration:log-stream:556525ab-375d-481c-8eab-20358cfb3ec8" },
      "timeoutInMinutes": 60,
      "queuedTimeoutInMinutes": 480,
      "buildComplete": true,
      "initiator": "-platdev-cb-dbmigration-role/botocore-session-1697467958",
      "vpcConfig": { "vpcId": "vpc-000382333791308c1", "subnets": [ "subnet-01baa5a50fc364b02", "subnet-0a98df0ce73be5447" ], "securityGroupIds": [ "sg-0e68460a73d0ac50d" ] },
      "networkInterface": { "subnetId": "subnet-0a98df0ce73be5447", "networkInterfaceId": "eni-052532d0cd1bf9110" },
      "encryptionKey": "arn:aws:kms:eu-west-2:XXXXXXXXXXXX:alias/aws/s3"
    }
  ],
  "buildsNotFound": []
}

[Container] 2023/10/16 15:08:44 Command did not exit successfully if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi exit status 255
[Container] 2023/10/16 15:08:44 Phase complete: BUILD State: FAILED
[Container] 2023/10/16 15:08:44 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: if [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" = "FAILED" ]; then echo "Failed"; cat codebuild-output.json; exit -1; fi. Reason: exit status 255
[Container] 2023/10/16 15:08:44 Entering phase POST_BUILD
[Container] 2023/10/16 15:08:44 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/16 15:08:44 Phase context status code: Message:

Quality Gate DB Validation log:


…
[Container] 2023/10/13 13:32:05 Entering phase BUILD
[Container] 2023/10/13 13:32:05 Running command aws codeartifact login --tool pip --repository -pypi-store --domain -domain-master --domain-owner YYYYYYYYYYYY
Successfully configured pip to use AWS CodeArtifact repository https://***********-domain-master-YYYYYYYYYYYY.d.codeartifact.eu-west-2.amazonaws.com/pypi/***********-pypi-store/
Login expires in 12 hours at 2023-10-14 01:32:18+00:00

[Container] 2023/10/13 13:33:41 Phase complete: BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message:
[Container] 2023/10/13 13:33:41 Entering phase POST_BUILD
[Container] 2023/10/13 13:33:41 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/13 13:33:41 Phase context status code: Message:

dlpzx commented 9 months ago

Hi @idpdevops, thanks for opening an issue. Even if it is "outside the intended use", we will try to help you fix your deployment. From the logs I can distinguish 2 different issues:

Issue 1 --> ECS glue sync task failure

Root cause of the issue: access denied for ecs-task-role to AssumeRole arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk

This type of error (Access Denied for AssumeRole) has 2 possible causes.

  1. the ecs-task-role lacks AssumeRole permissions
  2. the arn:aws:iam::ZZZZZZZZZZZZ:role/dataallPivotRole-cdk lacks permissions for the ecs-task-role in its trust policy

In v1.6 we focused on security hardening features, including the hardening of the trust policies on the pivot role, so that is the first thing that I would verify. We moved the external ID used in the trusted accounts to SSM, so its value has been updated.

In case you don't know what the external ID is, here is some documentation of why cross-account role assumption should use external IDs
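As an illustration of cause 2, a pivot role trust policy that trusts the ECS task role and requires an external ID would look roughly like the fragment below. This is a hand-written sketch, not the actual policy data.all generates; the account ID, role name and external ID value are placeholders, and in v1.6 the external ID value would come from SSM.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::XXXXXXXXXXXX:role/prefix-platdev-ecs-tasks-role"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<external-id-from-ssm>"
        }
      }
    }
  ]
}
```

If the trust policy was regenerated during the upgrade but the caller still uses the old external ID (or is no longer listed as a principal), AssumeRole fails with exactly this kind of AccessDenied.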

Issue 2 --> Migration CodeBuild stage failure

There are better logs to debug this issue. Migrations are how data.all updates the RDS table schemas when new features are introduced. We want to include the RDS update as part of our CICD pipeline (tooling account), but our RDS database is deployed in the central deployment account. To be able to modify the RDS database, we deploy a CodeBuild project in the central deployment account, something like prefix-env-dbmigration; this is the "real" migration CodeBuild project, where we run the alembic commands. The migration stage in the tooling account (the "false" migration stage) just triggers this real CodeBuild project in the other account. Your logs show only the status polling, and as you probably noticed, they do not provide much info about the actual error.

What we need is to go to the central account > CodeBuild > Projects, search for prefix-env-dbmigration and check the "real migration" logs.
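As an aside: since the tooling-account stage only polls the overall build status, the actual failure message has to be dug out of the batch-get-builds JSON. A small sketch (plain Python, assuming only the response shape visible in the logs above) that pulls failed-phase messages out of a saved codebuild-output.json:

```python
def failed_phase_messages(batch_get_builds_response):
    """Return (phaseType, message) pairs for failed phases of the first build.

    Assumes the dict shape of an `aws codebuild batch-get-builds` response:
    {"builds": [{"phases": [{"phaseType": ..., "phaseStatus": ...,
                             "contexts": [{"message": ...}]}]}]}
    """
    messages = []
    build = batch_get_builds_response["builds"][0]
    for phase in build.get("phases", []):
        if phase.get("phaseStatus") == "FAILED":
            for ctx in phase.get("contexts", []):
                if ctx.get("message"):
                    messages.append((phase["phaseType"], ctx["message"]))
    return messages


# Example mirroring the structure from the logs above (values abbreviated):
response = {
    "builds": [{
        "buildStatus": "FAILED",
        "phases": [
            {"phaseType": "BUILD", "phaseStatus": "FAILED",
             "contexts": [{"statusCode": "COMMAND_EXECUTION_ERROR",
                           "message": "aws codeartifact login ... exit status 255"}]},
            {"phaseType": "POST_BUILD", "phaseStatus": "SUCCEEDED",
             "contexts": [{"statusCode": "", "message": ""}]},
        ],
    }]
}
for phase_type, message in failed_phase_messages(response):
    print(f"{phase_type}: {message}")
```

This at least surfaces the COMMAND_EXECUTION_ERROR text without opening the JSON by hand; the real root cause still lives in the central-account CodeBuild logs.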

I hope this long text was helpful. Please reach out with any new findings, logs or questions that might arise :)

dlpzx commented 9 months ago

Update from offline troubleshooting

Issue 1: Solved 👍

The first hypothesis was that:

However, they updated the environment stacks in another way: they set the parameter "enable_update_dataall_stacks_in_cicd_pipeline": true in the cdk.json file. After this change, they no longer received access-denied errors. This is another way of forcing updates of environment and dataset stacks as part of the CICD pipeline, and it ensures the integrity of the application.
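For anyone else hitting this, the flag goes into the deployment-environment block of cdk.json, roughly like the fragment below (the surrounding keys depend on your own configuration; only the flag name itself is taken from the discussion here, and "envname" is a placeholder):

```json
{
  "envname": "platdev",
  "enable_update_dataall_stacks_in_cicd_pipeline": true
}
```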

dlpzx commented 9 months ago

Hi @idpdevops, I renamed the issue to reflect the actual issues, in case someone runs into the same challenges.

Update from offline troubleshooting

Issue 2: Understood - requires custom development for customer particular networking 👍

When checking the logs in the deployment account > CodeBuild > migration project, we could not see any logs. Instead, the CodeBuild phase details show that the issue is in the networking.

(Screenshot 2023-10-27 at 13:15:38: CodeBuild phase details)

From v1.5 to v1.6 there are changes in the way packages are installed: v1.6 ensures that all packages are always installed through AWS CodeArtifact. To log in, there is a command that tries to hit a CodeArtifact VPC endpoint.

Given the logs and this particular change between versions, we could conclude that it was a networking issue between the CodeBuild migration project and the CodeArtifact VPC endpoint.

For the default cdk.json configuration, data.all creates those VPC endpoints and configures the VPC and the CodeBuild security group with outbound rules to the security group of the VPC endpoints. In this case, however, the customer had their own internal process for creating VPCs and VPC endpoints. The VPC was created in the CodeBuild account (data.allDeploymentAccount), while the VPC endpoints are deployed in a VPC in a different account (SharedVPCEAccount). Traffic between both VPCs is handled by a Transit Gateway.

In the cdk.json, the VPC created in the data.allDeploymentAccount was introduced as vpc_id for the deployment environment. The problem is that for the vpc_endpoints_sg parameter, the security group in the SharedVPCEAccount is not usable because it lives in another account's VPC. Instead, the customer introduced a generic security group.

We manually added the IP range of the VPC in the SharedVPCEAccount to the outbound rules of the CodeBuild security group, and that solved the issue. Nevertheless, this is a workaround, and we want to address this scenario in a more consistent way.
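As a sketch, the manual workaround amounts to one extra outbound rule on the CodeBuild security group; expressed as the --ip-permissions payload of aws ec2 authorize-security-group-egress, it would look roughly like this (the CIDR is a placeholder for the SharedVPCEAccount VPC range, chosen here to cover the 10.82.2.164 endpoint IP seen in the logs):

```json
[
  {
    "IpProtocol": "tcp",
    "FromPort": 443,
    "ToPort": 443,
    "IpRanges": [
      {
        "CidrIp": "10.82.0.0/16",
        "Description": "HTTPS to CodeArtifact VPC endpoints in SharedVPCEAccount"
      }
    ]
  }
]
```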

Option 1: VPC-peering and NO changes to data.all

Option 2: new cdk.json parameter in data.all for VPC endpoints - VPC range outbound rules

I personally like working with security groups more than with IP ranges; it is more restrictive and readable. But you may have limitations that make option 1 impossible.

@idpdevops let me know your thoughts.

idpdevops commented 9 months ago

@dlpzx

Thank you very much for the analysis and the proposed options.

I prefer the VPC peering option but will have to check how this could work for us.

One thing also to note is that this command in the build project:

aws codeartifact login --tool pip --domain ****-domain-master --domain-owner * --repository **-pypi-store --endpoint-url *.api.codeartifact.eu-west-2.vpce.amazonaws.com

needed the addition of the --endpoint-url parameter to work.

However, even with the changes to the codeartifact command and the DB migration security group egress rules, the next command

pip install -r backend/requirements.txt

still failed because pip tries to access ****-domain-master-*****.d.codeartifact.eu-west-2.amazonaws.com and that request was rejected:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f2c440b9820>, 'Connection to access ****-domain-master-*.d.codeartifact.eu-west-2.amazonaws.com timed out. (connect timeout=15)')': /pypi/idpdataall-pypi-store/simple/ariadne/
...
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f2c440b9fa0>, 'Connection to access ****-domain-master-*.d.codeartifact.eu-west-2.amazonaws.com timed out. (connect timeout=15)')': /pypi/idpdataall-pypi-store/simple/ariadne/
ERROR: Could not find a version that satisfies the requirement ariadne==0.17.0 (from versions: none)
ERROR: No matching distribution found for ariadne==0.17.0

The DNS name gets resolved as follows:

(Screenshot: PIP networking issue)

That is obviously outside the IP address range for the VPC Endpoints we opened up in the DB migration security group egress rules, so the request is rejected:

2 392868065641 eni-***** 10.0.32.137 35.*.***.143 57952 443 6 1 60 1698403869 1698403895 REJECT OK

Interestingly, a reverse lookup on 35...143 points to ec2-35---143.eu-west-2.compute.amazonaws.com, so this seems to be the EC2 resource that runs the codeartifact stuff.

This resource seems to be separate from the 2 CodeArtifact VPC endpoints.

I then added an outgoing rule to the DB migration SG to let all HTTPS/443 traffic out, and that fixed the problem. (I am aware that this is unlikely to be the appropriate solution, but at least it verified that there was a problem and showed that there is a networking issue.)

(Screenshot: buildSuccess)

So overall, there seem to be 3 issues that need to be addressed:

1) Access to VPC endpoints that are not in the main dataall VPC (VPC peering or changes to SG rules)
2) Configuration of the --endpoint-url parameter for the codeartifact command and its use in the build script
3) Routing of the requests to the CodeArtifact EC2s

dlpzx commented 4 months ago

Hi @idpdevops are you still facing issues?

idpdevops commented 4 months ago

Hi, our requirement that drove the use of Data.all has gone away, so the issue has gone away, too.

Kind regards,

Steffen


dlpzx commented 4 months ago

Thanks for responding to the issue. Do not hesitate to reach out if you need any support.