awslabs / aws-serverless-data-lake-framework

Enterprise-grade, production-hardened, serverless data lake on AWS
https://sdlf.workshop.aws/
MIT No Attribution

Sequence of deployment for multi environment deployment #7

Closed bkanzki-onica closed 3 years ago

bkanzki-onica commented 3 years ago

@trejas When trying to deploy the multi-environment SDLF in this workshop, I run into errors.

My config file looks like this:

`[default]
region = us-east-1
output=json

[profile bdlf-dev]
account=TTTTTTTTT
region = us-east-1
output = json

[profile bdlf-qa]
account=ZZZZZZZZZ
region = us-east-1
output = json

[profile bdlf-prod]
account=YYYYYYYY
region = us-east-1
output = json

[profile bdlf-devops-dev]
account=XXXXXXXXXXX
role_arn=arn:aws:iam::XXXXXXXXXXX:role/big-data-labs-data-engineer-dev
source_profile=bdlf-dev
region = us-east-1
output = json

[profile bdlf-devops-qa]
account=XXXXXXXXXXX
role_arn=arn:aws:iam::XXXXXXXXXXX:role/big-data-labs-data-engineer-qa
source_profile=bdlf-qa
region = us-east-1
output = json

[profile bdlf-devops-prod]
account=XXXXXXXXXXX
role_arn=arn:aws:iam::XXXXXXXXXXX:role/big-data-labs-data-engineer-prod
source_profile=bdlf-prod
region = us-east-1
output = json

[profile bdlf-devops-main]
region = us-east-1
output = json`
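Before running deploy.sh, it can help to confirm which account each profile actually resolves to. The following is a minimal sketch, assuming the AWS CLI is installed; `profile_account` and `check_profiles` are hypothetical helper names, not part of SDLF.

```shell
#!/bin/sh
# Hypothetical helper: print the account ID that a named profile resolves to.
# `aws sts get-caller-identity` fails if the profile has no usable credentials.
profile_account() {
  aws sts get-caller-identity --profile "$1" --query Account --output text
}

# Hypothetical helper: report the resolved account for every profile given.
check_profiles() {
  for p in "$@"; do
    if acct=$(profile_account "$p" 2>/dev/null); then
      echo "$p -> $acct"
    else
      echo "$p -> no usable credentials"
    fi
  done
}

# Usage (profile names taken from the config above):
#   check_profiles bdlf-devops-main bdlf-dev bdlf-qa bdlf-prod
```

A profile that reports no usable credentials, or an unexpected account ID, would be worth fixing before looking at the stacks themselves.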

Step 1 was to run this command:

./deploy.sh -s bdlf-devops-main -r us-east-1 -f

Step 2 is to run this command:

./deploy.sh -s bdlf-devops-main -t bdlf-dev -r us-east-1 -e dev -o -c

But every time I run it, I get: An error occurred (ValidationError) when calling the DescribeStackEvents operation: Stack [sdlf-cicd-child-foundations] does not exist. The stack's status then goes to ROLLBACK_COMPLETE and it cannot be updated afterwards. None of the resources get created in the child account.

Could you help with this?

jaidisido commented 3 years ago

Hi @bkanzki-onica, so if I understand correctly, resources are correctly deployed in the CICD account but not in the child account?

Can you please confirm that you can see two stacks in CREATE_COMPLETE in the CICD account like in this picture:

sdlf-cicd-devops

Then in the child account, for the sdlf-cicd-child-foundations stack which is in ROLLBACK_COMPLETE could you identify the first error that you see in the Events section of that CloudFormation stack and provide it here?

It's most likely something to do with your role's permissions, and searching the Events of the failed stack would tell you about that.
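The same check can be done from the command line instead of the console. This is a sketch only, assuming the AWS CLI is installed and that the bdlf-dev profile points at the child account; `failed_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list the failed events of a CloudFormation stack.
# $1 = stack name, $2 = AWS CLI profile of the account that owns the stack.
failed_events() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --profile "$2" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp, LogicalResourceId, ResourceStatusReason]" \
    --output table
}

# Usage:
#   failed_events sdlf-cicd-child-foundations bdlf-dev
```

CloudFormation returns events newest first, so the first error is normally the last row of the table.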

bkanzki-onica commented 3 years ago

Yes, I do see two stacks in the CICD account. In the child account I have only this under Events: Screen Shot 2020-11-20 at 6 57 13 PM

jaidisido commented 3 years ago

Could you try to:

  1. Manually delete the sdlf-cicd-child-foundations stack in the child account

  2. Manually recreate the same stack by uploading this template from the repository into CloudFormation. In the parameters, it will ask for these two inputs:

    • pSharedDevOpsAccountId: The 12-digit AWS account ID of the CICD account
    • pSharedDevOpsAccountKmsKeyArn: The KMS key ARN from the CICD account. It can be obtained from the Outputs section of the sdlf-cicd-shared-foundations-dev stack in the CICD account under oKMSKeyId KMSKey
  3. As soon as the stack launches, please monitor it for any issues and let us know what you encounter
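For anyone who prefers the CLI to the console, the steps above can be sketched roughly as follows. Everything here is an assumption except the stack and parameter names: `recreate_child_foundations` is a hypothetical helper, the template path is a placeholder for a local copy of the template, and `CAPABILITY_NAMED_IAM` is passed on the assumption that the template creates IAM resources.

```shell
#!/bin/sh
# Hypothetical helper: delete the failed stack, wait, then recreate it.
# $1 = profile of the child account
# $2 = local path to the template (placeholder, not a real repository path)
# $3 = 12-digit account ID of the CICD account
# $4 = KMS key ARN from the CICD account
recreate_child_foundations() {
  profile=$1
  template=$2
  devops_account=$3
  kms_arn=$4
  stack=sdlf-cicd-child-foundations

  aws cloudformation delete-stack \
    --stack-name "$stack" --profile "$profile" || return 1
  aws cloudformation wait stack-delete-complete \
    --stack-name "$stack" --profile "$profile" || return 1
  aws cloudformation create-stack \
    --stack-name "$stack" \
    --profile "$profile" \
    --template-body "file://$template" \
    --parameters \
      "ParameterKey=pSharedDevOpsAccountId,ParameterValue=$devops_account" \
      "ParameterKey=pSharedDevOpsAccountKmsKeyArn,ParameterValue=$kms_arn" \
    --capabilities CAPABILITY_NAMED_IAM
}

# Usage (all values are placeholders):
#   recreate_child_foundations bdlf-dev ./template.yaml 111122223333 "$KMS_KEY_ARN"
```

The stack's Events can then be watched in the console while it launches.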

bkanzki-onica commented 3 years ago

Hi, I followed your instructions and this is what I got: Invalid Principal sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx Arn in sdlf-cicd-child-foundations stack. I can't find that role in my list of roles.

jaidisido commented 3 years ago

@bkanzki-onica something strange is that the sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx is defined in the CICD AWS account, not in the child account. The role is not defined in the sdlf-cicd-child-foundations template, so I don't understand how the child account can even detect this role... Could it be that this child account was previously used for an initial deployment of SDLF (where CICD and Child accounts were combined into one)?

In the Lake Formation console of the child account, could you also check the Admins and database creators section for any reference to this role?
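The same lookup can be approximated from the CLI. This is a sketch under assumptions: it treats catalog-level permissions as the place where database creators show up, and `lf_references` is a hypothetical helper that simply greps the JSON output for a fragment of the role name.

```shell
#!/bin/sh
# Hypothetical helper: search Lake Formation settings and catalog-level
# permissions for any principal whose ARN contains the given text.
# $1 = fragment of the role name, $2 = profile of the child account.
# Returns non-zero (grep's exit status) when nothing matches.
lf_references() {
  {
    aws lakeformation get-data-lake-settings \
      --profile "$2" --output json
    aws lakeformation list-permissions \
      --resource-type CATALOG --profile "$2" --output json
  } | grep -F -- "$1"
}

# Usage:
#   lf_references rTeamReposCodeBuildRole bdlf-dev
```

Any line it prints points at a setting or grant that still names the role.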

bkanzki-onica commented 3 years ago

@jaidisido the child account was used for an initial deployment of SDLF, then I deleted everything to deploy the multi environment account. In the Lake Formation console there is no reference to that. I could recreate that role, do you know what permissions came with it?

jaidisido commented 3 years ago

That would explain it partly. It seems that Lake Formation is still holding on to this role somehow, although it was previously deleted. What remains a mystery is why the role ended up in Lake Formation in the first place. At no point does SDLF add it to Lake Formation, so it must have been added manually.

I am not sure if recreating the role would help, but it cannot hurt to try. The role is defined here, and I am not sure whether you need to fully recreate it or whether a role with the same name would be enough.

A more radical solution would be to deploy in another (clean) child account. I appreciate that might not be possible, however.

bkanzki-onica commented 3 years ago

I don't understand. This is the template for sdlf-cicd-team-repos. It creates that role in the devops account; it is only in the child account that the role doesn't get created. Why? What is missing to deploy that role there? These accounts were empty before. So if someone deletes that role by accident, your stack doesn't recreate it? Screen Shot 2020-11-23 at 11 00 32 AM

Screen Shot 2020-11-23 at 11 00 52 AM

bkanzki-onica commented 3 years ago

How can that role be deleted from lake formation?

jaidisido commented 3 years ago

As you say, the sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx is created through the sdlf-cicd-team-repos template. So here is my understanding of the events that led to the issue:

  1. The current child AWS account was first used for an initial SDLF deployment. Because in this first deployment all resources were provisioned in the same AWS account, the sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx role was created via the sdlf-cicd-team-repos template
  2. This sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx role also ended up in Lake Formation. This is the part that I don't understand, since at no point should this happen in the framework unless someone adds it manually for some reason
  3. At some point, the child account was cleaned and all resources destroyed (including the role). But for some reason, that role was still registered in Lake Formation (again, not sure why)
  4. SDLF was then deployed once more, but this time over two accounts. In this new configuration, the sdlf-cicd-team-repos-rTeamReposCodeBuildRolexxxxx role was created in the CICD account because that is where the sdlf-cicd-team-repos stack is defined. It should NOT appear in the child account at this point
  5. Thus the error you see in the child account leads me to believe that Lake Formation is still holding on to the old version of that IAM role

bkanzki-onica commented 3 years ago

I was finally able to delete it, and the sdlf-cicd-child-foundations stack has been created completely, which gives me access to CodePipeline and CodeBuild. However, when pushing after modifying the parameters-dev.json file, I get the following error:

[Container] 2020/11/23 17:18:08 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: if [ "${CODEBUILD_BUILD_NUMBER}" -gt "1" ]; then

99 | ./deploy.sh

bkanzki-onica commented 3 years ago

I was able to manually deploy the resources by running this command in the sdlf-foundations folder: ./deploy.sh -n sdlf-cicd-child-foundations -s sdlf-cfn-artifacts-us-east-1-XXXXXXXXXXXXX -p bdlf-dev

But the previously created CodeBuild and CodePipeline resources disappeared; they got deleted. Would you know why?

jaidisido commented 3 years ago

They got deleted because you used the same name (sdlf-cicd-child-foundations) when deploying your sdlf-foundations resources, effectively asking CloudFormation to replace the resources that were previously defined in the existing sdlf-cicd-child-foundations stack. The command should have been: ./deploy.sh -n sdlf-foundations -s sdlf-cfn-artifacts-us-east-1-XXXXXXXXXXXXX -p bdlf-dev
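One way to avoid this kind of accidental replacement is to check whether a stack name is already taken before deploying under it. The sketch below assumes the AWS CLI is installed; `stack_exists` and `deploy_mode` are hypothetical helpers, not part of deploy.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a stack with this name exists.
# describe-stacks exits non-zero when the stack does not exist; note that it
# also fails for other reasons (for example bad credentials).
# $1 = stack name, $2 = AWS CLI profile.
stack_exists() {
  aws cloudformation describe-stacks \
    --stack-name "$1" --profile "$2" >/dev/null 2>&1
}

# Hypothetical helper: say whether deploying under this name would create a
# new stack or change an existing one.
deploy_mode() {
  if stack_exists "$1" "$2"; then
    echo "update: $1 already exists and its resources would be changed"
  else
    echo "create: $1 does not exist yet"
  fi
}

# Usage:
#   deploy_mode sdlf-foundations bdlf-dev
```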

bkanzki-onica commented 3 years ago

@jaidisido That command does deploy the stack, but it gets stuck at the end and can't create the Glue catalog and common policy:

Screen Shot 2020-11-23 at 9 28 54 PM Screen Shot 2020-11-23 at 9 14 17 PM

In the terminal I also get:

An error occurred (ParameterNotFound) when calling the GetParameter operation:
upload failed: scripts/deequ/jar/deequ-1.0.3-RC1.jar to s3:///deequ/jars/deequ-1.0.3-RC1.jar Parameter validation failed: Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.-]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-]{1,63}$"
fatal error: Parameter validation failed: Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.-]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-]{1,63}$"

And then it goes to ROLLBACK_COMPLETE.

UPDATE: Tried a few times. It rolls back and deletes everything every time.

bkanzki-onica commented 3 years ago

What's interesting is that when I deploy it under the sdlf-cicd-child-foundations name, CodeBuild and CodePipeline disappear, but all the resources get created properly, which is not the case when I call it sdlf-foundations. There seems to be a permission issue in that stack.

bkanzki-onica commented 3 years ago

Is it possible to setup a meeting to discuss this?

jaidisido commented 3 years ago

There seems to be a number of different issues here.

  1. I would recommend deploying the infrastructure using the CICD resources (CodePipeline, CodeBuild...) rather than running the deploy script manually. The GetParameter error you are seeing is most likely due to your environment missing jq, a utility used to query JSON files. This utility is installed by default in CodeBuild environments; to install it in your own environment, you would need to run echo y | sudo yum install jq

  2. It seems that resources are still lingering from your very first SDLF deployment. For instance, DynamoDB tables are retained even when the stack is deleted. So if you try to redeploy the Dynamo stack it will fail because the tables are already there. Given that this account is polluted from the previous deployment, I would strongly recommend testing in a different one and using the CICD pipeline instead of your own environment.
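If the deploy script does have to be run from a local machine, a small pre-flight check can make a failure like the empty bucket name above surface earlier. This is only a sketch: `require_cmd` and `require_value` are hypothetical helpers, and `BUCKET` stands in for whatever variable holds the looked-up bucket name.

```shell
#!/bin/sh
# Hypothetical helper: fail with a message if a command is not on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing command: $1" >&2
    return 1
  }
}

# Hypothetical helper: fail if a looked-up value came back empty.
# $1 = label used in the message, $2 = the value to check.
require_value() {
  [ -n "$2" ] || {
    echo "empty value for $1" >&2
    return 1
  }
}

# Usage (BUCKET is a placeholder variable name):
#   require_cmd jq || exit 1
#   require_cmd aws || exit 1
#   require_value "artifacts bucket" "$BUCKET" || exit 1
```

Leftover DynamoDB tables from an earlier deployment can be listed with aws dynamodb list-tables --profile bdlf-dev before redeploying.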