aws / ec2-image-builder-roadmap

Public Roadmap for EC2 Image Builder.

Bug Report: Race condition in AWSTOE. #66

Closed OrEisenberg closed 3 years ago

OrEisenberg commented 3 years ago

This is a bug report. I'm not sure where to post it, since ImageBuilder (and AWSTOE specifically) does not appear to be open source.

TL;DR: As near as I can tell, there is a race condition of the sort described in this issue, and it is causing AWSTOE to fail nondeterministically when building Docker images with ImageBuilder.

I determined this by attempting to build a Docker container image with ImageBuilder. When the build failed (which happens nondeterministically), I inspected the state of the instance that had failed to build the image. I located the Dockerfile used to construct the image (in the /tmp/imagebuilder_service dir) and tried to build it manually multiple times. Sometimes the build succeeded; other times it failed with the same error message ImageBuilder had returned when I built the image through the service:

failed to download the EC2 Image Builder Component '<component arn>'. Error - operation error 
imagebuilder: GetComponent, failed to sign request: failed to retrieve credentials: failed to decode 
<imagebuilder role> EC2 IMDS role credentials, context canceled
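The manual rebuild loop described above can be automated. Below is a minimal sketch that runs an arbitrary command several times and counts the failures. The helper names (`run_repeatedly`, `failure_count`) are hypothetical, and the command in the demo is a stand-in: on the build instance you would substitute whatever command reproduces the failure.

```python
import subprocess
import sys


def run_repeatedly(cmd, attempts):
    """Run cmd `attempts` times; return a list of (returncode, last stderr line)."""
    results = []
    for _ in range(attempts):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        stderr_lines = proc.stderr.strip().splitlines()
        last_line = stderr_lines[-1] if stderr_lines else ""
        results.append((proc.returncode, last_line))
    return results


def failure_count(results):
    """Count the runs that exited with a non-zero status."""
    return sum(1 for code, _ in results if code != 0)


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace it with the real
    # build command when reproducing on the instance (assumption).
    demo_cmd = [sys.executable, "-c", "print('ok')"]
    results = run_repeatedly(demo_cmd, attempts=5)
    print(f"{failure_count(results)} of {len(results)} runs failed")
```

Keeping the last stderr line per run makes it easy to see whether every failure carries the same message.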

I further tracked down the source of this error to the invocation of awstoe run in a file called <hash>-docker-build.sh which is used to actually build the image per the user's image recipe. At this point I hit a dead end since awstoe is an executable whose source does not seem to be publicly available. Given the similarities between this situation and the one described in the GitHub issue I linked above, I strongly suspect that there is a similar race condition occurring in TOE, but I am obviously not able to confirm this.

Please do let me know if there's any more information which I can provide about the bug, or if I should be posting this issue elsewhere -- this was the most suitable public GitHub repo at which I could post this.

Thanks so much in advance!

ytsssun commented 3 years ago

Hi @OrEisenberg ,

Thanks for the bug report. What are the minimum steps to reproduce the bug, and which Image Builder component are you using?

Right now TOE is not open sourced, but I can help with locating the issue.

OrEisenberg commented 3 years ago

Hi @ytsssun ! I appreciate the timely response. Below you can find the most minimal example I could construct using the python CDK. It is just a script (~100 lines long). I hope that it will suffice.

It's probably also worth mentioning that I've now encountered a seemingly related error, which also occurs randomly. The error message for this new failure is as follows.

failed to upload file <local path to component file> to <remote destination> with error 'operation error 
S3: PutObject, https response error StatusCode: 400, RequestID: <request id>, HostID: <host id>, 
api error AuthorizationHeaderMalformed: The authorization header is malformed; a non-empty Access 
Key (AKID) must be provided in the credential.

After running the script, just check all the log files and some of them will have failed with either this message or the aforementioned one regarding an inability to load credentials while attempting to pull down a component. As before, please do let me know if you have any questions whatsoever.
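Checking the logs can be scripted. The sketch below assumes the log files have already been downloaded and read into strings; it tallies how many contain each of the two error messages quoted in this thread. The signature substrings come from those messages, while the function names and the `no_known_error` label are made up for illustration.

```python
from collections import Counter

# Substrings taken from the two error messages quoted in this thread.
SIGNATURES = {
    "imds_credentials": "failed to retrieve credentials",
    "s3_malformed_auth": "AuthorizationHeaderMalformed",
}


def classify_log(text):
    """Return the set of known error signatures that appear in one log."""
    return {name for name, needle in SIGNATURES.items() if needle in text}


def tally(logs):
    """Count signature hits across logs, given a mapping of log name -> text."""
    counts = Counter()
    for text in logs.values():
        hits = classify_log(text)
        if hits:
            counts.update(hits)
        else:
            counts["no_known_error"] += 1
    return counts
```

A log that matches neither signature is counted separately, so successful builds and unrelated failures do not get mixed in with the two errors of interest.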


import aws_cdk as cdk
from aws_cdk import (
    aws_ec2 as ec2,
    aws_ecr as ecr,
    aws_iam as iam,
    aws_imagebuilder as ib,
    aws_s3 as s3,
)

DATA = """
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: bash-script
        action: ExecuteBash
        inputs:
          commands:
            - eval ":"
"""
DOCKERFILE_TEMPLATE = (
    'FROM {{{ imagebuilder:parentImage }}}\n'
    'USER root\n'
    '{{{ imagebuilder:environments }}}\n'
    '{{{ imagebuilder:components }}}\n'
)

class TestStack(cdk.Stack):

    def __init__(self, scope, id):
        super().__init__(scope, id)
        # construct vpc/subnet
        vpc = ec2.Vpc(self, 'TestVpc')
        subnet = vpc.public_subnets[0]
        # construct ECR repo
        repo = ecr.Repository(self, 'TestRepo')
        target_repo = ib.CfnContainerRecipe.TargetContainerRepositoryProperty(
            repository_name=repo.repository_name, service='ECR'
        )
        # construct components
        components = [
            ib.CfnComponent(
                self,
                f'image-component-{i}',
                name=f'image-component-{i}',
                platform='Linux',
                version='0.0.0',
                data=DATA,
            )
            for i in range(10)
        ]
        components = [
            ib.CfnContainerRecipe.ComponentConfigurationProperty(
                component_arn=component.attr_arn
            )
            for component in components
        ]
        # construct recipe
        recipe = ib.CfnContainerRecipe(
            self,
            'test-container-recipe',
            name='test-container',
            container_type='DOCKER',
            dockerfile_template_data=DOCKERFILE_TEMPLATE,
            version='0.0.0',
            target_repository=target_repo,
            components=components,
            platform_override='Linux',
            parent_image='ubuntu:20.04',
        )

        # construct logging bucket
        logging_bucket = s3.Bucket(
            self,
            id,
        )
        logging = ib.CfnInfrastructureConfiguration.LoggingProperty(
            s3_logs=ib.CfnInfrastructureConfiguration.S3LogsProperty(
                s3_bucket_name=logging_bucket.bucket_name
            )
        )
        # construct instance profile
        role = iam.Role(
            self,
            'test-role',
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name('AdministratorAccess')],
            assumed_by=iam.CompositePrincipal(iam.ServicePrincipal('ec2.amazonaws.com')),
        )
        instance_profile = iam.CfnInstanceProfile(
            self,
            'test-instance-profile',
            instance_profile_name='test-instance-profile',
            roles=[role.role_name],
        )
        # construct infrastructure configuration
        infra = ib.CfnInfrastructureConfiguration(
            self,
            'infra-config',
            name='infra-config',
            logging=logging,
            instance_profile_name=instance_profile.ref,
            subnet_id=subnet.subnet_id,
            security_group_ids=[vpc.vpc_default_security_group],
            key_pair='id-aqs-or',
            terminate_instance_on_failure=False,
            instance_types=['t2.micro'],
        )
        for i in range(10):
            image = ib.CfnImage(
                self,
                f'test-image-{i}',
                infrastructure_configuration_arn=infra.attr_arn,
                container_recipe_arn=recipe.attr_arn,
            )
            image.node.add_dependency(repo)

if __name__ == '__main__':
    app = cdk.App()
    TestStack(app, 'TestStack')
    app.synth()

OrEisenberg commented 3 years ago

@ytsssun The short of it is, you really can’t reliably build any container images with ImageBuilder right now. If it would be helpful for you, I’d be happy to synthesize the above CDK stack down to CloudFormation and send you that, or even translate this over to a bash script that just uses the CLI. I’m quite confident that all of these will break in the same manner (as would trying to build a container image just using the console).

Just let me know which form of reproduction steps would be most helpful to you!

OrEisenberg commented 3 years ago

UPDATE:

I have traced the issue further, to a problem in the ImageBuilder resource provider for CloudFormation. In light of this, I'd like to apologize for the fire-and-brimstone tone of my previous posts. You may find here the relevant issue I've opened in the resource provider repo.

ytsssun commented 3 years ago

Hi @OrEisenberg,

Sorry for the late reply, and thanks for the deep dive. To support your finding, I also ran an experiment; here is the result.

From your script, this is the document to be executed by TOE.

schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: bash-script
        action: ExecuteBash
        inputs:
          commands:
            - eval ":"

I used that document in an Image Builder pipeline to build a Docker image through the AWS console. You mentioned that this failure happens nondeterministically, so I ran the pipeline 10 times. In my experiment, all 10 runs succeeded.

I can try more runs to increase confidence, but I think Image Builder and TOE are working fine.
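As a rough gauge of how much 10 clean runs can rule out, here is a small sketch. It assumes each run fails independently with the same fixed probability, which may not hold for a race condition; the function names are illustrative. Under that assumption, a bug with a 10% per-run failure rate would still produce 10 straight successes about 35% of the time.

```python
import math


def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all succeed."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs that shows at least one failure
    with probability >= confidence, given the per-run failure rate."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    print(f"P(10 clean runs | 10% failure rate) = {prob_all_pass(0.1, 10):.3f}")
    print(f"Runs for 95% chance of seeing a failure: {runs_needed(0.1, 0.95)}")
```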