aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-s3-assets: Object keys don't preserve the full extension name for Python project #30257

Open · opened 4 months ago by marcelokscunha

marcelokscunha commented 4 months ago

Describe the bug

Same as #12699, but for a Python project.

Neither a plain upload nor bundling the asset with a command in a Docker container preserves the full extension (HASH.tar.gz -> HASH.gz).

"tar.gz" file is expected from SageMaker endpoint.

Expected Behavior

I expect the S3 object key to end with .tar.gz.

Current Behavior

When using aws-s3-assets to bundle the asset HASH.tar.gz by executing a command in a Docker container and uploading it to S3, the object is renamed to HASH.gz.

The same occurs when passing HASH.tar.gz directly: it is uploaded to S3 but renamed to HASH.gz.

Reproduction Steps

Reproduction steps for Docker (the same behavior occurs when bundling is omitted and path="path/my.tar.gz" is passed):

Code:

import pathlib

import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_s3_assets,
)
from constructs import Construct

HERE = pathlib.Path(__file__).parent

class CdkErrorStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

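        # Bundle an empty dummy.tar.gz in a Docker container; locally the staged
        # asset keeps the .tar.gz extension, but the uploaded S3 key ends in .gz.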
        self.models_s3_artifact = aws_s3_assets.Asset(
            self,
            "Models",
            path=str(HERE),
            bundling=cdk.BundlingOptions(
                image=cdk.DockerImage.from_registry("debian"),
                command=[
                    "bash", "-c", "tar -czvf /asset-output/dummy.tar.gz --files-from /dev/null"
                ],
                output_type=cdk.BundlingOutput.ARCHIVED
            )
        )
        cdk.CfnOutput(self, "TarURI", value=self.models_s3_artifact.s3_object_url)

It is possible to see that locally, in cdk.out, the tarball has the correct extension:

But in S3 and in the CFN output the key is wrong:

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.142.1

Framework Version

No response

Node.js Version

v18.17.1

OS

MacOSX

Language

Python

Language Version

Python 3.9.12

Other information

No response

ashishdhingra commented 4 months ago

Reproducible using the code below:

import pathlib

import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_s3_assets,
    aws_lambda
)
from constructs import Construct

HERE = pathlib.Path(__file__).parent

class Issue30257Stack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        self.models_s3_artifact = aws_s3_assets.Asset(
            self,
            "Models",
            path=str(HERE),
            bundling=cdk.BundlingOptions(
                image=cdk.DockerImage.from_registry("debian"),
                command=[
                    "bash", "-c", "tar -czvf /asset-output/dummy.tar.gz --files-from /dev/null"
                ],
                output_type=cdk.BundlingOutput.ARCHIVED
            )
        )
        lambda_function = aws_lambda.Function(
            self, "TestLambdaFunction",
            runtime=aws_lambda.Runtime.NODEJS_18_X,
            code=aws_lambda.Code.from_asset("lambda"),
            handler="hello.handler",
            environment=dict(
                S3_BUCKET_NAME=self.models_s3_artifact.s3_bucket_name,
                S3_OBJECT_KEY=self.models_s3_artifact.s3_object_key,
                S3_OBJECT_URL=self.models_s3_artifact.s3_object_url,
            ),
        )
        cdk.CfnOutput(self, "TarURI", value=self.models_s3_artifact.s3_object_url)

Output of cdk synth is below (notice the truncated .gz key in the Outputs value):

Resources:
  TestLambdaFunctionServiceRole0C9E0634:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
        Version: "2012-10-17"
      ManagedPolicyArns:
        - Fn::Join:
            - ""
            - - "arn:"
              - Ref: AWS::Partition
              - :iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    Metadata:
      aws:cdk:path: Issue30257Stack/TestLambdaFunction/ServiceRole/Resource
  TestLambdaFunctionC089708A:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket:
          Fn::Sub: cdk-hnb659fds-assets-${AWS::AccountId}-${AWS::Region}
        S3Key: 3e4845b15c28e9e3882e7f9016a42054e94d051bd8662ff76e8e9f9d0ae50c48.zip
      Environment:
        Variables:
          S3_BUCKET_NAME:
            Fn::Sub: cdk-hnb659fds-assets-${AWS::AccountId}-${AWS::Region}
          S3_OBJECT_KEY: 09fdccd2f59ef7a5487e651683b13996cb2ce8e4d258c878b1325cbacea20082.gz
          S3_OBJECT_URL:
            Fn::Sub: s3://cdk-hnb659fds-assets-${AWS::AccountId}-${AWS::Region}/09fdccd2f59ef7a5487e651683b13996cb2ce8e4d258c878b1325cbacea20082.gz
      Handler: hello.handler
      Role:
        Fn::GetAtt:
          - TestLambdaFunctionServiceRole0C9E0634
          - Arn
      Runtime: nodejs18.x
    DependsOn:
      - TestLambdaFunctionServiceRole0C9E0634
    Metadata:
      aws:cdk:path: Issue30257Stack/TestLambdaFunction/Resource
      aws:asset:path: asset.3e4845b15c28e9e3882e7f9016a42054e94d051bd8662ff76e8e9f9d0ae50c48
      aws:asset:is-bundled: false
      aws:asset:property: Code
  CDKMetadata:
    Type: AWS::CDK::Metadata
    Properties:
      Analytics: v2:deflate64:H4sIAAAAAAAA/02MUQqDMBBEz+J/3BLTC1Shvy16AFljardqUtwNUsS7F5VCv94wM7wM9DkDneDMqW37dKAGlkrQ9gpnrhc2NTI7YbhsUGwgj7Z3kiM7NeDYtAjLNXorFLwqHv6XV0U4wlKGwW31xnVVu6US7Mh3qnQc4mT3/RblHWV//rVF8C0dtvtHnsGfDGgNJnkxUTpFLzQ6KA9+AZvg6wnMAAAA
    Metadata:
      aws:cdk:path: Issue30257Stack/CDKMetadata/Default
    Condition: CDKMetadataAvailable
Outputs:
  TarURI:
    Value:
      Fn::Sub: s3://cdk-hnb659fds-assets-${AWS::AccountId}-${AWS::Region}/09fdccd2f59ef7a5487e651683b13996cb2ce8e4d258c878b1325cbacea20082.gz
...
...

Also reproducible using the TypeScript code below (create assets/dummy.tar.gz via tar -czvf dummy.tar.gz --files-from /dev/null):

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3assets from 'aws-cdk-lib/aws-s3-assets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as path from 'path';

export class TypescriptStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const lambdaAsset = new s3assets.Asset(this, "LambdaAsset", {
      path: path.join(path.resolve(__dirname, '..'), 'assets/dummy.tar.gz')
    });

    new lambda.Function(this, "myLambdaFunction", {
      code: lambda.Code.fromAsset('lambda'),
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.lambda_handler",
      environment: {
        'S3_BUCKET_NAME': lambdaAsset.s3BucketName,
        'S3_OBJECT_KEY': lambdaAsset.s3ObjectKey,
        'S3_OBJECT_URL': lambdaAsset.s3ObjectUrl
      }
    });
    new cdk.CfnOutput(this, "AssetURI", {
      value: lambdaAsset.s3ObjectUrl
    });
  }
}

The related issue https://github.com/aws/aws-cdk/issues/12699 was fixed for CDK v1 (which is deprecated now). Perhaps the same fix needs to be ported to CDK v2.
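For context, the truncation matches what single-suffix extension splitting produces. Python's os.path.splitext shows the same behavior, and a multi-suffix-aware helper (hypothetical, for illustration only, not CDK's actual code) shows what a fix would need to preserve:

import os
import pathlib

# Single-suffix splitting drops the ".tar" part, mirroring the observed truncation:
print(os.path.splitext("dummy.tar.gz"))  # ('dummy.tar', '.gz')

def full_extension(filename: str) -> str:
    """Return all suffixes joined, e.g. '.tar.gz' for 'dummy.tar.gz'."""
    return "".join(pathlib.Path(filename).suffixes)

print(full_extension("dummy.tar.gz"))  # '.tar.gz'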

pahud commented 3 months ago

OK, I read this in the SageMaker doc:

SageMaker Neo requires models to be stored as a compressed TAR file. Repackage it as a compressed TAR file (*.tar.gz):

And that model has to be uploaded to an Amazon S3 customer-managed bucket. Is this your use case?

If that is the case, I don't think you should use S3 Assets like that. You should use S3 bucket deployment, like this:

import * as path from 'path';
import {
  Stack,
  StackProps,
  RemovalPolicy,
  CfnOutput,
  aws_s3 as s3,
  aws_s3_deployment as s3d,
} from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class DummyStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // User-managed bucket, so object keys are kept exactly as the local filenames.
    const sageMakerModelBucket = new s3.Bucket(this, 'SageMakerModelBucket', {
      removalPolicy: RemovalPolicy.DESTROY,
    });

    // Copies everything in ../assets (including dummy.tar.gz) under the original names.
    new s3d.BucketDeployment(this, 'SageMakerModelDeployment', {
      destinationBucket: sageMakerModelBucket,
      sources: [s3d.Source.asset(path.join(__dirname, '../assets'))],
      extract: true,
    });

    new CfnOutput(this, 'SageMakerModelBucketOutput', {
      value: `s3://${sageMakerModelBucket.bucketName}`,
    });
  }
}
1. Prepare your dummy.tar.gz in the local assets dir:
   % ls -al assets/dummy.tar.gz
2. npx cdk deploy this stack.

You should see the bucket name in the output. Try aws s3 ls on it:

Outputs:
dummy-stack.SageMakerModelBucketOutput = s3://dummy-stack-sagemakermodelbucket6aea4c01-xxx
% aws s3 ls s3://dummy-stack-sagemakermodelbucket6aea4c01-xxx
2024-05-18 08:31:30       1610 dummy.tar.gz

That's it!

Generally, aws-s3-assets is for creating the CDK assets needed by a CDK app, for example Lambda function code or anything else that requires staging assets in the CDK file-assets bucket created by cdk bootstrap, which is named in the format cdk-hnb659fds-assets-<YOUR_ACCOUNT_ID>-us-east-1. Assets in that bucket are fully managed by CDK, and we should not expect them to be stored in a customized way. In your use case, you should deploy the local tar.gz into a user-managed bucket rather than the CDK-managed assets bucket. You can specify literally any bucket you manage and deploy the local tar.gz into it with exactly the local filename.
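Since your stack is in Python, the equivalent of the deployment above would look roughly like this (a sketch; the assets path is a placeholder):

import pathlib

import aws_cdk as cdk
from aws_cdk import (
    aws_s3 as s3,
    aws_s3_deployment as s3d,
)
from constructs import Construct

HERE = pathlib.Path(__file__).parent

class DummyStack(cdk.Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # User-managed bucket, so object keys keep the local filenames.
        bucket = s3.Bucket(
            self, "SageMakerModelBucket",
            removal_policy=cdk.RemovalPolicy.DESTROY,
        )
        # Copies assets/dummy.tar.gz into the bucket under its original name.
        s3d.BucketDeployment(
            self, "SageMakerModelDeployment",
            destination_bucket=bucket,
            sources=[s3d.Source.asset(str(HERE / "assets"))],
            extract=True,
        )
        cdk.CfnOutput(
            self, "SageMakerModelBucketOutput",
            value=f"s3://{bucket.bucket_name}",
        )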

I hope it clarifies. Let me know if it works for you.

marcelokscunha commented 3 months ago

Hi @pahud,

I'm deploying custom PyTorch inference logic as part of a SageMaker endpoint. It's a standard flow, as described here, where I've written custom inference logic.

I have something similar to what's described in the doc. Note that I haven't compressed everything into model.tar.gz yet; I was hoping I could use aws_s3_assets.Asset to that end (as this is only for packaging and deploying the endpoint, I don't want to keep running tar -czvf ... commands by hand, and the Asset's bundling functionality fits that perfectly):

# Current local structure
model_dir
|- model.pth
|- code/
  |- inference.py
  |- requirements.txt  # only for versions 1.3.1 and higher

I've developed the model locally and inference now works locally (I have the pre-trained weights model.pth and a working inference.py with the custom inference logic). To deploy to the endpoint, one must create the tarball, upload it to S3, and create the endpoint, roughly as sketched below.
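What I was hoping to write is something like this (a sketch assuming the layout above; because of this issue the uploaded object key currently comes out as HASH.gz instead of HASH.tar.gz):

# Sketch: bundle model_dir (model.pth + code/) into model.tar.gz via the Asset.
model_asset = aws_s3_assets.Asset(
    self,
    "ModelArtifact",
    path=str(HERE / "model_dir"),
    bundling=cdk.BundlingOptions(
        image=cdk.DockerImage.from_registry("debian"),
        command=[
            "bash", "-c",
            # /asset-input is the mounted source dir, /asset-output the staging dir.
            "tar -czvf /asset-output/model.tar.gz -C /asset-input .",
        ],
        output_type=cdk.BundlingOutput.ARCHIVED,
    ),
)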