aws-samples / comfyui-on-amazon-sagemaker

This project demonstrates how to generate images using Stable Diffusion or FLUX.1 models by hosting ComfyUI on an Amazon SageMaker inference endpoint.
MIT No Attribution

CREATE_FAILED #2

Closed. ThorstenBux closed this issue 1 month ago

ThorstenBux commented 1 month ago

Hi,

I'm getting:

        },
        {
            "StackId": "arn:aws:cloudformation:us-west-2:147220985702:stack/comfyui/25b5d430-5886-11ef-b996-06fb7d7e3449",
            "EventId": "ComfyUIEndpoint-DELETE_IN_PROGRESS-2024-08-12T08:41:53.666Z",
            "StackName": "comfyui",
            "LogicalResourceId": "ComfyUIEndpoint",
            "PhysicalResourceId": "arn:aws:sagemaker:us-west-2:147220985702:endpoint/comfyui",
            "ResourceType": "AWS::SageMaker::Endpoint",
            "Timestamp": "2024-08-12T08:41:53.666000+00:00",
            "ResourceStatus": "DELETE_IN_PROGRESS"
        },
        {
            "EventId": "ComfyUIEndpoint-CREATE_FAILED-2024-08-12T08:41:51.395Z",
            "StackName": "comfyui",
            "LogicalResourceId": "ComfyUIEndpoint",
            "PhysicalResourceId": "arn:aws:sagemaker:us-west-2:147220985702:endpoint/comfyui",
            "ResourceType": "AWS::SageMaker::Endpoint",
            "Timestamp": "2024-08-12T08:41:51.395000+00:00",
            "ResourceStatus": "CREATE_FAILED",
            "ResourceStatusReason": "CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'",
            "ResourceProperties": "{\"EndpointName\":\"comfyui\",\"EndpointConfigName\":\"comfyui-sample\"}"

Could you please assist in resolving the issue?

khchan123 commented 1 month ago

Hi @ThorstenBux. I can see the following error reason in the provided message:

CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'

This error occurs when SageMaker fails to start the container while preparing it for inference. Could you attach the CloudWatch log from the log group /aws/sagemaker/Endpoints/comfyui? You should see a log stream named comfyui-sample/i-xxxxxxxxxxxx.
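
A minimal sketch of that log lookup from the command line, assuming the AWS CLI is configured for the same account and region as the stack. The function names are hypothetical helpers, not part of deploy.sh. If the log group was never created, the call is expected to fail rather than return streams, which is itself a hint that the container never started.

```shell
# Hypothetical helper: build the log group name SageMaker uses for an endpoint.
endpoint_log_group() {
    printf '/aws/sagemaker/Endpoints/%s\n' "$1"
}

# Hypothetical helper: list the endpoint's log streams, most recent first.
list_endpoint_log_streams() {
    aws logs describe-log-streams \
        --log-group-name "$(endpoint_log_group "$1")" \
        --order-by LastEventTime \
        --descending \
        --query 'logStreams[].logStreamName' \
        --output text
}

# Usage: list_endpoint_log_streams comfyui
```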

Also, do you see any error messages when running deploy.sh?

blakegreendev commented 1 month ago

I'm also seeing this error message. The deploy.sh script prints "Failed to create/update the stack. Run the following command to fetch the list of events leading up to the failure aws cloudformation describe-stack-events --stack-name comfyui". Running that command, I see "CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'", and the same message appears in the CloudFormation console.

I'm not seeing the log group you are referring to either...

FWIW, I'm running this from my M1 MacBook Pro. Could that have something to do with how the container image is built?

ThorstenBux commented 1 month ago

Hi @khchan123, thank you for your swift reply. I can't find the mentioned CloudWatch log entry. These are the last lines from deploy.sh:

/Users/thorstenbux/repos/rpr/sandbox/comfyui-on-amazon-sagemaker
updating: lambda_function.py (deflated 70%)
updating: workflow/ (stored 0%)
updating: workflow/workflow_api.json (deflated 75%)
  adding: workflow/flux1-dev-fp8-ckpt.json (deflated 72%)
  adding: workflow/flux1-schnell-fp8-ckpt.json (deflated 71%)
upload: ./lambda-.zip to s3://comfyui-sagemaker-147220985702-us-west-2/lambda/lambda-.zip
/Users/thorstenbux/repos/rpr/sandbox/comfyui-on-amazon-sagemaker
Deploying CloudFormation stack...

Waiting for changeset to be created..
Waiting for stack create/update to complete

Failed to create/update the stack. Run the following command
to fetch the list of events leading up to the failure
aws cloudformation describe-stack-events --stack-name comfyui

khchan123 commented 1 month ago

Thanks @ThorstenBux and @blakegreendev for the update. I just tried another clean deployment from an EC2 instance in the us-west-2 region and it worked fine, so I believe this is an environment-specific issue.

Please try printing the failed events in CloudFormation using the following command. You can also use the Detect root cause button in the AWS console (under the Events tab of the CloudFormation stack) to help locate the root cause.

aws cloudformation describe-stack-events \
    --stack-name comfyui \
    --query 'StackEvents[?ResourceStatus==`CREATE_FAILED` || ResourceStatus==`UPDATE_FAILED`].{
        LogicalResourceId: LogicalResourceId,
        ResourceType: ResourceType,
        ResourceStatus: ResourceStatus,
        ResourceStatusReason: ResourceStatusReason,
        PhysicalResourceId: PhysicalResourceId,
        ResourceProperties: ResourceProperties
    }' \
    --output table
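
The command above lists every failed event. Since describe-stack-events returns events in reverse chronological order, the earliest failure is the last match. Here is a sketch that narrows the output to that single event, assuming the same stack name; first_failed_event is a hypothetical helper name. If the stack has been deployed more than once, the earliest match may belong to an older attempt.

```shell
# Hypothetical helper: print only the earliest CREATE_FAILED/UPDATE_FAILED event.
# Events come back newest first, so [-1] picks the oldest match.
first_failed_event() {
    aws cloudformation describe-stack-events \
        --stack-name "$1" \
        --query 'StackEvents[?ResourceStatus==`CREATE_FAILED` || ResourceStatus==`UPDATE_FAILED`] | [-1].{LogicalResourceId: LogicalResourceId, ResourceStatusReason: ResourceStatusReason, Timestamp: Timestamp}' \
        --output json
}

# Usage: first_failed_event comfyui
```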

Also, you may need to request an increase in the AWS service quota for using g5 instances with SageMaker endpoints. Could you check that the service quota is sufficient? Try searching for "ml.g5.xlarge for endpoint usage" (or the instance type you chose) in Service Quotas.
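
The same quota check can be sketched with the AWS CLI, assuming the quota is named exactly "<instance type> for endpoint usage" as in the console search above. endpoint_quota is a hypothetical helper name, and the result reflects whichever region the CLI is configured for.

```shell
# Hypothetical helper: show the SageMaker endpoint quota for one instance type.
endpoint_quota() {
    aws service-quotas list-service-quotas \
        --service-code sagemaker \
        --query "Quotas[?QuotaName=='$1 for endpoint usage'].{Name: QuotaName, Value: Value}" \
        --output table
}

# Usage: endpoint_quota ml.g5.xlarge
```

A value of 0 would mean the endpoint instance cannot be launched until an increase is granted.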

ThorstenBux commented 1 month ago

Screenshot 2024-08-13 at 14 08 35

Please see above regarding the error cause

ThorstenBux commented 1 month ago

Screenshot 2024-08-13 at 14 11 13

From CloudFormation's Detect root cause, I get this.

ThorstenBux commented 1 month ago

Here are the quotas:

Screenshot 2024-08-13 at 14 12 52

khchan123 commented 1 month ago

> Screenshot 2024-08-13 at 14 11 13
> From CloudFormation's Detect root cause, I get this.

I believe this is the CloudTrail screen rather than the CloudFormation screen. Could you capture it again?

  1. Go to the CloudFormation console.
  2. Select the failed stack comfyui.
  3. Choose the Events tab.
  4. Choose Detect root cause. CloudFormation will analyze the failure and indicate the event that is the likely cause of the failure by adding a Likely root cause label to that event's Status. See Status reason for further explanation of the status in the CloudFormation console.

ThorstenBux commented 1 month ago

Hi @khchan123, this is indeed a CloudTrail screen, not CloudFormation. But CloudFormation navigated to this CloudTrail screen when I followed the route you suggested.

Screenshot 2024-08-13 at 14 50 39

Status reason: CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'

khchan123 commented 1 month ago

Hi @ThorstenBux. Could you click into the events in CloudTrail and check whether any of them failed? Also, could you dump the Parameters of the CloudFormation stack?

If you have any other information, feel free to share it. I am launching a deployment from my MacBook Pro (instead of EC2) to try to reproduce the issue.
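
As an alternative to a console screenshot, the stack parameters can be dumped with the AWS CLI. This is a sketch assuming the stack still exists (for example in a rolled-back state) and that the CLI points at the same region; dump_stack_parameters is a hypothetical helper name.

```shell
# Hypothetical helper: print the parameters of a CloudFormation stack as a table.
dump_stack_parameters() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query 'Stacks[0].Parameters' \
        --output table
}

# Usage: dump_stack_parameters comfyui
```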

ThorstenBux commented 1 month ago

Parameters:

Screenshot 2024-08-13 at 15 10 04

ThorstenBux commented 1 month ago

There is no error entry in any of the CloudTrail events

ThorstenBux commented 1 month ago

I'm on a MacBook Pro with an Apple M1 Max and 64 GB of memory, running macOS Sonoma (14.5).

khchan123 commented 1 month ago

I think I know why. The container built on your M1 Max MacBook Pro targets the arm64 architecture, but the SageMaker endpoint runs on x86 instances (g4dn, g5 or g6). That is why your container does not start and no logs are shown.

Unfortunately, there is currently no Graviton (arm64) instance with a GPU available for SageMaker endpoints. Please try again in an x86 environment.
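
A quick way to catch this mismatch before deploying is to check the build host's architecture. The sketch below is not part of deploy.sh, both function names are hypothetical, and it assumes Docker builds for the host's native architecture by default.

```shell
# Hypothetical helper: map `uname -m` output to a Docker platform string.
docker_platform_for() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        arm64|aarch64) echo "linux/arm64" ;;
        *)             echo "unknown" ;;
    esac
}

# Hypothetical helper: fail early if the host would not build an x86 image natively.
check_build_host() {
    host_platform=$(docker_platform_for "$(uname -m)")
    if [ "$host_platform" != "linux/amd64" ]; then
        echo "Host builds $host_platform images by default; the endpoint needs linux/amd64." >&2
        return 1
    fi
}

# To inspect an image that is already built (<image> is a placeholder):
#   docker image inspect --format '{{.Os}}/{{.Architecture}}' <image>
```

If building on an x86 host is not an option, `docker build --platform linux/amd64` can cross-build under emulation, though whether deploy.sh works with that, and how slow a GPU image build would be, is untested here.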

ThorstenBux commented 1 month ago

I'm trying the approach you indirectly mentioned, using an EC2 instance.

khchan123 commented 1 month ago

You may also refer to here for EC2 environment.

ThorstenBux commented 1 month ago

Running the deploy script on an EC2 instance (micro) did the trick and set up the environment. (Well, it ran successfully; I still need to check whether everything is set up properly.)

Thank you for your support.