Closed ThorstenBux closed 1 month ago
Hi @ThorstenBux. I can see the following error reason in the provided message:
CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'
This error occurs when SageMaker cannot start the container that serves inference. Could you attach the CloudWatch logs from the log group /aws/sagemaker/Endpoints/comfyui? You should see a log stream named comfyui-sample/i-xxxxxxxxxxxx.
Besides, do you see any error message when running deploy.sh?
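For anyone following along, the log streams can also be listed from the CLI. This is a minimal sketch, assuming the endpoint is named comfyui as above and that the AWS CLI is configured for the deployment region. The function names are hypothetical helpers, not part of deploy.sh.

```shell
#!/bin/sh
# Build the CloudWatch log group name that SageMaker uses for an endpoint.
# $1 = endpoint name (assumed to be "comfyui" in this thread)
log_group_for_endpoint() {
  printf '/aws/sagemaker/Endpoints/%s\n' "$1"
}

# List the most recently active log streams for an endpoint.
# Fails with ResourceNotFoundException if the log group was never created,
# which is itself a hint that the container never started.
list_endpoint_log_streams() {
  aws logs describe-log-streams \
    --log-group-name "$(log_group_for_endpoint "$1")" \
    --order-by LastEventTime \
    --descending \
    --max-items 5 \
    --query 'logStreams[].logStreamName' \
    --output text
}

# Usage (requires AWS credentials):
#   list_endpoint_log_streams comfyui
```

If the log group does not exist at all, the container most likely failed before it could write any output.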
I'm also seeing this error message. The deploy.sh script prints "Failed to create/update the stack. Run the following command to fetch the list of events leading up to the failure aws cloudformation describe-stack-events --stack-name comfyui". That's when I see the "CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'" error.
I'm not seeing the Log group you are referring to either...
FWIW, I'm running this from my M1 Macbook Pro. Could that have something to do with how the container image is built?
Hi @khchan123, thank you for your swift reply.
I can't find the mentioned CloudWatch log entry.
These are the last lines from deploy.sh
/Users/thorstenbux/repos/rpr/sandbox/comfyui-on-amazon-sagemaker
updating: lambda_function.py (deflated 70%)
updating: workflow/ (stored 0%)
updating: workflow/workflow_api.json (deflated 75%)
adding: workflow/flux1-dev-fp8-ckpt.json (deflated 72%)
adding: workflow/flux1-schnell-fp8-ckpt.json (deflated 71%)
upload: ./lambda-.zip to s3://comfyui-sagemaker-147220985702-us-west-2/lambda/lambda-.zip
/Users/thorstenbux/repos/rpr/sandbox/comfyui-on-amazon-sagemaker
Deploying CloudFormation stack...
Waiting for changeset to be created..
Waiting for stack create/update to complete
Failed to create/update the stack. Run the following command
to fetch the list of events leading up to the failure
aws cloudformation describe-stack-events --stack-name comfyui
Thanks @ThorstenBux and @blakegreendev for the update. I just tried another clean deployment from an EC2 in us-west-2 region and it works fine, so I believe it is some environment-specific issue.
Please try to print the first failed event in CloudFormation using the following command. You may use the Detect root cause button in the AWS console (under Events tab of CloudFormation stack) to help locate the root cause.
aws cloudformation describe-stack-events \
--stack-name comfyui \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED` || ResourceStatus==`UPDATE_FAILED`].{
LogicalResourceId: LogicalResourceId,
ResourceType: ResourceType,
ResourceStatus: ResourceStatus,
ResourceStatusReason: ResourceStatusReason,
PhysicalResourceId: PhysicalResourceId,
ResourceProperties: ResourceProperties
}' \
--output table
Besides, you may need to request an increase in the AWS service quota for using g5 instances in a SageMaker endpoint. Could you check that the service quota is sufficient? Try searching for "ml.g5.xlarge for endpoint usage" (or the instance type you have chosen) in Service Quotas.
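The same quota lookup can be done from the CLI. This is a rough sketch, assuming the quota is named "<instance type> for endpoint usage" as in the Service Quotas console. The helper names are hypothetical.

```shell
#!/bin/sh
# Build the Service Quotas display name for a SageMaker endpoint instance type.
# $1 = instance type, e.g. ml.g5.xlarge
quota_name_for() {
  printf '%s for endpoint usage\n' "$1"
}

# Print the applied quota value for the given instance type.
# A value of 0 means an increase must be requested before deploying.
show_endpoint_quota() {
  name="$(quota_name_for "$1")"
  aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --query "Quotas[?QuotaName=='${name}'].{Name:QuotaName,Value:Value}" \
    --output table
}

# Usage (requires AWS credentials):
#   show_endpoint_quota ml.g5.xlarge
```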
Please see above regarding the error cause
From CF on the find RootCause I get this
Here are the quotas:
From CF on the find RootCause I get this
I believe this is the CloudTrail screen instead of the CloudFormation screen. Could you capture again?
Hi @khchan123, this is indeed a CloudTrail screen, not CF. But CF navigated to this CT screen when I followed the route you suggested.
Status reason:
CannotStartContainerError. Please ensure the model container for variant comfyui-sample starts correctly when invoked with 'docker run <image> serve'
Hi @ThorstenBux. Could you click into the events in CloudTrail and check if any of those failed? Besides, could you dump the Parameters in CloudFormation stack?
If you have any other information, feel free to share it. I am launching one from my MBP (instead of EC2) to try to reproduce the issue.
Parameters:
There is no error entry in any of the CloudTrail events
I'm on a MBP Apple M1 Max 64GB Sonoma (14.5)
I think I know why. The container built on your M1 Max MBP targets the arm64 architecture, but the SageMaker endpoint runs on x86 instances (g4dn, g5 or g6). That's why your container does not start and no logs are shown.
Unfortunately, there is currently no Graviton (arm64) instance with GPU available for SageMaker endpoints. Please try again in an x86 environment.
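To confirm the mismatch, the architecture of the host and of a built image can be compared. This is a minimal sketch that assumes Docker is installed locally. The function names, image tag and build context are hypothetical, and how deploy.sh actually invokes the build is not shown in this thread.

```shell
#!/bin/sh
# Map a machine architecture (as printed by `uname -m`) to a Docker platform.
docker_platform_for() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    arm64|aarch64) echo "linux/arm64" ;;
    *)             echo "unknown" ;;
  esac
}

# Print the platform an existing local image was built for.
# $1 = image name or ID
image_platform() {
  docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1"
}

# Build explicitly for x86_64, regardless of the host architecture.
# $1 = image tag, $2 = build context directory
build_for_x86() {
  docker build --platform linux/amd64 -t "$1" "$2"
}

# Usage (requires Docker):
#   docker_platform_for "$(uname -m)"   # host platform
#   image_platform <image>              # platform of a built image
#   build_for_x86 <tag> <context-dir>
```

On Apple silicon, a linux/amd64 build runs under emulation and can be very slow for a large GPU image, so building on an x86 machine such as an EC2 instance may still be the more practical route.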
I'm trying the approach you indirectly mentioned, using an EC2 instance.
Running the deploy script on an EC2 instance (micro) did the trick and set up the environment. (Well, it ran successfully; I still need to check whether everything is set up properly.)
Thank you for your support.
Hi,
I'm getting:
Could you please assist in resolving the issue.