aws-solutions / qnabot-on-aws

AWS QnABot is a multi-channel, multi-language conversational interface (chatbot) that responds to your customer's questions, answers, and feedback. The solution allows you to deploy a fully functional chatbot across multiple channels including chat, voice, SMS and Amazon Alexa.
https://aws.amazon.com/solutions/implementations/aws-qnabot
Apache License 2.0

Deploy Sagemaker "Serverless" option fails with error "Image size 13279248421 is greater than supported" #691

Closed jeve7 closed 8 months ago

jeve7 commented 9 months ago

Describe the bug: I am trying to deploy the latest version (5.5.0) in a DEV environment, so I am selecting "Serverless" for Sagemaker (SagemakerInitialInstanceCount = 0). The deployment fails while creating the Sagemaker endpoint, with the message: "Image size 13279248421 is greater than supported size 10737418240". I guess this problem is new in 5.5.0, since I did the same thing previously with 5.4.5 and it worked fine.


Expected behavior: Either the deployment should succeed with Sagemaker configured as Serverless (as in 5.4.5), or the label/documentation should be updated to state that Serverless is no longer an option, meaning one Sagemaker instance is the smallest available footprint.


Screenshots: CloudFormation message (image: QnABot-SM-Exception)

dougtoppin commented 9 months ago

@jeve7 thanks for your report, we will take a look at it and get back to you

bios6 commented 9 months ago

Hi @jeve7 ,

I just deployed v5.5.0 by cloning the GitHub repo, and the SagemakerEmbeddingsStack deployment succeeded for me. Are you referencing the model here?: https://github.com/aws-solutions/qnabot-on-aws/blob/a4828a0fbbffee53146b6244e3b47dd6af8dca84/templates/sagemaker-embeddings/index.js#L51

Also, I believe you mentioned in a previous issue that you were migrating from v5.4.5, so could you have modified something that is causing this? I would recommend trying a fresh deployment, following the readme, to see if that succeeds for you. That should show whether the issue comes from changes you might have made.

jeve7 commented 9 months ago

Interesting... thanks for the info. I did a brand new deployment twice using the public template from here: https://docs.aws.amazon.com/solutions/latest/qnabot-on-aws/step-1-launch-the-stack.html. I clicked the "Launch" link, which opens CloudFormation, changed the region to "ca-central-1", and moved forward. It failed twice with the same message when I selected "SagemakerInitialInstanceCount = 0". The second time I changed the rollback setting so I could see the real error in the Sagemaker stack. It did work when I changed to "SagemakerInitialInstanceCount = 1". The template is already "cooked" and everything appears to be in a bucket somewhere. I didn't touch the repo in any way.

rpilic commented 9 months ago

I have also experienced the same issue. @bios6, it wasn't clear from your response: did you try setting SagemakerInitialInstanceCount = 0? The ability to run QnABot in serverless mode is important for cost savings in a non-production environment.

fhoueto-amz commented 9 months ago

Hi @jeve7, the latest update to the embedding model image is larger than 10GB, which is the image size limit for a Sagemaker serverless container. Our current recommendation is to use an instance rather than serverless. We are reviewing this to determine our way forward.
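
For anyone checking the numbers, here is a minimal sketch of the comparison behind the error. It is not part of QnABot or any AWS SDK: the function names are hypothetical, the two byte counts are copied from the error message in this thread, and treating an image exactly at the limit as acceptable is an assumption based on the "greater than" wording.

```python
# Values copied from the CloudFormation error message quoted in this thread.
SERVERLESS_IMAGE_LIMIT_BYTES = 10737418240  # equals 10 * 1024**3, i.e. 10 GiB
REPORTED_IMAGE_SIZE_BYTES = 13279248421

GIB = 1024 ** 3


def fits_serverless(image_size_bytes: int,
                    limit_bytes: int = SERVERLESS_IMAGE_LIMIT_BYTES) -> bool:
    """Hypothetical helper: True if the image is not larger than the limit.

    Assumption: an image exactly at the limit is accepted, since the error
    text says "greater than supported size".
    """
    return image_size_bytes <= limit_bytes


def overage_bytes(image_size_bytes: int,
                  limit_bytes: int = SERVERLESS_IMAGE_LIMIT_BYTES) -> int:
    """Hypothetical helper: bytes by which the image exceeds the limit (0 if it fits)."""
    return max(0, image_size_bytes - limit_bytes)


if __name__ == "__main__":
    size = REPORTED_IMAGE_SIZE_BYTES
    print(f"image size: {size / GIB:.2f} GiB")
    print(f"limit:      {SERVERLESS_IMAGE_LIMIT_BYTES / GIB:.2f} GiB")
    print(f"fits serverless: {fits_serverless(size)}")
    print(f"over by: {overage_bytes(size) / GIB:.2f} GiB")
```

With the values from the error message, the image is roughly 2.4 GiB over the limit, which is why only SagemakerInitialInstanceCount = 1 or higher deploys in 5.5.0.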

jeve7 commented 9 months ago

Sounds good, thanks for the info @fhoueto-amz.

bios6 commented 8 months ago

Closing this as Serverless is not deployable. Our documentation in the next release will be updated to mention that.