aws-samples / aws-sagemaker-build

Creates a CloudFormation template that uses AWS StepFunctions to automate the building and training of Sagemaker custom models based on S3 and GitHub events
Apache License 2.0

Unable to pull TensorFlow Container #28

Closed oelesinsc24 closed 5 years ago

oelesinsc24 commented 5 years ago

When deploying a TensorFlow training job with SageMaker Build, we get the error:

Failure reason
ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

The container image URI in ECR is: 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.0.0-cpu-py2.

We also tried pulling the container with the docker CLI from a shell, after a successful ECR login with the AWS CLI, and got the error: Error response from daemon: manifest for 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.0.0-cpu-py2 not found.

Going through the documentation on pre-built containers (https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html), it appears that the account 520713654638 does not have container images in the region eu-west-1. However, SageMaker TensorFlow container images are available in the account 763104351884, and we were able to pull one successfully via docker pull:

docker pull 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:1.14-gpu-py2
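To compare the two image references side by side, it can help to split each one into account, region, repository, and tag. The following is only a minimal sketch: the `EcrImage` type and the `parse_ecr_image_uri` helper are hypothetical names made up for illustration and are not part of aws-sagemaker-build. The regular expression assumes the `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` shape seen in the two URIs above.

```python
import re
from typing import NamedTuple


class EcrImage(NamedTuple):
    """Components of an ECR image URI (hypothetical helper type)."""
    account: str
    region: str
    repository: str
    tag: str


# Assumed shape: <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> EcrImage:
    """Split an ECR image URI into its parts, or raise ValueError."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return EcrImage(**match.groupdict())


if __name__ == "__main__":
    failing = parse_ecr_image_uri(
        "520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.0.0-cpu-py2"
    )
    working = parse_ecr_image_uri(
        "763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:1.14-gpu-py2"
    )
    # The two references differ in account, repository name, and tag format.
    print(failing)
    print(working)
```

Parsed this way, the differences are easy to see: the failing reference and the working one use different accounts, different repository names (sagemaker-tensorflow and tensorflow-training), and different tag formats.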

The relevant code would probably be here: https://github.com/aws-samples/aws-sagemaker-build/blob/5a16995af6fcf8ac12caa56f55f287ba0b288754/lambda/util/nodejs/lib/CreateImageURI.js#L37

Thanks a lot for your help.

oelesinsc24 commented 5 years ago

@JohnCalhoun, we actually resolved this. It turns out that the image 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.0.0-cpu-py2 is no longer available because the framework version 1.0.0 is no longer supported.

With versions 1.8.0 and 1.12.0, model training was successful. Image URIs:

520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.8.0-cpu-py2
520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.12.0-cpu-py2
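For reference, the pattern behind these URIs can be sketched as a small function. This is an illustration under stated assumptions, not the project's actual CreateImageURI.js logic: the function name `legacy_tensorflow_image_uri` and the `WORKING_VERSIONS` set are hypothetical, the account ID and repository name are copied from the URIs in this thread, and the version list covers only the two versions reported as working here. It is not a complete list of supported versions, and other regions may use different accounts.

```python
# Account ID and repository name as they appear in this thread (eu-west-1).
LEGACY_ACCOUNT = "520713654638"
LEGACY_REPOSITORY = "sagemaker-tensorflow"

# Only the versions reported as working in this thread; not an official list.
WORKING_VERSIONS = {"1.8.0", "1.12.0"}


def legacy_tensorflow_image_uri(
    region: str,
    version: str,
    processor: str = "cpu",
    py_version: str = "py2",
) -> str:
    """Build a URI of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>.

    Raises ValueError for versions not known to work, such as 1.0.0,
    so the problem surfaces before the training job tries to pull the image.
    """
    if version not in WORKING_VERSIONS:
        raise ValueError(
            f"TensorFlow {version} is not in the known-working set "
            f"{sorted(WORKING_VERSIONS)}"
        )
    tag = f"{version}-{processor}-{py_version}"
    return f"{LEGACY_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/{LEGACY_REPOSITORY}:{tag}"


if __name__ == "__main__":
    print(legacy_tensorflow_image_uri("eu-west-1", "1.8.0"))
    print(legacy_tensorflow_image_uri("eu-west-1", "1.12.0"))
```

Validating the version up front turns the "Cannot pull algorithm container" failure into an immediate, more descriptive error.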