aws-samples / mlops-amazon-sagemaker

Workshop content for applying DevOps practices to Machine Learning workloads using Amazon SageMaker
Apache License 2.0

create_training_job fails in region `eu-central-1` with `Invalid DNS suffix 'amazonaws.com' for region 'us-east-1' in training image` #9

Open peter-vandenabeele-axa opened 4 years ago

peter-vandenabeele-axa commented 4 years ago

When following the tutorial for the built-in model and deploying in eu-central-1 (Frankfurt), the Lambda function that logs to /aws/lambda/MLOps-BIA-TrainModel-pva fails with:

...
[INFO]Container Path 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:1
...
An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid DNS suffix 'amazonaws.com' for region 'us-east-1' in training image. Please provide the valid <region>.<dns-suffix>: 'eu-central-1.amazonaws.com'
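The mismatch in this log can be caught before CreateTrainingJob is called. The following is a minimal sketch, assuming image URIs have the `<account>.dkr.ecr.<region>.<dns-suffix>/<repository>` shape seen above; `image_region` and `check_image_region` are hypothetical helper names, not part of the workshop code.

```python
import re

# Assumed URI shape: <account>.dkr.ecr.<region>.<dns-suffix>/<repository>[:<tag>]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)"
    r"\.(?P<suffix>amazonaws\.com(?:\.cn)?)/(?P<repo>.+)$"
)


def image_region(image_uri):
    """Return the region embedded in an ECR image URI."""
    match = _ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"not a recognised ECR image URI: {image_uri!r}")
    return match.group("region")


def check_image_region(image_uri, job_region):
    """Raise ValueError if the image lives in a different region than the job."""
    found = image_region(image_uri)
    if found != job_region:
        raise ValueError(
            f"training image is in {found!r} but the job runs in {job_region!r}"
        )


if __name__ == "__main__":
    uri = "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:1"
    print(image_region(uri))
    try:
        check_image_region(uri, "eu-central-1")
    except ValueError as err:
        print("[ERROR]", err)
```

Calling `check_image_region(container_path, region)` just before the training job is created would turn the ValidationException into an earlier and more explicit failure.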

I presume this is caused by an incorrect value in the AlgoECR environment variable, which the function reads with ecr_path = os.environ['AlgoECR'] at this line:

https://github.com/aws-samples/amazon-sagemaker-devops-with-ml/blob/abac90b15b438f00c0deab4470cf162410c5d600/1-Built-In-Algorithm/lambda-code/MLOps-BIA-TrainModel.py#L70

To confirm this, I hard-coded ecr_path to the correct registry for eu-central-1 with the code below (edited directly in the Lambda function), and it works:

        #Get ECR information for BIA
        algo_version = user_param['Algorithm']

        #ecr_path = os.environ['AlgoECR']
        # HARD CODE OVERRIDE by peter_v
        ecr_path = '813361260812.dkr.ecr.eu-central-1.amazonaws.com'

        container_path = ecr_path + '/' + algo_version
        print('[INFO]Container Path', container_path)

I took that value from the entry for the XGBoost algorithm in eu-central-1 on this page:

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

There may be a way to set the AlgoECR environment variable correctly, but I did not immediately see one in the tutorial README.
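One possible alternative to the hard-coded override is to pick the registry from the region the Lambda function runs in. This is only a sketch under stated assumptions: the table holds just the two registries that appear in this issue (the rest would come from the registry paths page), `resolve_ecr_path` is a hypothetical helper, and the region is read from `AWS_REGION`, which the Lambda runtime sets.

```python
import os

# Partial, illustrative table: only the two registries mentioned in this issue.
XGBOOST_REGISTRIES = {
    "us-east-1": "811284229777.dkr.ecr.us-east-1.amazonaws.com",
    "eu-central-1": "813361260812.dkr.ecr.eu-central-1.amazonaws.com",
}


def resolve_ecr_path(region=None, env=None):
    """Return a registry host that matches the region of the training job."""
    env = os.environ if env is None else env
    region = region or env.get("AWS_REGION")
    if not region:
        raise ValueError("no region given and AWS_REGION is not set")

    if region in XGBOOST_REGISTRIES:
        return XGBOOST_REGISTRIES[region]

    # Fall back to AlgoECR only when it already points at the right region.
    configured = env.get("AlgoECR", "")
    if f".dkr.ecr.{region}." in configured:
        return configured

    raise KeyError(f"no registry known for region {region!r}")


if __name__ == "__main__":
    algo_version = "xgboost:1"  # value seen in the log above
    container_path = resolve_ecr_path("eu-central-1") + "/" + algo_version
    print("[INFO]Container Path", container_path)
```

In the Lambda function this would replace the `ecr_path = os.environ['AlgoECR']` line, so a deployment in a listed region would no longer depend on AlgoECR being set per region.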