aws-samples / sagemaker-studio-image-build-cli

CLI for building Docker images in SageMaker Studio using AWS CodeBuild.
https://pypi.org/project/sagemaker-studio-image-build/
MIT No Attribution

Any way to specify additional ECR registries to log in to? #13

Open athewsey opened 3 years ago

athewsey commented 3 years ago

I'm trying to use sm-docker build to build a container derived from the SageMaker Scikit-Learn framework container in ap-southeast-1, something like the following:

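import sagemaker
import sagemaker.sklearn.defaults  # load the submodule explicitly

smsess = sagemaker.Session()  # assumed: the original snippet did not show how smsess was created

# Look up the framework image URI for the session's region
base_docker_uri = sagemaker.image_uris.retrieve(
    sagemaker.sklearn.defaults.SKLEARN_NAME,
    smsess.boto_region_name,
    version="0.23-1",
    instance_type="ml.m5.xlarge",
)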
# 121021644041.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3

...so Dockerfile is FROM 121021644041.dkr....etc

It seems the CLI tool spins up successfully and logs in to a load of other ECR registries, but not 121021644041. The build then fails on step 1 with:

[Container] 2021/04/20 02:54:22 Running command docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
Sending build context to Docker daemon   7.68kB
Step 1/2 : FROM 121021644041.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3
Get https://121021644041.dkr.ecr.ap-southeast-1.amazonaws.com/v2/sagemaker-scikit-learn/manifests/0.23-1-cpu-py3: no basic auth credentials

[Container] 2021/04/20 02:54:22 Command did not exit successfully docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG . exit status 1

I've since tested and on a SageMaker Notebook Instance I can build the same Dockerfile fine, so long as I log in to the 121021644041 ECR first.

From a cursory look at the job logs and #12, it looks like the current strategy is to have the tool ecr login to every AWS account on which AWS DLCs are provided?

...So would the correct fix be to add every account ID listed here to support SKLearn?

I was thinking it might be preferable to also add a way for users to indicate extra required account IDs through the CLI, since:

jaipreet-s commented 3 years ago

Thanks for the detailed write-up.

Agreed that the preferable approach would be to not require logging into each and every ECR registry. Your suggestion of adding a registry-id parameter works; however, we could make this seamless by auto-detecting the base ECR registry and region and logging into it. Here's how that could work:

  1. Read the provided Dockerfile from disk and extract the FROM statement
  2. If the FROM is an ECR repository, then extract the registry ID, region, and partition
  3. Log into the detected ECR registry in the right region and partition.
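
The three steps above could be sketched roughly as follows. This is only an illustration, not the tool's actual implementation: the names `EcrRegistry`, `partition_for`, and `detect_ecr_registries` are hypothetical, and the region-prefix-to-partition mapping is a simplifying assumption that only covers the `aws`, `aws-cn`, and `aws-us-gov` partitions.

```python
import re
from typing import List, NamedTuple

# ECR hosts look like <12-digit account>.dkr.ecr.<region>.amazonaws.com[.cn]
ECR_HOST = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?$"
)


class EcrRegistry(NamedTuple):
    account_id: str
    region: str
    partition: str


def partition_for(region: str) -> str:
    # Assumption: infer the partition from the region prefix only.
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def detect_ecr_registries(dockerfile_text: str) -> List[EcrRegistry]:
    """Return the distinct ECR registries referenced by FROM statements."""
    found: List[EcrRegistry] = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64 to reach the image name.
        image = next((t for t in tokens[1:] if not t.startswith("--")), None)
        if image is None:
            continue
        match = ECR_HOST.match(image.split("/", 1)[0])
        if match is None:
            continue  # not an ECR image (e.g. Docker Hub)
        registry = EcrRegistry(
            match["account"], match["region"], partition_for(match["region"])
        )
        if registry not in found:
            found.append(registry)
    return found
```

Each returned entry carries what step 3 needs: the account ID to log in to, plus the region and partition to log in from.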

athewsey commented 3 years ago

Yeah auto-detection was my first thought & preference too - but then wondered if there might still be some edge cases a naïve implementation could miss... E.g. there could be multi-stage builds with multiple FROM statements (which should be easy enough to handle), or maybe there are use cases where it's not obvious from the Dockerfile at all which registri(es) are needed? I don't know enough to rule it out.
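
To make those edge cases concrete, here is a minimal sketch that sorts the FROM references in a Dockerfile into three buckets: plain image names, references to an earlier build stage, and names that depend on a build-time variable and so cannot be resolved by reading the file alone. The function name `classify_from_lines` and the bucket names are hypothetical, and the parsing is deliberately naive (no line continuations, no variable substitution).

```python
from typing import Dict, List


def classify_from_lines(dockerfile_text: str) -> Dict[str, List[str]]:
    """Group FROM references by whether static detection can handle them."""
    result: Dict[str, List[str]] = {"images": [], "stage_refs": [], "unresolved": []}
    stage_names = set()
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=... so args[0] is the image reference.
        args = [t for t in tokens[1:] if not t.startswith("--")]
        if not args:
            continue
        image = args[0]
        if image in stage_names:
            result["stage_refs"].append(image)   # e.g. FROM builder
        elif "$" in image:
            result["unresolved"].append(image)   # e.g. FROM ${BASE_IMAGE}
        else:
            result["images"].append(image)
        # Remember "AS <name>" so later stages that reuse it are recognised.
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
    return result
```

Anything that lands in `unresolved` is a case where the required registry is not obvious from the Dockerfile, which is where an explicit user-supplied option would still be needed.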

jaipreet-s commented 3 years ago

The underlying tenet of this library is that it works out of the box without requiring any additional inputs over a regular docker build . This works by setting sensible defaults for underlying AWS resources like S3 and CodeBuild.

There may well be edge cases, but if we can handle 80% of the use cases with auto-detection, then that is the default behavior to go with, while allowing power users to specify additional, optional fields to override the defaults.
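
A minimal sketch of that "defaults plus optional overrides" idea, assuming auto-detection yields `(account_id, region)` pairs and that users could pass extra account IDs through some optional parameter. The function name `registries_to_login` and both parameters are hypothetical; nothing here reflects an existing option of the tool.

```python
from typing import Iterable, List, Optional, Tuple

Registry = Tuple[str, str]  # (account_id, region)


def registries_to_login(
    detected: Iterable[Registry],
    extra_account_ids: Optional[Iterable[str]] = None,
    default_region: str = "us-east-1",
) -> List[Registry]:
    """Combine auto-detected registries with optional user-supplied ones."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(detected))
    for account_id in extra_account_ids or []:
        entry = (account_id, default_region)
        if entry not in merged:
            merged.append(entry)
    return merged
```

With no extra IDs supplied, the result is just the detected list, so the out-of-the-box behavior is unchanged.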

mentos1386 commented 3 years ago

Is there any workaround for this problem?

I'm trying to build the following Dockerfile: https://github.com/aws/amazon-sagemaker-examples/blob/master/training/distributed_training/tensorflow/data_parallel/maskrcnn/Dockerfile

EDIT: I actually had an issue where ECR authentication was being done for us-east-1 while the Dockerfile referenced an image from us-west-2. Changing the region in the Dockerfile to us-east-1 fixed the issue.
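
One way to avoid that kind of region mismatch is to take the login region from the image URI itself rather than from the default region. The sketch below only builds the command string (using the `aws ecr get-login-password` and `docker login` form) and does not run anything; the helper name `ecr_login_command` is hypothetical.

```python
import re
from typing import Optional

# <12-digit account>.dkr.ecr.<region>.amazonaws.com[.cn]
_ECR_HOST = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(\.cn)?$")


def ecr_login_command(image_uri: str) -> Optional[str]:
    """Build a login command whose region matches the image's registry."""
    host = image_uri.split("/", 1)[0]
    match = _ECR_HOST.match(host)
    if match is None:
        return None  # not an ECR image, so no ECR login is needed
    region = match.group(2)
    return (
        f"aws ecr get-login-password --region {region} "
        f"| docker login --username AWS --password-stdin {host}"
    )
```

For an image hosted in us-west-2, the generated command uses `--region us-west-2` regardless of which region the build itself runs in.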

kondo-kj commented 2 years ago

I get the same error when I try to build an image based on this FROM statement:

FROM 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.2-1

I thought account 683313688378 was logged in by the following command, so what could be the cause, and is there a workaround?

Running command $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION --registry-ids 683313688378)

ajay-bhargava commented 2 years ago

I share the same concern as the author of this Issue. It is not possible to share AUTH credentials with the library. As a result, it is impossible to build upon ECR registry containers.

Similar example:

"https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.11.0-gpu-py38-cu115-ubuntu20.04-e3": no basic auth credentials

This happens even if you run the following immediately beforehand:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com