aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] SM Data Parallel with large models - bash command to execute the docker container build script does not work #2045

Open mchoi8739 opened 3 years ago

mchoi8739 commented 3 years ago

Link to the notebook

Describe the bug: I ran the first 5 code cells to push the provided Docker container, with the image and tag set as follows:

image = "bert-smdataparallel-sagemaker"  # Example: bert-smdataparallel-sagemaker
tag = "pt-1-6"   # Example: pt1.6 

The following cell does not complete the docker build:

%%time
! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}

returning the following error:

Provided region_name '{region}' doesn't match a supported format.

Provided region_name '{region}' doesn't match a supported format.
Error: Cannot perform an interactive login from a non TTY device
invalid argument "{image}" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.
Error parsing reference: "{image}" is not a valid repository/tag: invalid reference format
invalid reference format
Error: Image build and push failed
CPU times: user 20.7 ms, sys: 6.68 ms, total: 27.4 ms
Wall time: 1.76 s

To Reproduce: Run the first 5 code cells of the notebook.
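
The literal `{region}` and `{image}` in the output above suggest that those variables were not defined in the kernel when the cell ran, in which case IPython passes the braces to the shell unchanged. A minimal sketch of a pre-flight check, assuming the variable names used in the notebook; the helper `build_command` is hypothetical and not part of the notebook, and the region value is a placeholder:

```python
import shlex


def build_command(namespace, names=("region", "image", "tag")):
    """Return the build_and_push.sh command, or raise if a variable is unset."""
    missing = [n for n in names if not namespace.get(n)]
    if missing:
        raise NameError(f"define these variables first: {', '.join(missing)}")
    args = [str(namespace[n]) for n in names]
    return "bash build_and_push.sh " + " ".join(shlex.quote(a) for a in args)


# Example values: image and tag come from the report, the region is a placeholder.
settings = {
    "region": "us-east-1",
    "image": "bert-smdataparallel-sagemaker",
    "tag": "pt-1-6",
}
command = build_command(settings)
print(command)
```

In a notebook, `build_command(globals())` would run the same check against the kernel's own variables before the `! bash build_and_push.sh ...` cell.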

mchoi8739 commented 3 years ago

Refreshed kernel conda_pytorch_p36 and retried. Getting the following error:

WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  45.57kB
Step 1/4 : ARG region
Step 2/4 : FROM 763104351884.dkr.ecr.${region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04
Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.6.0-gpu-py36-cu110-ubuntu18.04: no basic auth credentials
Error response from daemon: No such image: bert-smdataparallel-sagemaker:latest
The push refers to repository [111122223333.dkr.ecr.us-east-1.amazonaws.com/bert-smdataparallel-sagemaker]
An image does not exist locally with the tag: 111122223333.dkr.ecr.us-east-1.amazonaws.com/bert-smdataparallel-sagemaker
Error: Image build and push failed
CPU times: user 24.4 ms, sys: 1.39 ms, total: 25.8 ms
Wall time: 1.7 s

hongshanli23 commented 3 years ago

111122223333 does not look like a real account

mchoi8739 commented 3 years ago

111122223333 does not look like a real account

While copying the log, I replaced the account ID to protect my account information.

mchoi8739 commented 3 years ago

  1. We need to add a login to the base-image registry before the build step:
    ! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
  2. The notebook instance that runs the notebook should be at least a p2.xlarge, because the docker build step requires a CUDA environment.
  3. FSx steps need to be added at the beginning of the notebooks. See https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-fsx.ipynb
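
The login step in item 1 can be sketched in Python so that the registry host always matches the region of the base image. This is an illustration only: the helper names are hypothetical, the account ID is the one shown in the `FROM` line of the build log, and the hostname pattern is assumed to hold for the region in use (some regions may use a different account or domain).

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Build the ECR registry hostname for an account and region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_login_command(account_id: str, region: str) -> str:
    """Compose the shell pipeline that logs Docker in to that registry."""
    registry = ecr_registry(account_id, region)
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


# 763104351884 is the account that appears in the FROM line of the build log.
dlc_account_id = "763104351884"
region = "us-east-1"
print(ecr_login_command(dlc_account_id, region))
```

Deriving the registry from `region` avoids pairing a `--region` value with a hostname for a different region.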

mchoi8739 commented 3 years ago

Regarding item 2 above, the stage directory (the current notebook kernel) must be on an EBS volume with 100 GB of available space to download the benchmark dataset.
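
A quick way to check the space requirement before downloading is sketched below. It uses only the standard library; `stage_dir` is a hypothetical placeholder for whichever directory will hold the dataset, and the 100 GB threshold is the figure quoted above.

```python
import shutil

REQUIRED_GB = 100  # figure quoted in the comment above


def free_gb(path: str) -> float:
    """Free space, in GB (10**9 bytes), on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem containing path has at least required_gb free."""
    return free_gb(path) >= required_gb


stage_dir = "."  # hypothetical; point this at the directory that will hold the dataset
print(f"{free_gb(stage_dir):.1f} GB free, enough: {has_room(stage_dir)}")
```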

mchoi8739 commented 3 years ago

The maskrcnn example notebook returns the same error that was reported and fixed in https://github.com/aws/amazon-sagemaker-examples/pull/2056. We need to update the maskrcnn and bert training scripts as well.

santosh4b6 commented 1 year ago

I am also getting a similar error. @mchoi8739, are you able to run it now?

Code:

! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {dlc_account_id}.dkr.ecr.{region}.amazonaws.com
! chmod +x build_and_push.sh; bash build_and_push.sh {dlc_account_id} {region} {image} {tag}

Error:

WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
error checking context: 'can't stat '/home/ec2-user/SageMaker/lost+found''.
Error response from daemon: No such image: mask-rcnn-smdataparallel-sagemaker:latest
The push refers to repository [xyz.dkr.ecr.us-east-1.amazonaws.com/mask-rcnn-smdataparallel-sagemaker]
An image does not exist locally with the tag: xyz.dkr.ecr.us-east-1.amazonaws.com/mask-rcnn-smdataparallel-sagemaker
Error: Image build and push failed
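
The `can't stat` message usually indicates that the Docker build context contains an entry the current user cannot read, which here appears to be `lost+found` under `/home/ec2-user/SageMaker`. The sketch below lists such entries so they can be excluded, for example by building from a subdirectory or listing them in `.dockerignore`. That workaround is an assumption, not something confirmed in this thread, and `context_dir` is a hypothetical placeholder.

```python
import os


def unreadable_entries(context_dir: str) -> list:
    """List top-level entries in context_dir that the current user cannot read."""
    blocked = []
    for name in sorted(os.listdir(context_dir)):
        path = os.path.join(context_dir, name)
        # Directories need both read and execute permission to be traversed.
        needed = (os.R_OK | os.X_OK) if os.path.isdir(path) else os.R_OK
        if not os.access(path, needed):
            blocked.append(name)
    return blocked


context_dir = "."  # hypothetical; use the directory passed to docker build
print(unreadable_entries(context_dir))
```

An empty list means every top-level entry is readable, so the cause would lie elsewhere.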

saskra commented 1 year ago

Same problem here!

HuBaX commented 8 months ago

I'm having the same problem as @santosh4b6, with the error "error checking context: 'can't stat '/home/ec2-user/SageMaker/lost+found''". Does anyone have a fix?

santosh4b6 commented 8 months ago

@HuBaX I read somewhere that SageMaker Studio does not support Docker commands. Instead, it provides sm-docker for building Docker images. You might need to get CodeBuild permissions.

Steps to build a Docker image and push it to ECR in a Studio notebook:

  1. !pip install sagemaker-studio-image-build
  2. !sm-docker build . --repository {docker_image}

I hope this helps you.
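
The `sm-docker` call in step 2 can be composed in Python when the repository name and tag live in notebook variables. This is a sketch only: the helper name is hypothetical, the `repository:tag` form is assumed to be what `--repository` expects, and the example repository name is taken from the log earlier in the thread.

```python
import shlex


def sm_docker_build_command(repository: str, tag: str, context: str = ".") -> str:
    """Compose the sm-docker invocation for a repository:tag target."""
    return shlex.join(
        ["sm-docker", "build", context, "--repository", f"{repository}:{tag}"]
    )


# Repository name from the log above; the tag is a placeholder.
print(sm_docker_build_command("mask-rcnn-smdataparallel-sagemaker", "latest"))
```

The printed string can be pasted after `!` in a Studio notebook cell.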