aws / sagemaker-distribution

A set of Docker images that include popular frameworks for machine learning, data science and visualization.

Using a custom SM image based on sagemaker-distribution hangs and fails in SM studio #178

Open FrikadelleHelle opened 6 months ago

FrikadelleHelle commented 6 months ago

I am not sure where to report this, but since the docs outline how to build a custom image, I will try here.

I am building this custom image, pushing it to ECR, adding it to SageMaker images, and creating an app image config, as described in the docs.

I am defining my Docker image like this:

FROM --platform=linux/amd64 public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER $ROOT
RUN apt-get clean
# dependencies for building python and having opencv
RUN apt-get update && \
    apt-get install -y gcc g++ python3-dev ffmpeg libsm6 libxext6 && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get clean

USER $MAMBA_USER
# copy the environment.yml file into the container
COPY --chown=$MAMBA_USER:$MAMBA_USER processing/environment.yml /tmp/environment.yml

# Use micromamba to install the dependencies from the environment.yml file
RUN micromamba install -y -n base -f /tmp/environment.yml && \
  micromamba clean --all --yes

The only difference I can see in the logs is in the two lines marked below, both timestamped 2024-02-09T10:23:34.006+01:00:


2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.875 ServerApp] Loading SageMaker Studio EMR server extension 0.1.9

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.876 ServerApp] sagemaker_jupyterlab_emr_extension | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.876 ServerApp] Loading SageMaker JupyterLab server extension 0.2.0

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.877 ServerApp] Loading SageMaker JupyterLab common server extension 0.1.9

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension_common | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.878 ServerApp] Serving notebooks from local directory: /home/sagemaker-user

this line -------> 2024-02-09T10:23:34.006+01:00    [I 2024-02-09 09:23:33.878 ServerApp] Jupyter Server 2.10.0 is running at: <-------- this line

2024-02-09T10:23:34.006+01:00   [I 2024-02-09 09:23:33.878 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

2024-02-09T10:23:34.006+01:00   [W 2024-02-09 09:23:33.882 ServerApp] No web browser found: Error('could not locate runnable browser').

this line -------> 2024-02-09T10:23:34.006+01:00    [C 2024-02-09 09:23:33.882 ServerApp] To access the server, open this file in a browser: file:///home/sagemaker-user/.local/share/jupyter/runtime/jpserver-1-open.html Or copy and paste one of these URLs: <------ and this line

2024-02-09T10:23:34.006+01:00   INFO: State start

2024-02-09T10:23:34.006+01:00   INFO: Scheduler at: inproc://169.255.254.1/1/1

2024-02-09T10:23:34.006+01:00   INFO: dashboard at: http://169.255.254.1:8787/status

2024-02-09T10:23:34.006+01:00   INFO: Registering Worker plugin shuffle

2024-02-09T10:23:34.006+01:00   INFO: Start worker at: inproc://169.255.254.1/1/4

2024-02-09T10:23:34.006+01:00   INFO: Listening to: inproc169.255.254.1

2024-02-09T10:23:34.006+01:00   INFO: Worker name: 0

2024-02-09T10:23:34.006+01:00   INFO: dashboard at: 169.255.254.1:39899

2024-02-09T10:23:34.006+01:00   INFO: Waiting to connect to: inproc://169.255.254.1/1/1

2024-02-09T10:23:34.006+01:00   INFO: -------------------------------------------------

2024-02-09T10:23:34.006+01:00   INFO: Threads: 2

2024-02-09T10:23:34.006+01:00   INFO: Memory: 3.78 GiB

2024-02-09T10:23:34.006+01:00   INFO: Local Directory: /tmp/dask-scratch-space/worker-ylr01t6j

2024-02-09T10:23:35.259+01:00   INFO: -------------------------------------------------

2024-02-09T10:23:35.260+01:00   INFO: Register worker <WorkerState 'inproc://169.255.254.1/1/4', name: 0, status: init, memory: 0, processing: 0>

2024-02-09T10:23:35.260+01:00   INFO: Starting worker compute stream, inproc://169.255.254.1/1/4

2024-02-09T10:23:35.260+01:00   INFO: Starting established connection to inproc://169.255.254.1/1/5

2024-02-09T10:23:35.260+01:00   INFO: Starting Worker plugin shuffle

2024-02-09T10:23:35.260+01:00   INFO: Registered to: inproc://169.255.254.1/1/1

2024-02-09T10:23:35.260+01:00   INFO: -------------------------------------------------

2024-02-09T10:23:35.260+01:00   INFO: Starting established connection to inproc://169.255.254.1/1/1

2024-02-09T10:23:35.260+01:00   INFO: Receive client connection: Client-e58dc3c4-c72c-11ee-8001-6efbcde7e649

2024-02-09T10:23:35.510+01:00   INFO: Starting established connection to inproc://169.255.254.1/1/6

2024-02-09T10:23:39.515+01:00   [I 2024-02-09 09:23:35.316 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-

In the logs of working images, the marked lines are followed by a correctly configured URL.

I am in VPC-only mode for the domain, but I don't see how that should change anything, since the sagemaker-distribution image works fine.

I would appreciate any pointers.
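One way to check an exported log for this difference is sketched below in Python. It is only an illustration: the function name `urls_after_banner` and the regular expressions are assumptions written against the log lines quoted in this thread, not part of any SageMaker or Jupyter tooling.

```python
import re

# Banner line that Jupyter Server prints before listing its URLs.
RUNNING_AT = re.compile(r"Jupyter Server \S+ is running at:")
# An http(s) URL logged by ServerApp directly after its log prefix.
SERVER_URL = re.compile(r"ServerApp\]\s+(https?://\S+)")


def urls_after_banner(log_lines):
    """Return the URLs logged directly after the 'is running at:' banner."""
    urls = []
    collecting = False
    for line in log_lines:
        if not line.strip():
            continue  # skip blank separator lines
        if not collecting:
            if RUNNING_AT.search(line):
                collecting = True
            continue
        match = SERVER_URL.search(line)
        if match is None:
            break  # the first non-URL line ends the banner block
        urls.append(match.group(1))
    return urls


# Two lines from the failing image's log quoted above: no URL follows.
failing = [
    "[I 2024-02-09 09:23:33.878 ServerApp] Jupyter Server 2.10.0 is running at:",
    "[I 2024-02-09 09:23:33.878 ServerApp] Use Control-C to stop this server "
    "and shut down all kernels (twice to skip confirmation).",
]
print(urls_after_banner(failing))  # prints []
```

An empty result for a custom image, compared with a non-empty result for a working one, points at the server being started without its URL-related arguments.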

claytonparnell commented 6 months ago

Can you provide the environment.yml packages? And are these logs from the /aws/sagemaker/studio CloudWatch log group?

FrikadelleHelle commented 6 months ago

Yes, these are logs for the /aws/sagemaker/studio log group

This is a bare-bones example of environment.yml that fails for me.

name: base
channels:
- conda-forge
dependencies:
- python==3.10
- jupyterlab
- pip
- pip:
  - ipykernel
  - sagemaker

But it's correct that this should be able to work as a custom JupyterLab image in the new Studio as well?

If it helps, I can provide my config too.

aws sagemaker describe-image
        {
            "CreationTime": 1707914687.086,
            "DisplayName": "prod-sagemaker-image",
            "ImageArn": "arn:aws:sagemaker:some-arn",
            "ImageName": "image_name_1",
            "ImageStatus": "CREATED",
            "LastModifiedTime": 1707914688.064
        },

aws sagemaker describe-app-image-config
        {
            "AppImageConfigArn": "arn:aws:sagemaker:some_app_image_config_arn",
            "AppImageConfigName": "sagemaker-app-image-config",
            "CreationTime": 1707403360.291,
            "LastModifiedTime": 1707405437.631,
            "KernelGatewayImageConfig": {
                "KernelSpecs": [
                    {
                        "Name": "python3",
                        "DisplayName": "yesyes"
                    }
                ],
                "FileSystemConfig": {
                    "MountPath": "/home/sagemaker-user",
                    "DefaultUid": 1000,
                    "DefaultGid": 100
                }
            },
            "JupyterLabAppImageConfig": {
                "ContainerConfig": {
                    "ContainerEntrypoint": [
                        "jupyter-lab"
                    ]
                }
            }
        },

aws sagemaker describe-domain
{
    "DomainArn": "arn:aws:sagemaker:some_domain_arn",
    "DomainId": "d-mmo40dnf710s",
    "DomainName": "sagemaker-domain",
    "HomeEfsFileSystemId": "fs-",
    "SingleSignOnManagedApplicationInstanceId": "ins-",
    "SingleSignOnApplicationArn": "arn:aws:sso::application/",
    "Status": "InService",
    "CreationTime": 1707688181.443,
    "LastModifiedTime": 1707913859.291,
    "AuthMode": "SSO",
    "DefaultUserSettings": {
        "ExecutionRole": "arn:aws:iam::some_role_arn",
        "SecurityGroups": [
            "sg-"
        ],
        "JupyterServerAppSettings": {
            "LifecycleConfigArns": []
        },
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "prod-sagemaker-image",
                    "ImageVersionNumber": 1,
                    "AppImageConfigName": "sagemaker-app-image-config"
                }
            ],
            "LifecycleConfigArns": []
        },
        "CodeEditorAppSettings": {
            "LifecycleConfigArns": []
        },
        "JupyterLabAppSettings": {
            "DefaultResourceSpec": {
                "InstanceType": "ml.t3.medium"
            },
            "CustomImages": [
                {
                    "ImageName": "prod-sagemaker-image",
                    "ImageVersionNumber": 1,
                    "AppImageConfigName": "sagemaker-app-image-config"
                }
            ],
            "LifecycleConfigArns": [
                "arn:aws:sagemaker:lifecycle_arn"
            ]
        },
        "SpaceStorageSettings": {
            "DefaultEbsStorageSettings": {
                "DefaultEbsVolumeSizeInGb": 5,
                "MaximumEbsVolumeSizeInGb": 100
            }
        },
        "DefaultLandingUri": "studio::",
        "StudioWebPortal": "ENABLED"
    },
    "DomainSettings": {
        "SecurityGroupIds": [
            "sg-"
        ],
        "DockerSettings": {
            "EnableDockerAccess": "ENABLED",
            "VpcOnlyTrustedAccounts": []
        }
    },
    "AppNetworkAccessType": "VpcOnly",
    "SubnetIds": [
        "subnet-",
        "subnet-"
    ],
    "VpcId": "vpc-",
    "AppSecurityGroupManagement": "Customer",
    "DefaultSpaceSettings": {
        "ExecutionRole": "arn:aws:iam::some_role_arn",
        "SecurityGroups": [
            "sg-"
        ],
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "SageMakerImageArn": "arn:aws:sagemaker:eu-north-1:243637512696:image/jupyter-server-3",
                "InstanceType": "system"
            }
        }
    }
}

sjcahill-fcc commented 6 months ago

Our team has struggled with this as well.

I tried my best to reproduce your image based on the Dockerfile and env.yml and was able to get it to work.

The main difference is that instead of relying on the app-image-config property:

    "JupyterLabAppImageConfig": {
        "ContainerConfig": {
            "ContainerEntrypoint": [
                "jupyter-lab"
            ]
        }
    }

we define the ENTRYPOINT and CMD in our Dockerfile directly, in accordance with https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl-image-specifications.html.

This was because we had a hard time getting the "ContainerEntrypoint" to work.

Below is the Dockerfile I used (the micromamba config is due to our proxy):

FROM --platform=linux/amd64 public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu

USER $ROOT
RUN apt-get clean
# dependencies for building python and having opencv
RUN apt-get update && \
  apt-get install -y gcc g++ python3-dev ffmpeg libsm6 libxext6 && \
  rm -rf /var/lib/apt/lists/* && \
  apt-get clean

USER $MAMBA_USER
# copy the environment.yml file into the container
COPY --chown=$MAMBA_USER:$MAMBA_USER env_help.yml /tmp/environment.yml

RUN micromamba config prepend channels "CONDA-FORGE-PROXY" && \
  micromamba config prepend channels "CONDA-PROXY" && \
  micromamba config set channel_alias "CONDA-PROXY" && \
  micromamba config set channel_priority flexible && \
  micromamba config set pip_interop_enabled True && \
  micromamba config set ssl_verify /etc/ssl/certs/ca-certificates.crt

# Use micromamba to install the dependencies from the environment.yml file
RUN micromamba install -y -n base -f /tmp/environment.yml && \
  micromamba clean --all --yes

ENTRYPOINT ["jupyter-lab"]
CMD ["--ServerApp.ip=0.0.0.0", "--ServerApp.port=8888", "--ServerApp.allow_origin=*", "--ServerApp.token=''", "--ServerApp.base_url=/jupyterlab/default"]  

Your logs suggest that the CMD portion of this is missing, since you do not get the last two of these log lines:

  2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] Serving notebooks from local directory: /home/sagemaker-user
  2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] Jupyter Server 2.10.0 is running at:
  2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] http://default:8888/jupyterlab/default/lab
  2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.894 ServerApp] http://127.0.0.1:8888/jupyterlab/default/lab
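To make the role of CMD concrete, here is a small Python sketch of how Docker combines an exec-form ENTRYPOINT with an exec-form CMD: the CMD elements are appended to the ENTRYPOINT as arguments. The helper names `container_argv` and `server_option` are hypothetical, and the sketch says nothing about how SageMaker itself applies "ContainerEntrypoint"; it only shows what the server receives in each case.

```python
ENTRYPOINT = ["jupyter-lab"]
CMD = [
    "--ServerApp.ip=0.0.0.0",
    "--ServerApp.port=8888",
    "--ServerApp.allow_origin=*",
    "--ServerApp.token=''",
    "--ServerApp.base_url=/jupyterlab/default",
]


def container_argv(entrypoint, cmd):
    """Exec-form ENTRYPOINT and CMD are concatenated into one argv."""
    return list(entrypoint) + list(cmd)


def server_option(argv, name):
    """Return the value of --ServerApp.<name>=... in argv, or None if absent."""
    prefix = f"--ServerApp.{name}="
    for arg in argv:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


# With the CMD from the Dockerfile above, the base URL is set explicitly.
print(server_option(container_argv(ENTRYPOINT, CMD), "base_url"))
# prints /jupyterlab/default

# With only the entrypoint and no arguments, nothing sets the base URL.
print(server_option(container_argv(ENTRYPOINT, []), "base_url"))
# prints None
```

The first case matches the working logs, where both URLs end in /jupyterlab/default/lab on port 8888.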

NOTE:

I pushed the image to ECR and then just used the console to create and attach the image to the domain.

We use CDK to do our actual deployments.

Also, your app image config must still include at least an empty {} for "JupyterLabAppImageConfig", even if you stop using it to set the entrypoint.
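For anyone scripting this instead of using the console, a minimal Python sketch of such a request follows. It assumes boto3 is installed, AWS credentials are configured, and that the SageMaker client's `create_app_image_config` call accepts a `JupyterLabAppImageConfig` parameter; the helper names are hypothetical, and the config name is the one from the earlier post.

```python
def jupyterlab_app_image_config_request(name):
    """Build request arguments with an empty JupyterLabAppImageConfig.

    The entrypoint and arguments are left to the image's own
    ENTRYPOINT and CMD rather than being set here.
    """
    return {
        "AppImageConfigName": name,
        "JupyterLabAppImageConfig": {},
    }


def create_app_image_config(name):
    """Send the request; needs boto3 and AWS credentials (assumption)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("sagemaker")
    return client.create_app_image_config(
        **jupyterlab_app_image_config_request(name)
    )


request = jupyterlab_app_image_config_request("sagemaker-app-image-config")
print(request["JupyterLabAppImageConfig"])  # prints {}
```

Calling `create_app_image_config("sagemaker-app-image-config")` would then submit it; that step is not exercised here because it needs a live AWS account.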