aws-samples / sagemaker-studio-custom-image-samples

This repository contains examples of Docker images that can be used as custom images for KernelGateway Apps in SageMaker Studio.
MIT No Attribution

conda-env-kernel-image example is broken #22

Open tom-mcclintock opened 2 years ago

tom-mcclintock commented 2 years ago

After following the steps listed here exactly, I began a SageMaker Studio session. After selecting the custom image and starting a console, I received the following error:

Invalid response: 404 Not Found
Kernel with name [myenv] does not exist in image [arn:aws:sagemaker:REGION:ACCOUNT_ID:image/conda-test-kernel] on the KernelGateway App [conda-test-kernel-ml-t3-medium-HASH]. To make the kernel available, either update your AppImageConfig to have same kernel name as available in the image or update your SageMaker Image to have the kernel with the same name as specified in AppImageConfig. You can use https://github.com/aws-samples/sagemaker-studio-custom-image-samples/blob/main/DEVELOPMENT.md#local-testing for testing your image locally.

The Dockerfile and environment.yml are identical to the example. Here is the app-image-config-input.json file:

{
    "AppImageConfigName": "myenv-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "myenv",
                "DisplayName": "Python [conda env: myenv]"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/home/sagemaker-user",
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}

And here is the anonymized create-domain-input.json contents:

{
    "DomainId": "d-xxxxxxxxx",
    "DefaultUserSettings": {
        "ExecutionRole": "ROLE_ARN",
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "conda-test-kernel",
                    "AppImageConfigName": "myenv-config"
                }
            ]
        }
    }
}

I used IMAGE_NAME=conda-test-kernel throughout. Other things to note:

I believe the issue is that conda doesn't automatically follow the kernelspec. This quirk needs to be covered in the README for this example. Unfortunately, I haven't figured out the solution yet. Any help is appreciated.
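One way to narrow this down is to compare the kernel names the image actually exposes with the names in the AppImageConfig. The sketch below is only an illustration: it assumes you have captured the output of `jupyter kernelspec list --json` from inside the container, and the helper name `missing_kernels` and the sample listing are hypothetical. Kernels that SageMaker discovers from conda envs may not show up in that listing, so treat a mismatch as a hint rather than proof.

```python
import json
from typing import List


def missing_kernels(kernelspec_json: str, app_image_config: dict) -> List[str]:
    """Return KernelSpec names from the AppImageConfig that the listing lacks."""
    available = set(json.loads(kernelspec_json).get("kernelspecs", {}))
    wanted = [
        spec["Name"]
        for spec in app_image_config["KernelGatewayImageConfig"]["KernelSpecs"]
    ]
    return [name for name in wanted if name not in available]


# Hypothetical output of `jupyter kernelspec list --json` inside the image.
sample_listing = json.dumps(
    {"kernelspecs": {"python3": {"resource_dir": "/opt/conda/share/jupyter/kernels/python3"}}}
)

# Same KernelSpecs as in app-image-config-input.json above.
app_image_config = {
    "AppImageConfigName": "myenv-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {"Name": "myenv", "DisplayName": "Python [conda env: myenv]"}
        ]
    },
}

print(missing_kernels(sample_listing, app_image_config))  # ['myenv']
```

If the returned list is non-empty, either the image or the AppImageConfig needs to change so that the names agree, as the error message suggests.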

tday commented 2 years ago

I have a similar issue in my own conda container where the default conda env is always the base env, but I cannot switch to my conda env in the notebook.

!conda env list
# conda environments:
#
base                  *  /home/ubuntu/miniconda
pipeline                 /home/ubuntu/miniconda/envs/pipeline
!conda activate pipeline

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

lilitangsonos commented 2 years ago

I am also running into this issue. Were you able to fix it?
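A note on the `!conda activate` attempt: each `!` command in a notebook runs in its own subshell, so even a successful activation would not carry over to later cells. A possible workaround, assuming conda 4.6 or newer where `conda run` is available, is to run a command inside the env without activating it. This is a minimal sketch and the helper name `conda_run_command` is hypothetical. It runs a command in the env; it does not switch the env of the notebook kernel itself.

```python
import shlex
from typing import List


def conda_run_command(env_name: str, command: List[str]) -> List[str]:
    """Build a `conda run` invocation that executes `command` inside `env_name`."""
    return ["conda", "run", "-n", env_name, *command]


cmd = conda_run_command("pipeline", ["python", "--version"])
print(shlex.join(cmd))  # conda run -n pipeline python --version

# Inside the container you could execute it like this (not run here):
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True, check=False)
#   print(result.stdout or result.stderr)
```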

Zirkonium88 commented 2 years ago

I was able to make use of the example. However, I mounted from /root, as I did not see any users within these images. This is also the difference from @tom-mcclintock's setup.

config_app = {
    "AppImageConfigName": "conda-env-kernel-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "conda-env-venv-py",
                "DisplayName": "Python [conda env: venv]"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/root",  # <-- the change: /root instead of /home/sagemaker-user
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}

My domain update JSON looks like this:

config_domain = {
    "DomainId": domain_id,  # domain_id is set elsewhere to the Studio domain ID
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "conda",
                    "AppImageConfigName": "conda-env-kernel-config",
                }
            ]
        }
    }
}

With that, I'm able to import packages within SageMaker Studio. In general, the Dockerfiles are not in line with Docker best practices.
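For anyone applying dictionaries like these from Python, here is a minimal sketch of how they could be passed to the SageMaker API. It assumes a boto3 SageMaker client, which the comment above does not confirm was used, and the helper name `apply_custom_image_config` is hypothetical. `create_app_image_config` and `update_domain` are existing boto3 client methods; if the AppImageConfig already exists, `update_app_image_config` would be the call to use instead.

```python
def apply_custom_image_config(sm_client, config_app: dict, config_domain: dict) -> None:
    """Register the AppImageConfig, then attach the custom image to the domain."""
    # Creates the AppImageConfig (kernel specs and file system settings).
    sm_client.create_app_image_config(**config_app)
    # Points the domain's default user settings at the custom image.
    sm_client.update_domain(**config_domain)


# Usage with real AWS credentials (not executed here):
#   import boto3
#   apply_custom_image_config(boto3.client("sagemaker"), config_app, config_domain)
```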

athewsey commented 1 year ago

Some observations from testing last/this week:

Additional experiments/notes:

Kernel auto-detection

Auto-detection of non-base kernel envs does seem to work for me as the sample README describes. For example, if I create a conda env mycoolenv in the image, then I can set the SageMaker KernelSpec Name to conda-env-mycoolenv-py. I logged some feedback on the kernel spec doc page to suggest clarifying this naming in the "Kernel discovery" section.
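To make the naming concrete, here is a small sketch that builds a KernelSpecs entry from a conda env name. The `conda-env-<env>-py` pattern is only what was observed in this thread, not a documented contract I can point to, and the helper name `conda_kernel_spec` is hypothetical.

```python
def conda_kernel_spec(env_name: str) -> dict:
    """Build a KernelSpecs entry using the naming pattern observed for conda envs."""
    return {
        "Name": f"conda-env-{env_name}-py",
        "DisplayName": f"Python [conda env: {env_name}]",
    }


print(conda_kernel_spec("mycoolenv")["Name"])  # conda-env-mycoolenv-py
```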

I find we can also manually register conda envs as notebook kernels in the Dockerfile using something like the below - but it's a bit pointless because I just end up with 2 kernels visible in Studio: The manually created one and the auto-detected one.

RUN bash -c 'source activate mycoolenv && python -m ipykernel install --name mycoolenv --display-name "Conda mycoolenv"'

I do see the same issue as @tday that, when using this setup, image terminals are unable to switch conda envs, which I think is related to the user situation below:

Using non-root user / switching envs in terminal

I did manage to get a notebook-user-editable (i.e. can %pip install) custom image working using a non-root user and a non-base conda env, by making sure my 1000:100 user got permissions to edit the /opt/conda folder.
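A quick way to check from a notebook cell whether the kernel's user can modify the conda folder is sketched below. The helper name `can_modify` is hypothetical, and the check is only a rough indicator: it reports `True` for root regardless of ownership, and it does not inspect subfolders.

```python
import os


def can_modify(path: str) -> bool:
    """Return True if `path` is a directory the current user can enter and write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


# /opt/conda is the folder mentioned above; adjust if conda lives elsewhere.
print(can_modify("/opt/conda"))
```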

Next steps

Maybe we could have 2 samples that separately capture a simple root+base configuration and a more complex non-root/non-base option? It seems to me that diving straight into the latter would over-complicate the initial getting-started experience. For this issue, I think the initial bug itself seems resolved.

mkaja commented 1 month ago

docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
[+] Building 495.5s (8/8) FINISHED                                docker:desktop-linux
 => [internal] load build definition from Dockerfile                              0.0s
 => => transferring dockerfile: 143B                                              0.0s
 => [internal] load metadata for docker.io/continuumio/miniconda3:4.9.2           0.9s
 => [auth] continuumio/miniconda3:pull token for registry-1.docker.io             0.0s
 => [internal] load .dockerignore                                                 0.0s
 => => transferring context: 2B                                                   0.0s
 => [internal] load build context                                                 0.0s
 => => transferring context: 36B                                                  0.0s
 => [1/3] FROM docker.io/continuumio/miniconda3:4.9.2@sha256:7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179  0.0s
 => CACHED [2/3] COPY environment.yml .                                           0.0s
 => ERROR [3/3] RUN conda env update -f environment.yml --prune                 494.5s

[3/3] RUN conda env update -f environment.yml --prune:
0.811 Collecting package metadata (repodata.json): ...working... done
77.96 Solving environment: ...working... Killed

Dockerfile:4

   2 |
   3 |     COPY environment.yml .
   4 | >>> RUN conda env update -f environment.yml --prune
   5 |

ERROR: failed to solve: process "/bin/sh -c conda env update -f environment.yml --prune" did not complete successfully: exit code: 137
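
For reading the exit code: by shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9. Together with the "Killed" line during "Solving environment", that often points to the process being killed for using too much memory, although the log itself does not confirm the cause. The sketch below just decodes the code on Linux or macOS; the helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit status into a short description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # terminated by SIGKILL
```

If memory is the cause, raising the memory available to Docker Desktop would be one thing to try; that is an assumption, not something verified in this thread.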