conda-env-kernel-image example is broken

tom-mcclintock commented 2 years ago

After following the steps listed here exactly, I began a SageMaker Studio session. After selecting the custom image and starting a console, I received the following error:

Invalid response: 404 Not Found
Kernel with name [myenv] does not exist in image [arn:aws:sagemaker:REGION:ACCOUNT_ID:image/conda-test-kernel] on the KernelGateway App [conda-test-kernel-ml-t3-medium-HASH]. To make the kernel available, either update your AppImageConfig to have same kernel name as available in the image or update your SageMaker Image to have the kernel with the same name as specified in AppImageConfig. You can use https://github.com/aws-samples/sagemaker-studio-custom-image-samples/blob/main/DEVELOPMENT.md#local-testing for testing your image locally.

The Dockerfile and environment.yml are identical to the example. Here is the app-image-config-input.json file:

{
    "AppImageConfigName": "myenv-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "myenv",
                "DisplayName": "Python [conda env: myenv]"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/home/sagemaker-user",
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}

And here is the anonymized create-domain-input.json contents:

{
    "DomainId": "d-xxxxxxxxx",
    "DefaultUserSettings": {
        "ExecutionRole": "ROLE_ARN",
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "conda-test-kernel",
                    "AppImageConfigName": "myenv-config"
                }
            ]
        }
    }
}

I used IMAGE_NAME=conda-test-kernel throughout. Other things to note:

- aws sagemaker describe-image-version shows "ImageVersionStatus": "CREATED"
- aws sagemaker describe-app-image-config gives back all the expected information

I believe the issue is that conda doesn't automatically follow the kernelspec. This quirk needs to be covered in the README for this example. Unfortunately, I haven't figured out the solution yet. Any help is appreciated.
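One way to narrow down the mismatch that the error message describes is to compare the KernelSpecs names in the AppImageConfig with the kernels the image actually reports. The following is a minimal sketch, not part of the sample: it assumes you have captured the output of `jupyter kernelspec list --json` from inside the container, the helper name `missing_kernels` is hypothetical, and the `resource_dir` path is only illustrative.

```python
import json


def missing_kernels(app_image_config: dict, kernelspec_list_json: str) -> list:
    """Return KernelSpec names from the AppImageConfig that the image does not provide."""
    # `jupyter kernelspec list --json` prints {"kernelspecs": {"<name>": {...}, ...}}
    available = set(json.loads(kernelspec_list_json).get("kernelspecs", {}))
    wanted = [
        spec["Name"]
        for spec in app_image_config["KernelGatewayImageConfig"]["KernelSpecs"]
    ]
    return [name for name in wanted if name not in available]


# Illustrative input: an image that only registers the default python3 kernel.
example_output = json.dumps(
    {"kernelspecs": {"python3": {"resource_dir": "/opt/conda/share/jupyter/kernels/python3"}}}
)
config = {
    "AppImageConfigName": "myenv-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [{"Name": "myenv", "DisplayName": "Python [conda env: myenv]"}]
    },
}
print(missing_kernels(config, example_output))  # ['myenv']
```

An empty list would mean every configured kernel name is present in the image, so the cause would have to be elsewhere.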

tday commented 2 years ago

I have a similar issue in my own conda container where the default conda env is always the base env, but I cannot switch to my conda env in the notebook.

!conda env list
# conda environments:
#
base                  *  /home/ubuntu/miniconda
pipeline                 /home/ubuntu/miniconda/envs/pipeline

!conda activate pipeline

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.
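For context, each `!` command in a notebook runs in its own short-lived subshell, so even a correctly initialized `conda activate` would not carry over to the next cell. A possible workaround is `conda run -n <env> <command>`, which runs a single command inside a named env without activating it in the shell. This is only a sketch: `conda_run_command` is a hypothetical helper, it assumes `conda` is on PATH in the image, and it has not been verified against this SageMaker Studio setup.

```python
import shlex


def conda_run_command(env_name: str, command: list) -> list:
    """Build an argv that runs `command` inside the conda env `env_name`."""
    return ["conda", "run", "-n", env_name, *command]


argv = conda_run_command("pipeline", ["python", "-c", "import sys; print(sys.prefix)"])
print(shlex.join(argv))  # the equivalent shell command line

# To actually execute it inside the container:
# import subprocess
# subprocess.run(argv, check=True)
```

If the env is picked up correctly, the printed prefix should point at the `pipeline` env rather than the base env.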

lilitangsonos commented 2 years ago

I am also running into this issue. Were you able to fix it?

Zirkonium88 commented 2 years ago

I was able to make use of the example. However, I mounted from /root instead, as I did not see any users within these images. This is also the difference from @tom-mcclintock's setup.

config_app = {
    "AppImageConfigName": "conda-env-kernel-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "conda-env-venv-py",
                "DisplayName": "Python [conda env: venv]"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/root",  # <-- mounted from /root instead of /home/sagemaker-user
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}

My domain update JSON looks like this:

config_domain = {
    "DomainId": domain_id,
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "conda",
                    "AppImageConfigName": "conda-env-kernel-config",
                }
            ]
        }
    }
}

With that, I'm able to import packages within SageMaker Studio. In general, the Dockerfiles are not in line with Docker best practices.
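For readers who want to apply the two dictionaries above from Python, the sketch below shows the order of calls, assuming they are passed as keyword arguments to the boto3 SageMaker client methods `create_app_image_config` and `update_domain`. The `RecordingClient` stub and the `register_custom_image` helper are hypothetical and exist only so the example runs as a dry run without AWS credentials; the domain ID is a placeholder.

```python
class RecordingClient:
    """Stand-in for boto3.client("sagemaker") that only records calls (dry run)."""

    def __init__(self):
        self.calls = []

    def create_app_image_config(self, **kwargs):
        self.calls.append(("create_app_image_config", kwargs))

    def update_domain(self, **kwargs):
        self.calls.append(("update_domain", kwargs))


def register_custom_image(sm_client, config_app, config_domain):
    """Create the AppImageConfig first, then attach the image to the domain."""
    sm_client.create_app_image_config(**config_app)
    sm_client.update_domain(**config_domain)


config_app = {
    "AppImageConfigName": "conda-env-kernel-config",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {"Name": "conda-env-venv-py", "DisplayName": "Python [conda env: venv]"}
        ],
        "FileSystemConfig": {"MountPath": "/root", "DefaultUid": 0, "DefaultGid": 0},
    },
}
config_domain = {
    "DomainId": "d-xxxxxxxxx",  # placeholder, replace with your own domain ID
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {"ImageName": "conda", "AppImageConfigName": "conda-env-kernel-config"}
            ]
        }
    },
}

client = RecordingClient()
register_custom_image(client, config_app, config_domain)
for name, kwargs in client.calls:
    print(name, sorted(kwargs))
```

To run it for real, replace `RecordingClient()` with `boto3.client("sagemaker")`, which requires the boto3 package and suitable credentials.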

athewsey commented 1 year ago

Some observations from testing last/this week:

- The updated sample as of #26 (use base conda env name, python3 kernel name, /root mount point, 0:0 UID:GID) does seem to work for me, so maybe this particular issue is now resolved?
- But I noticed when setting Images up through the SageMaker console UI, the settings (mount point, UID, kernel) sometimes revert between steps (e.g. from creating the image, to the image version, to attaching the image version to the domain). ⚠️ Suggest expanding the collapsed sections at each point and checking they still show what you expect!

Additional experiments/notes:

Kernel auto-detection

Auto-detection of non-base kernel envs does seem to work for me as the sample README describes: e.g. if I create a conda env mycoolenv in the image, then I can set up the SageMaker KernelSpec Name conda-env-mycoolenv-py. I logged some feedback on the kernel spec doc page to suggest clarifying this naming in the "Kernel discovery" section.
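The naming pattern can be written down as a tiny helper. This is a sketch inferred only from the names seen in this thread (`conda-env-mycoolenv-py`, `conda-env-venv-py`, and `python3` for the base env in the updated sample), not from an official specification; `studio_kernelspec_name` is a hypothetical name.

```python
def studio_kernelspec_name(conda_env: str) -> str:
    """Guess the KernelSpec Name that Studio auto-detects for a conda env."""
    if conda_env == "base":
        # The updated sample pairs the base env with the plain python3 kernel.
        return "python3"
    # Non-base envs appear as conda-env-<env name>-py in this thread.
    return f"conda-env-{conda_env}-py"


for env in ("base", "mycoolenv", "venv"):
    print(env, "->", studio_kernelspec_name(env))
```

Under this assumption, an env called `myenv` would need the KernelSpec Name `conda-env-myenv-py` rather than `myenv`.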

I find we can also manually register conda envs as notebook kernels in the Dockerfile using something like the below - but it's a bit pointless because I just end up with 2 kernels visible in Studio: The manually created one and the auto-detected one.

RUN bash -c 'source activate mycoolenv && python -m ipykernel install --name mycoolenv --display-name "Conda mycoolenv"'

I do see the same issue as @tday that, when using this setup, image terminals are unable to switch conda envs, which I think is related to the user situation below:

Using non-root user / switching envs in terminal

- Using a non-root user adds complexity to the setup, so you need to check what security objectives you're actually trying to deliver and whether it helps. For example, should notebook users still be able to edit the environments and pip install / conda install extra packages ad hoc? Given the architecture of Studio and the boundaries in the shared security responsibility model, is the extra isolation of running non-root helpful?
- The default mount points /home/sagemaker-user and /root replace user home directories in the container image with whatever's in the Jupyter working directory. In some cases this might be useful (e.g. propagating settings correctly between server and kernel), but it does mean any user settings files defined in the kernel container will get obliterated.
  - For example, using ipykernel install --user ... or conda install --prefix ... to install kernels and conda envs under your user's home folder is no good if the entire home folder gets substituted at run-time.
  - It doesn't really surprise me that conda env switching is broken by default if things like ~/.bashrc, .bash_profile, etc. get obliterated too. Probably there's some way of getting this working, but I haven't dived deep yet.
- I think it might cause problems to use the 1000:100 user without actually creating it in the image (like done here). Maybe that was causing the earlier issues with not being able to auto-discover conda kernels?
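To make the mount-point behaviour concrete, the sketch below checks whether a path baked into the image would be hidden once the configured MountPath is mounted over it. It is a plain path comparison under the assumption that the mount replaces the whole directory, as described above; `hidden_by_mount` is a hypothetical helper and the example paths are illustrative.

```python
from pathlib import PurePosixPath


def hidden_by_mount(path: str, mount_path: str) -> bool:
    """True if `path` is the mount point itself or lives anywhere beneath it."""
    p, m = PurePosixPath(path), PurePosixPath(mount_path)
    return p == m or m in p.parents


checks = [
    ("/root/.bashrc", "/root"),
    ("/home/sagemaker-user/.local/share/jupyter/kernels/mycoolenv", "/home/sagemaker-user"),
    ("/opt/conda/envs/mycoolenv", "/root"),
]
for path, mount in checks:
    print(f"{path} hidden by {mount}: {hidden_by_mount(path, mount)}")
```

Anything installed outside the mount point, such as envs under /opt/conda, is unaffected by the substitution.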

I did manage to get a notebook-user-editable (i.e. can %pip install) custom image working using a non-root user and a non-base conda env, by making sure my 1000:100 user got permissions to edit the /opt/conda folder.
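A quick way to confirm the permissions from inside a running kernel is to check whether the current user can write to the conda prefix. This is a minimal sketch: `can_modify` is a hypothetical helper, /opt/conda is the prefix mentioned above, and the check only covers the top-level directory, not every file beneath it.

```python
import os


def can_modify(prefix: str) -> bool:
    """True if `prefix` is a directory the current user may write into and traverse."""
    return os.path.isdir(prefix) and os.access(prefix, os.W_OK | os.X_OK)


# Run inside the kernel container to see which user is active and what it may edit.
print("uid:gid =", f"{os.getuid()}:{os.getgid()}")
print("/opt/conda writable:", can_modify("/opt/conda"))
```

If this reports `False` for the 1000:100 user, `%pip install` into that env is likely to fail with a permissions error.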

Next steps

Maybe we could have two samples that separately capture a simple root + base configuration and a more complex non-root/non-base option? It seems to me that diving straight into the latter would over-complicate the initial getting started. I think the initial bug in this issue seems resolved.

mkaja commented 4 months ago

docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
[+] Building 495.5s (8/8) FINISHED    docker:desktop-linux
 => [internal] load build definition from Dockerfile    0.0s
 => => transferring dockerfile: 143B    0.0s
 => [internal] load metadata for docker.io/continuumio/miniconda3:4.9.2    0.9s
 => [auth] continuumio/miniconda3:pull token for registry-1.docker.io    0.0s
 => [internal] load .dockerignore    0.0s
 => => transferring context: 2B    0.0s
 => [internal] load build context    0.0s
 => => transferring context: 36B    0.0s
 => [1/3] FROM docker.io/continuumio/miniconda3:4.9.2@sha256:7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179    0.0s
 => CACHED [2/3] COPY environment.yml .    0.0s
 => ERROR [3/3] RUN conda env update -f environment.yml --prune    494.5s

[3/3] RUN conda env update -f environment.yml --prune:
0.811 Collecting package metadata (repodata.json): ...working... done
77.96 Solving environment: ...working... Killed

Dockerfile:4

   2 |
   3 |     COPY environment.yml .
   4 | >>> RUN conda env update -f environment.yml --prune
   5 |

ERROR: failed to solve: process "/bin/sh -c conda env update -f environment.yml --prune" did not complete successfully: exit code: 137
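A note on reading the exit code: by shell convention, values above 128 mean the process was terminated by signal number `code - 128`, so 137 corresponds to signal 9. The sketch below decodes this with the standard library; `describe_exit_code` is a hypothetical helper. A SIGKILL together with the "Killed" line during "Solving environment" is commonly, though not necessarily, a sign that the build ran out of memory, so raising the memory available to Docker may be worth trying.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a short human-readable description."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (unknown signal {code - 128})"
    return f"exit code {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```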
