aws-samples / amazon-sagemaker-studio-package-management


How to manage Python packages in Amazon SageMaker Studio notebooks

This repository presents hands-on samples of recommended practices for managing Python packages and package versions in Amazon SageMaker Studio notebooks.

For more details, refer to the related blog post Four approaches to manage Python packages in Amazon SageMaker Studio notebooks on the AWS Machine Learning Blog.

You have the following options for installing packages and creating virtual environments in Studio:

  1. Use a SageMaker custom app image
  2. Use Studio notebook lifecycle configurations
  3. Use Studio's EFS to persist Conda environments
  4. Use pip install

Studio notebooks run in a Docker container, while SageMaker Notebook instances are hosted on EC2 instances. Because of this difference, there are some specifics in how you create and manage Python virtual environments in Studio notebooks, for example the usage of Conda environments or the persistence of ML development environments between kernel restarts.

Conda doesn't work well within a Docker container; for example, see the blog post Activating a Conda environment in your Dockerfile.

The following sections give a rundown of each of the four recommended package management options.

How to run the notebooks

SageMaker custom app image

A SageMaker image or app image is a Docker container that identifies the kernels, language packages, and other dependencies required to run a Jupyter notebook in Studio. You use these images to create environments that you then run Jupyter notebooks on. Amazon SageMaker provides many built-in images for you to use.

If you need different functionality and packages, you can bring your own custom images to Studio (BYOI). You can create app images and image versions, and attach image versions to your domain, using the SageMaker control panel, the AWS SDK for Python (Boto3), and the AWS Command Line Interface (AWS CLI).

The main benefit is that all packages are ready to use immediately because they are already installed in the image. You can implement a CI/CD pipeline to produce custom images and enforce your organization-specific guardrails and governance processes.

The provided notebook implements an image creation process for Conda-based environments.
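As an illustration, a custom app image for a Conda-based environment can be defined with a Dockerfile along the following lines. This is a minimal sketch, not the Dockerfile from this repository: the base image, package list, and user settings are assumptions, and Studio has additional image requirements (for example, a discoverable kernel and a non-root default user) documented in the SageMaker custom image specifications.

```dockerfile
# Illustrative Dockerfile for a Conda-based Studio custom image.
FROM continuumio/miniconda3:latest

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

# Studio runs custom images as a non-root user (UID 1000, GID 100 by default).
RUN useradd --create-home --shell /bin/bash --uid ${NB_UID} --gid ${NB_GID} ${NB_USER}

# Install ipykernel (required so Studio can discover the kernel)
# plus the packages your team needs; the list below is illustrative.
RUN conda install --yes python=3.10 ipykernel pandas scikit-learn && \
    conda clean --all --force-pkgs-dirs --yes

USER ${NB_UID}
```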

Refer to these sample notebooks for more details on custom app implementation.

You can use the Studio image build CLI to automate the process of app image creation and deployment.
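A possible invocation of the image build CLI is sketched below; the repository name and tag are illustrative, and the command assumes you have the required IAM permissions, since the CLI delegates the build to AWS CodeBuild and pushes the result to Amazon ECR:

```shell
# Install the Studio image build CLI; it runs the Docker build in
# AWS CodeBuild, so no local Docker daemon is needed inside Studio.
pip install sagemaker-studio-image-build

# Build the Dockerfile in the current directory and push the image
# to ECR. The repository name and tag below are illustrative.
sm-docker build . --repository smstudio-custom:conda-env
```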

Studio notebook lifecycle configurations

Studio lifecycle configurations define a startup script that is executed at each restart of the kernel gateway application and can install the required packages. The main benefit is that a data scientist can choose which script to execute to customize the notebook container with new packages, without rebuilding the container and, in most cases, without requiring a custom image at all, because the built-in images can be customized. The main limitation is that installing packages at each restart might be slow; it might even time out. You also need to define a process that lets data scientists customize these scripts, and you have an overhead for managing the lifecycle scripts at scale.

Refer to these lifecycle configuration examples for more details.
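To make the flow above concrete, a minimal setup might look like the following. This is a hedged sketch, not the scripts from this repository: the script contents, configuration name, and package list are assumptions, and `base64 -w 0` is the GNU coreutils form (macOS uses `base64` without `-w`):

```shell
# on-start.sh -- illustrative startup script that installs packages
# every time the kernel gateway app starts.
cat > on-start.sh <<'EOF'
#!/bin/bash
set -eux
pip install --upgrade pandas scikit-learn
EOF

# Register the script as a Studio lifecycle configuration; the content
# must be passed base64-encoded. Attach it to your domain or user
# profile afterwards so it appears in Studio.
aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name install-packages \
    --studio-lifecycle-config-content "$(base64 -w 0 on-start.sh)" \
    --studio-lifecycle-config-app-type KernelGateway
```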

Persist Conda environments to Studio's EFS

A SageMaker domain and Studio use an Amazon EFS volume as a persistent storage layer. You can save your Conda environments on the EFS volume, and these environments persist between kernel, app, or Studio restarts. Studio automatically picks up all such environments as KernelGateway kernels. This is a straightforward process for a data scientist, but there is a material delay (about one minute) before the environment appears in the list of selectable kernels. There also might be issues with using environments on kernel gateway apps that have different compute requirements, for example a CPU-based environment on a GPU-based app.

Refer to this example for detailed instructions.
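The EFS-backed approach can be sketched with the commands below, run from a Studio terminal. The environment path, name, and package list are illustrative assumptions; the key points are that the environment lives under the home directory (which is on EFS) and that it includes `ipykernel` so Studio can discover it:

```shell
# Create a Conda environment under the user's home directory, which is
# backed by the domain's EFS volume and therefore survives app restarts.
conda create --yes \
    --prefix /home/sagemaker-user/.conda/envs/my-persistent-env \
    python=3.10 ipykernel pandas

# Optionally register the directory so Conda lists the environment by
# name rather than by full path.
conda config --add envs_dirs /home/sagemaker-user/.conda/envs
```

After a short delay, Studio lists the environment as a selectable kernel for the kernel gateway app.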

pip install

You can install packages directly into the default Conda environment or into the default Python environment. Create a setup.py or requirements.txt file with all required dependencies and run %pip install . or %pip install -r requirements.txt, respectively. You have to run this command every time you restart the kernel or re-create an app, so this approach is recommended only for ad hoc experimentation, because these environments are not persistent. Note that some enterprise environments block all egress and ingress internet connections, in which case you cannot use pip install to pull Python packages from public repositories.
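A minimal sketch of this flow is shown below; the package pins are illustrative. In a notebook cell you would prefix the install command with `%` (the `%pip` magic) so it targets the environment backing the active kernel:

```shell
# requirements.txt -- illustrative, pin versions for reproducibility.
cat > requirements.txt <<'EOF'
pandas==2.0.3
scikit-learn==1.3.0
EOF

# From a Studio terminal (in a notebook cell, use: %pip install -r requirements.txt)
pip install -r requirements.txt
```

Because the installation lands in the non-persistent container storage, it must be repeated after every kernel restart or app re-creation.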

Resources

QR code for this repository

You can use the following QR code to link to this repository.

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0