aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Support CodeArtifact repositories for installing Python packages #85

Closed setu4993 closed 11 months ago

setu4993 commented 3 years ago

Describe the feature you'd like

We'd like the ability to install internal Python packages via CodeArtifact instead of just PyPI.

How would this feature be used? Please describe.

To install Internal Python packages that cannot be published publicly to PyPI, in SageMaker serving instances. Adding support for CodeArtifact would integrate it better with other AWS services.

CodeArtifact tokens are valid for at most 12 hours, so if we create credentials and pass them in during model package creation, they'd likely expire before the endpoint is refreshed, or before a new batch transform job runs more than 12 hours after model package creation.

(This applies more to inference jobs like endpoints and batch transforms because dependencies get installed at run-time, not build time.)

This is not as much of a concern for SageMaker Training Jobs, since we can pass credentials and jobs start up almost immediately (though spot instance jobs that wait more than 12 hours to start would still hit this). But our use case is specifically for inference-related services.
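To make the expiry concern concrete, here is a minimal sketch. The helper names are hypothetical; the index URL shape follows CodeArtifact's documented pip endpoint, and `fetch_token` shows how a fresh token could be pulled at container start-up (via boto3's `codeartifact` `get_authorization_token`) instead of being baked in at model package creation time:

```python
from datetime import datetime

def codeartifact_index_url(domain: str, account: str, region: str,
                           repository: str, token: str) -> str:
    # Authenticated pip index URL for a CodeArtifact repository; verify the
    # shape against your own domain/region before relying on it.
    return (
        f"https://aws:{token}@{domain}-{account}.d.codeartifact."
        f"{region}.amazonaws.com/pypi/{repository}/simple/"
    )

def token_expired(expiration: datetime, now: datetime) -> bool:
    # The token from get_authorization_token carries an expiration timestamp
    # at most 12 hours out; past that, installs against the URL start failing.
    return now >= expiration

def fetch_token(domain: str, account: str):
    # Sketch only: fetch a fresh token when the container boots, rather than
    # passing one in at packaging time (requires AWS credentials at run-time).
    import boto3  # local import keeps the pure helpers above testable offline
    client = boto3.client("codeartifact")
    resp = client.get_authorization_token(domain=domain, domainOwner=account)
    return resp["authorizationToken"], resp["expiration"]
```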

A solution could be to add an AWS CodeArtifact login step, with something like the below, before _install_requirements here:

import logging
import subprocess
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class CodeArtifactConfig:
    domain: str
    account: str  # kept as str: subprocess arguments must be strings
    repository: str

def _code_artifact_login(code_artifact_config: CodeArtifactConfig):
    logger.info("logging into CodeArtifact...")
    code_artifact_login_cmd = [
        "aws",
        "codeartifact",
        "login",
        "--tool",
        "pip",
        "--domain",
        code_artifact_config.domain,
        "--domain-owner",
        code_artifact_config.account,
        "--repository",
        code_artifact_config.repository,
    ]

    try:
        # configures pip's index URL to point at the CodeArtifact repository
        subprocess.check_call(code_artifact_login_cmd)
    except subprocess.CalledProcessError:
        logger.error("failed to login to CodeArtifact, exiting")
        raise ValueError("failed to login to CodeArtifact, exiting")

And add before line 79:

    if code_artifact_config:
        _code_artifact_login(code_artifact_config)

Describe alternatives you've considered

Currently, private packages can either be served via an external service like Artifactory / Gemfury (by adding --extra-index-url <URL> to requirements.txt), or by relative imports and dependency injection during packaging.
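For reference, the external-index alternative is declared directly in requirements.txt; the URL and package name below are placeholders:

```text
--extra-index-url https://pypi.example.com/simple/
our-internal-package==1.2.3
numpy==1.26.0
```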

Another alternative we've considered is forking the repo and adding the above mentioned changes in a private fork and using that for our SageMaker model deploys.

Additional context

I'm happy to take a stab at implementing this if there's interest.

humanzz commented 1 year ago

This is a feature I'm looking for as well.

setu4993 commented 1 year ago

I waited for this for a while, and reached out to our AWS reps multiple times over the years, but it was clear that this wasn't a priority for the SageMaker team.

A couple quarters ago, we implemented a workaround for this that is working nicely for us. We have a tiny wrapper on top of the SageMaker Python SDK to package models, that:

  1. Parses the requirements.txt file.
  2. For any packages that are internal (hard-coded list), downloads the tarball / wheel for those from AWS CodeArtifact and places them in a directory next to our code (<package_directory>).
  3. Sets the PIP_FIND_LINKS environment variable to the folder in which it'd be available within the container (/opt/ml/model/code/<package_directory>).
  4. Invokes the SageMaker Python SDK's .deploy(...) method normally.

We do this all via CI, but it's doable even outside, with a tiny step before invoking the SageMaker Python SDK.
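The wrapper steps above can be sketched roughly as follows. The package names, hard-coded internal list, and directory layout are hypothetical, and the pip download step assumes pip has already been logged in to CodeArtifact in CI:

```python
import os
import subprocess
from pathlib import Path

INTERNAL_PACKAGES = {"our-models", "our-utils"}  # hypothetical hard-coded list

def split_requirements(lines, internal=INTERNAL_PACKAGES):
    """Partition requirements.txt lines into (internal, public) by name."""
    internal_reqs, public_reqs = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # crude name extraction: cut at markers and version specifiers
        name = line.split(";")[0]
        for sep in ("==", ">=", "<=", "~=", ">", "<", "["):
            name = name.split(sep)[0]
        name = name.strip()
        (internal_reqs if name in internal else public_reqs).append(line)
    return internal_reqs, public_reqs

def stage_internal_wheels(internal_reqs, dest: Path):
    """Download internal wheels next to the code so pip finds them at serve time."""
    dest.mkdir(parents=True, exist_ok=True)
    for req in internal_reqs:
        # assumes `aws codeartifact login --tool pip ...` already ran in CI
        subprocess.check_call(
            ["pip", "download", "--no-deps", "--dest", str(dest), req]
        )
    # Inside the container, the same directory appears under
    # /opt/ml/model/code/, so point pip at it via PIP_FIND_LINKS.
    os.environ["PIP_FIND_LINKS"] = f"/opt/ml/model/code/{dest.name}"
```

After this, invoking .deploy(...) proceeds as usual; at serve time, pip resolves the internal packages from the staged directory instead of reaching out to CodeArtifact.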

Hope that helps.

humanzz commented 1 year ago

This is interesting. I wonder, though: using the code snippets above, why haven't you simply, in your entry point script,

  1. logged in to CodeArtifact to configure pip, and
  2. pip installed your own dependencies normally?
setu4993 commented 1 year ago

Mostly because we didn't want to maintain our own copy of either a forked version of this package or repackaged Docker base images.

That adds a bunch of overhead for us across various package families and versions.