aws / aws-step-functions-data-science-sdk-python

Step Functions Data Science SDK for building machine learning (ML) workflows and pipelines on AWS
Apache License 2.0

Make sagemaker an optional dependency #184

Open nicolaei opened 2 years ago

nicolaei commented 2 years ago

I'd like to see sagemaker marked as an optional dependency, as it is not always required.

Use Case

Not all users of this package need the sagemaker dependency, and it adds a lot of bloat to installations by forcing users to have gcc and g++ available just to build numpy and pandas.

Proposed Solution

Doing this is actually quite simple.

In setup.py, you'd move sagemaker out of install_requires and into extras_require:

setup(
    # [...]
    install_requires=[
        "boto3>=1.14.38",
        "pyyaml",
    ],
    extras_require={
        "sagemaker": ["sagemaker>=2.1.0"],
        "test": [],  # [...]
    },
    # [...]
)

In addition, the README should probably document how users install the extra dependency (pip install stepfunctions[sagemaker]).

Other

If this is implemented, the major version should probably be bumped, since it changes how the package is installed.



This is a 🚀 Feature Request

wong-a commented 2 years ago

Makes sense. We received a similar request in https://github.com/aws/aws-step-functions-data-science-sdk-python/issues/116

I agree this would be a breaking change. I'm not sure how many customers are using this SDK without sagemaker. So far, I've only seen this request from 3 individuals.

nicolaei commented 2 years ago

I did some testing, and it looks like some of the implicit imports in stepfunctions.steps would also need to be made conditional if this change is implemented, specifically the ones around sagemaker.
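
One way to handle that would be to guard the sagemaker imports in stepfunctions/steps/__init__.py, roughly like this (a minimal sketch of the pattern, not a tested patch; the exact names re-exported there would need to be checked against the module):

# stepfunctions/steps/__init__.py (sketch)
# Core states have no sagemaker dependency and stay as unconditional imports.
from stepfunctions.steps.states import Pass, Task, Chain

try:
    # Only expose the SageMaker steps when the optional [sagemaker] extra is installed.
    from stepfunctions.steps.sagemaker import TrainingStep, TransformStep
except ImportError:
    # Installed without the extra; the SageMaker-specific steps are simply unavailable.
    pass

Whether to silently skip the names or raise a helpful error on first use would be a design decision for the PR.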

If you want me to, I can look into it more and send a PR next week!


As a side note, here is a bit more info about my use case:

We're currently using Step Functions as a deployment pipeline inside our accounts. I'm doing a rewrite to make the pipeline more powerful by using this library to dynamically create state machines based on a configuration file for each repo.
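
To make that concrete, this kind of pipeline only needs the plain state types, roughly like this (a sketch only: the config layout, file name, and ARN keys are invented, but LambdaStep, Chain, and Workflow are the SDK's existing building blocks):

import yaml

from stepfunctions.steps import Chain, LambdaStep
from stepfunctions.workflow import Workflow

# Hypothetical per-repo configuration describing the deployment stages.
with open("pipeline.yml") as f:
    config = yaml.safe_load(f)

# One Lambda-backed state per configured stage.
steps = [
    LambdaStep(
        state_id=stage["name"],
        parameters={"FunctionName": stage["function_arn"]},
    )
    for stage in config["stages"]
]

workflow = Workflow(
    name=config["name"],
    definition=Chain(steps),
    role=config["execution_role_arn"],  # IAM role assumed by Step Functions
)
workflow.create()

None of this touches sagemaker, yet installing the package still drags it in today.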

The main problem right now is that I have to build the Lambda inside a Docker container running Terraform. This container doesn't have gcc, g++, or make by default, so I have to install them at runtime to avoid rebuilding a bunch of container images. numpy and pandas also take a long time to install 😞

Luckily, I saw the announcement a few days ago about Lambda being able to use ECR images from other accounts, so that should mostly alleviate the problem for me once I get it implemented (we're using an account per team, per environment).

Here is the current work in progress, if you're interested: nsbno/terraform-aws-delivery-pipeline.

DrewThomasCorps commented 2 years ago

My team would love to see this work completed and merged. We prefer writing Amazon States Language in Python as opposed to JSON, and this SDK gives us the functionality to do that. An implementation detail prevents us from using the CDK where the step functions are defined, so making this SDK lighter would make it easier to use.
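
To illustrate the appeal (a small sketch using the SDK's public Pass and Chain API; the state names are made up), definitions written in Python render straight to ASL JSON, with no constructs, IDs, or stacks involved:

from stepfunctions.steps import Chain, Pass

# Two trivial states chained together, rendered to Amazon States Language.
definition = Chain([Pass("PrepareRelease"), Pass("Deploy")])
print(definition.to_json(pretty=True))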

wong-a commented 2 years ago

An implementation detail prevents us from using the CDK where the step functions are defined

Can you elaborate on what issue prevents you from using CDK? @DrewThomasCorps

DrewThomasCorps commented 2 years ago

For the public record, this conversation was held offline. CDK also adds some bloat, and in our case it's preferable to write Amazon States Language with a simple library rather than requiring constructs, IDs, and stacks.

This is not a pressing concern, and there are workarounds in place, so we don't need this resolved immediately, but we will be keeping an eye on it.