aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Improved ways of storing local code in S3 for ProcessingSteps #4879

Open HFulcher opened 1 month ago

HFulcher commented 1 month ago

Describe the feature you'd like

Currently, when using a Processor such as SKLearnProcessor together with a ProcessingStep, there is no way to specify where a local code= file should be stored in S3. This can lead to clutter in S3 buckets. The current behaviour places the code in the default_bucket of the SageMaker session, like so:

s3://{default_bucket}/auto_generated_hash/input/code/preprocess.py
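For reference, a minimal sketch of the current usage pattern that produces this layout; the role ARN and script name are placeholders, and the exact S3 prefix is generated by the SDK rather than by anything in this code:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

# Placeholder execution role for illustration only.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# The local script is uploaded automatically when the pipeline is created;
# ProcessingStep exposes no parameter to control the S3 prefix it goes to.
step = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",  # lands under s3://{default_bucket}/{auto_generated_hash}/input/code/
)

pipeline = Pipeline(name="example-pipeline", steps=[step])
```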

A better user experience would be to let the user define exactly where the code should be uploaded, so that all files for a given run can be grouped together. For example:

s3://{specified_bucket}/{project_name}/PIPELINE_EXECUTION_ID/code/preprocess.py
s3://{specified_bucket}/{project_name}/PIPELINE_EXECUTION_ID/data/train.csv
s3://{specified_bucket}/{project_name}/PIPELINE_EXECUTION_ID/model/model.pkl
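Processing outputs can already be routed to run-specific prefixes like the ones above by building a destination from pipeline execution variables; a sketch of that existing pattern is below (bucket and project names are placeholders), which makes the code= file the odd one out:

```python
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join

# Placeholder bucket/project names; only the destination pattern matters here.
train_output = ProcessingOutput(
    output_name="train",
    source="/opt/ml/processing/output/train",
    destination=Join(
        on="/",
        values=[
            "s3://specified_bucket/project_name",
            ExecutionVariables.PIPELINE_EXECUTION_ID,
            "data",
        ],
    ),
)
# There is no equivalent destination-style parameter for the code= script
# passed to ProcessingStep.
```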

This should already be possible with the FrameworkProcessor by using its code_location= parameter, but that parameter appears to be ignored when the processor is used within a ProcessingStep.
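A minimal sketch of that attempt, assuming a FrameworkProcessor wrapping the scikit-learn estimator (role ARN, bucket, and prefix are placeholders):

```python
from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import ProcessingStep

# Placeholder execution role for illustration only.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.2-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    code_location="s3://specified_bucket/project_name/code",  # expected upload prefix
)

# When the processor is wrapped in a ProcessingStep, the script still appears
# to be uploaded under the session's default bucket rather than code_location.
step = ProcessingStep(
    name="Preprocess",
    processor=processor,
    code="preprocess.py",
)
```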