aws-samples / sagemaker-custom-project-templates

MIT No Attribution

[Feature] A 'feature pipeline' template #72

Open athewsey opened 1 year ago

athewsey commented 1 year ago

For my use case, we're looking for a deployable project template/stack which:

We started from the SageMaker built-in 'model building and training' template, which already helps a lot with the SageMaker Pipeline / processing jobs aspect... but as far as I can tell there isn't one that covers managing a feature group (schema, tags, feature-level metadata) via CI/CD, which is a shame.

I've put some initial thought into what a transferable sample could look like, but haven't had time yet to draft up an attempt!

Design ideas

Feature Group management

  1. As of today, the CloudFormation AWS::SageMaker::FeatureGroup resource provides a native mechanism to create/update/delete Feature Groups, but as far as I can tell it doesn't support feature-level metadata (parameters and descriptions), which is important for enterprise usage.
  2. I suspect different organizations might wish to apply stricter safety checks than the CloudFormation default for actions that would result in deleting and replacing the Feature Group (I assume CFn wouldn't automatically replicate the data in this case?).

For these two reasons, I was leaning towards defining a custom feature group + feature metadata configuration (e.g. in JSON/YAML), and using custom Python code in CodeBuild to reconcile the current vs. target feature group config and either perform the necessary updates or fail.
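To make the reconcile step concrete, here's a minimal sketch of the "diff" logic such a CodeBuild script might use, assuming feature definitions in the same shape boto3's `describe_feature_group` returns them. The function name and config shape are hypothetical, not from any existing sample; the additions it returns would feed `update_feature_group`'s `FeatureAdditions` parameter, while unsupported changes fail the build:

```python
# Hypothetical reconciliation step: compare the Feature Group's current
# feature definitions against the target config, and plan which features
# to add. Feature removal and type changes aren't supported by the
# service, so those cases fail the build instead.

def plan_feature_updates(current, target):
    """Return the feature definitions to add, or raise on unsupported changes.

    Both arguments are lists of dicts like
    {"FeatureName": "age", "FeatureType": "Integral"}.
    """
    current_types = {f["FeatureName"]: f["FeatureType"] for f in current}
    target_types = {f["FeatureName"]: f["FeatureType"] for f in target}

    removed = set(current_types) - set(target_types)
    if removed:
        raise ValueError(f"Removing features is not supported: {sorted(removed)}")

    changed = sorted(
        name for name, ftype in current_types.items()
        if name in target_types and target_types[name] != ftype
    )
    if changed:
        raise ValueError(f"Changing feature types is not supported: {changed}")

    # Only net-new features can be applied (e.g. via UpdateFeatureGroup)
    return [f for f in target if f["FeatureName"] not in current_types]


current = [{"FeatureName": "customer_id", "FeatureType": "String"}]
target = current + [{"FeatureName": "age", "FeatureType": "Integral"}]
print(plan_feature_updates(current, target))
# -> [{'FeatureName': 'age', 'FeatureType': 'Integral'}]
```

Feature-level descriptions and parameters could be reconciled the same way (via `update_feature_metadata`), but unlike schema changes those are freely mutable, so they'd just be applied rather than gated.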

To avoid overloading the parameters of the Service Catalog template itself, I was thinking such a template could take only basic parameters like the feature group name, ID field name and type, etc., and not create any actual features until the initial 'seed code' config file is updated to add them (since adding features to a group is a supported operation but removing them is not).
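For illustration, the seed config might look something like the below - this schema is purely a suggestion (field names borrowed from the CreateFeatureGroup API where possible), with the `Features` list starting empty at project creation and features added via pull requests:

```json
{
  "FeatureGroupName": "customers",
  "RecordIdentifierFeatureName": "customer_id",
  "EventTimeFeatureName": "event_time",
  "Features": [
    {
      "FeatureName": "age",
      "FeatureType": "Integral",
      "Description": "Customer age in years",
      "Parameters": { "owner": "data-engineering" }
    }
  ]
}
```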

Feature transformation and ingestion

In our environment we're currently using two-step SageMaker Pipelines for feature engineering: one processing job to extract and transform raw data (from S3), and a separate one to do the Feature Store ingestion. This keeps each component as easily re-usable as possible and lets us right-size the infrastructure for each separately. We've so far experimented with both PySparkProcessor and Data Wrangler for these steps, so that general architecture was my assumed pattern for the actual feature ingestion pipeline.
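A rough sketch of that two-step pipeline in the SageMaker Python SDK, for reference - the script paths, role ARN, and instance sizing are all placeholders, and this is just one way to wire it up, not a definitive implementation:

```python
# Sketch only: a two-step pipeline with a Spark transform job feeding a
# lighter-weight ingestion job. All names/paths below are hypothetical.
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

session = PipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Step 1: extract + transform raw data from S3 (Spark for scale-out)
transform = PySparkProcessor(
    base_job_name="feature-transform",
    framework_version="3.1",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
transform_step = ProcessingStep(
    name="TransformRawData",
    step_args=transform.run(submit_app="pipelines/transform.py"),
)

# Step 2: ingest transformed records into the Feature Group (smaller infra)
ingest = SKLearnProcessor(
    base_job_name="feature-ingest",
    framework_version="1.2-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=session,
)
ingest_step = ProcessingStep(
    name="IngestToFeatureStore",
    step_args=ingest.run(code="pipelines/ingest.py"),
    depends_on=[transform_step],
)

pipeline = Pipeline(
    name="FeaturePipeline",
    steps=[transform_step, ingest_step],
    sagemaker_session=session,
)
```

Splitting the steps this way is what lets the Spark transform scale out while the ingestion job (which is mostly API-call-bound) stays on cheaper instances.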

One or many feature groups?

I guess there's probably no harm in such a template theoretically supporting multiple pipelines and multiple feature groups - just like the model building & training template currently supports multiple pipelines... but our planned process at this stage is pretty much to deploy a separate project instance per feature group.