aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Example Request] SM Pipeline with built-in LightGBM, AutoGluon, CatBoost, TabTransformer algorithm #3693

Open athewsey opened 1 year ago

athewsey commented 1 year ago

Describe the use case example you want to see

A SageMaker Pipeline to train, evaluate, and register a model using one (or more?) of the new JumpStart-based built-in algorithms for tabular data, preferably via the SageMaker SDK + PipelineSession.

How would this example be used? Please describe.

The new JumpStart-based tabular built-in algorithms (AutoGluon-Tabular, CatBoost, LightGBM, TabTransformer) have some extra usage complexities beyond XGBoost.

We have sample notebooks available for these algorithms, usually listed on the algorithm doc pages themselves (e.g. here for AutoGluon). But as far as I've found, the only samples for SM Pipelines tend to be XGBoost-based or use custom models.

The extra complexity (around image, script, and model artifact URIs in particular) can make it challenging for customers who aren't yet familiar with script mode (and are only trying out and comparing built-in algorithms) to get started with these more advanced tabular algorithms: it's not straightforward today to take an XGBoost sample and just plug in a different algorithm name.

So I suggest it'd be helpful to either extend an existing sample or add a new one, showing how pipelining translates from XGBoost to the other tabular algorithms.
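For concreteness, a minimal sketch of the difference (the model ID, version string, and instance type below are illustrative, not prescriptive): classic XGBoost needs one artifact URI, while the JumpStart-based algorithms need three, keyed by model ID rather than framework name.

```python
from sagemaker import image_uris, script_uris, model_uris

# Classic built-in XGBoost: a single image URI is all you need
xgb_image = image_uris.retrieve(framework="xgboost", region="us-east-1", version="1.5-1")

# JumpStart-based tabular algorithms: three artifacts, keyed by model ID
model_id, model_version = "lightgbm-classification-model", "*"
lgbm_image = image_uris.retrieve(
    region=None,      # inferred from your configured region
    framework=None,   # inferred from the model ID
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type="ml.m5.xlarge",
)
lgbm_script = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
lgbm_model = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)
```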

Describe which SageMaker services are involved

Describe what other services (other than SageMaker) are involved

Describe which dataset could be used. Provide its location in s3://sagemaker-sample-files or another source.

anand086 commented 8 months ago

+1

Eduarcher commented 2 months ago

Since I was able to put together a simple pipeline with LightGBM using some of the docs, I'm sharing it here for anyone in need:

```python
from sagemaker import Session, image_uris, script_uris, model_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

sess = Session()
aws_region = sess.boto_region_name  # e.g. "us-east-1"
aws_role = "AWS_ROLE"  # replace with your SageMaker execution role ARN
train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Sample multiclass tabular dataset hosted in the public JumpStart bucket
training_data_prefix = "training-datasets/tabular_multiclass/"
training_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}train"
validation_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}validation"
# S3 location where the training job writes model artifacts
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Retrieve the default hyperparameters for training the model
hyperparams = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparams["num_boost_round"] = "500"

# Retrieve the training container image; region and framework are inferred from the model ID
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)
# Retrieve the pre-trained model artifact to fine-tune from
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Create SageMaker Estimator instance
lgbm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1, # for distributed training, specify an instance_count greater than 1
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location
)

# Training step for the pipeline (newer SDK versions prefer step_args; see the note below)
step_train = TrainingStep(
    name="LGBMTraining",
    estimator=lgbm_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=training_dataset_s3_path,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=validation_dataset_s3_path,
            content_type="text/csv"
        )
    }
)

# Assemble, upsert, and start the pipeline
pipeline = Pipeline(
    name="TestLGBM",
    steps=[step_train],
    sagemaker_session=sess
)
pipeline.upsert(role_arn=aws_role)
start_response = pipeline.start()
```
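A note on the PipelineSession the issue asks for: with newer versions of the SageMaker SDK, the preferred pattern is to bind the estimator to a PipelineSession and pass step_args to TrainingStep instead of estimator/inputs. A minimal sketch of that variant, reusing the URIs, hyperparameters, and dataset paths retrieved above:

```python
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

# Same estimator as above, but bound to the PipelineSession so that
# .fit() returns step arguments instead of launching a training job
lgbm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparams,
    output_path=s3_output_location,
    sagemaker_session=pipeline_session,
)

step_train = TrainingStep(
    name="LGBMTraining",
    step_args=lgbm_estimator.fit(
        inputs={
            "train": TrainingInput(s3_data=training_dataset_s3_path, content_type="text/csv"),
            "validation": TrainingInput(s3_data=validation_dataset_s3_path, content_type="text/csv"),
        }
    ),
)

pipeline = Pipeline(name="TestLGBM", steps=[step_train], sagemaker_session=pipeline_session)
```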