aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Example Request] SM Pipeline with built-in LightGBM, AutoGluon, CatBoost, TabTransformer algorithm #3693

Open athewsey opened 1 year ago

athewsey commented 1 year ago

Describe the use case example you want to see

A SageMaker Pipeline to train, evaluate, and register a model using one (or more?) of the new JumpStart-based built-in algorithms for tabular data, preferably via the SageMaker SDK + PipelineSession.

How would this example be used? Please describe.

The new JumpStart-based tabular built-in algorithms (AutoGluon-Tabular, CatBoost, LightGBM, TabTransformer) have some extra usage complexities beyond XGBoost.

We have sample notebooks available for these algorithms, usually listed on the algorithm doc pages themselves (e.g. here for AutoGluon). But as far as I've found, the only samples for SM Pipelines tend to be XGBoost-based or use custom models.

The extra complexity (around image, script, and model artifact URIs in particular) can make it challenging for customers who aren't yet familiar with script mode (and are only trying out and comparing built-in algorithms) to get started with these more advanced tabular algorithms: it's not straightforward today to take an XGBoost sample and just plug in a different algorithm name.

So I suggest it'd be helpful to either extend an existing sample or add a new one, showing how pipelining translates from XGBoost to the other tabular algorithms.
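For concreteness, a minimal sketch of the difference (the model ID, version string, and instance type below are illustrative, not prescriptive): classic XGBoost needs one artifact URI, while the JumpStart-based algorithms need three, keyed by model ID rather than framework name.

```python
from sagemaker import image_uris, script_uris, model_uris

# Classic built-in XGBoost: a single image URI is all you need
xgb_image = image_uris.retrieve(framework="xgboost", region="us-east-1", version="1.5-1")

# JumpStart-based tabular algorithms: three artifacts, keyed by model ID
model_id, model_version = "lightgbm-classification-model", "*"
lgbm_image = image_uris.retrieve(
    region=None,      # inferred from your configured region
    framework=None,   # inferred from the model ID
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type="ml.m5.xlarge",
)
lgbm_script = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
lgbm_model = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)
```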

Describe which SageMaker services are involved

Describe what other services (other than SageMaker) are involved

Describe which dataset could be used. Provide its location in s3://sagemaker-sample-files or another source.

anand086 commented 8 months ago

+1

Eduarcher commented 2 months ago

Since I was able to put together a simple pipeline with LightGBM using some of the docs, I'm sharing it here for anyone in need:

```python
from sagemaker import Session, image_uris, script_uris, model_uris, hyperparameters
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

sess = Session()
aws_region = sess.boto_region_name  # e.g. "us-east-1"
aws_role = "AWS_ROLE"  # replace with your SageMaker execution role ARN
train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.xlarge"

# Sample multiclass tabular dataset hosted in the public JumpStart bucket
training_data_prefix = "training-datasets/tabular_multiclass/"
training_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}train"
validation_dataset_s3_path = f"s3://jumpstart-cache-prod-{aws_region}/{training_data_prefix}validation"
# S3 location where the training job writes model artifacts
output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tabular-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Retrieve the default hyperparameters for training the model
hyperparams = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparams["num_boost_round"] = "500"

# Retrieve the training container image; region and framework are inferred from the model ID
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=train_model_id,
    model_version=train_model_version,
    image_scope=train_scope,
    instance_type=training_instance_type
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
)
# Retrieve the pre-trained model artifact to fine-tune from
train_model_uri = model_uris.retrieve(
    model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
)

# Create SageMaker Estimator instance
lgbm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1, # for distributed training, specify an instance_count greater than 1
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location
)

# Training step for the pipeline (newer SDK versions prefer step_args; see the note below)
step_train = TrainingStep(
    name="LGBMTraining",
    estimator=lgbm_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=training_dataset_s3_path,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=validation_dataset_s3_path,
            content_type="text/csv"
        )
    }
)

# Assemble, upsert, and start the pipeline
pipeline = Pipeline(
    name="TestLGBM",
    steps=[step_train],
    sagemaker_session=sess
)
pipeline.upsert(role_arn=aws_role)
start_response = pipeline.start()
```
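A note on the PipelineSession the issue asks for: with newer versions of the SageMaker SDK, the preferred pattern is to bind the estimator to a PipelineSession and pass step_args to TrainingStep instead of estimator/inputs. A minimal sketch of that variant, reusing the URIs, hyperparameters, and dataset paths retrieved above:

```python
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

# Same estimator as above, but bound to the PipelineSession so that
# .fit() returns step arguments instead of launching a training job
lgbm_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparams,
    output_path=s3_output_location,
    sagemaker_session=pipeline_session,
)

step_train = TrainingStep(
    name="LGBMTraining",
    step_args=lgbm_estimator.fit(
        inputs={
            "train": TrainingInput(s3_data=training_dataset_s3_path, content_type="text/csv"),
            "validation": TrainingInput(s3_data=validation_dataset_s3_path, content_type="text/csv"),
        }
    ),
)

pipeline = Pipeline(name="TestLGBM", steps=[step_train], sagemaker_session=pipeline_session)
```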