aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] `text/csv; charset=utf-8` is not supported in SageMaker Pipeline with sklearn and xgboost models #3235

Open bevhanno opened 2 years ago

bevhanno commented 2 years ago

Hi, I have created a SageMaker PipelineModel using an SKLearn model followed by an XGBoost model. I followed the instructions here to set the 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT' environment variable, but I'm getting a

ValueError: Content type text/csv; charset=utf-8 is not supported. error when running the batch transform job on the second (xgboost) container, which follows the sklearn container.

My pipeline code looks like the following:

import os

from sagemaker.sklearn.model import SKLearnModel
from sagemaker.xgboost.model import XGBoostModel
from sagemaker.pipeline import PipelineModel

feature_model = SKLearnModel(
    model_data=feature_model_s3_path,
    sagemaker_session=sagemaker_session,
    role=role,
    framework_version="0.23-1",
    entry_point=os.path.join(BASE_DIR, "scripts", "sagemaker_feature_transform.py"),
)

# Ask the sklearn container to return CSV instead of its JSON default.
feature_model.env = {"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv"}

model = XGBoostModel(
    framework_version="1.0-1",
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=sagemaker_session,
    entry_point=os.path.join(BASE_DIR, "scripts", "sagemaker_xgb_training.py"),
    role=role,
)

pipeline_model = PipelineModel(
    name="pipeline-model",
    role=role,
    models=[feature_model, model],
    sagemaker_session=sagemaker_session,
)

As the inference output code of container 1 (sklearn), I am using:

from sagemaker_containers.beta.framework import encoders, worker

def output_fn(prediction, accept):
    if accept == "text/csv":
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        raise RuntimeError("{} accept type is not supported by this script.".format(accept))
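
As an aside (this is an assumption based on general Flask/Werkzeug behaviour, not on anything stated in this thread): worker.Response looks like a Flask response, and Werkzeug appends a charset parameter to text/* content types passed via mimetype=, which would explain where the "; charset=utf-8" suffix comes from. A standalone illustration with plain Flask:

from flask import Response

# mimetype= lets Werkzeug append the default charset for text/* types.
print(Response("1,2,3\n", mimetype="text/csv").headers["Content-Type"])
# -> text/csv; charset=utf-8

# content_type= is used verbatim, so no charset parameter is added.
print(Response("1,2,3\n", content_type="text/csv").headers["Content-Type"])
# -> text/csv

Whether worker.Response accepts content_type= in the same way is an assumption I have not verified against the sagemaker_containers source.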

As the inference input code of container 2 (xgb), I am using:

from sagemaker_xgboost_container import encoder as xgb_encoders

def input_fn(request_body, request_content_type):
    if request_content_type == "text/libsvm":
        return xgb_encoders.libsvm_to_dmatrix(request_body)
    elif request_content_type == "text/csv":
        return xgb_encoders.csv_to_dmatrix(request_body)
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))

It seems like even though I am forcing the output content type of container 1 to "text/csv", what arrives in container 2 is an unknown "text/csv; charset=utf-8" content type. Any idea what I am doing wrong?

Thank you for your help!

bevhanno commented 2 years ago

I solved the problem by adding the following code to the xgboost inference script "sagemaker_xgb_training.py":

from sagemaker_xgboost_container import encoder as xgb_encoders

def rchop(s, suffix):
    # Strip a single trailing occurrence of `suffix` from `s`, if present.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

def input_fn(request_body, request_content_type):
    if request_content_type == "text/csv; charset=utf-8":
        # The body arrives as bytes: decode it and drop the trailing newline
        # before handing it to the CSV-to-DMatrix encoder.
        request_body = request_body.decode("utf-8")
        request_body = rchop(request_body, "\n")
        return xgb_encoders.csv_to_dmatrix(request_body)
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))

This adds the missing "text/csv; charset=utf-8" content type, decodes the request body, and removes a trailing "\n" character before calling the xgb_encoders CSV encoder.
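
A slightly more general variant (a sketch building on this workaround, not something posted in the thread) strips any MIME parameters such as the charset before matching, so that both "text/csv" and "text/csv; charset=utf-8" are handled by the same branch:

from sagemaker_xgboost_container import encoder as xgb_encoders

def input_fn(request_body, request_content_type):
    # Drop MIME parameters such as "; charset=utf-8" and normalize case.
    media_type = request_content_type.split(";")[0].strip().lower()

    if media_type == "text/csv":
        if isinstance(request_body, bytes):
            request_body = request_body.decode("utf-8")
        return xgb_encoders.csv_to_dmatrix(request_body.rstrip("\n"))
    elif media_type == "text/libsvm":
        return xgb_encoders.libsvm_to_dmatrix(request_body)
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))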

Kyparos commented 1 year ago

I am experiencing the same issue, but with a sklearn preprocessor -> LGBM pipeline.

alishh commented 1 year ago

@Kyparos Did you resolve your issue? I am facing the same problem with a sklearn preprocessor -> SageMaker LGBM as well.

oleg131 commented 1 year ago

I believe the problem is due to Pipeline Models honoring the input and output content types, but not the content types in between containers.

I.e., when you start a Batch Transform job, you can set the input content type to text/csv and the output content type to text/csv. However, this does not set the output content type to text/csv on the first container (you can verify this in the logs), so it reverts to application/json, which makes the second container fail.
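
To make that concrete, here is a hedged sketch of where each setting applies when launching the batch transform from the pipeline model; the instance type, S3 paths, and values are illustrative assumptions, not taken from this thread:

transformer = pipeline_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept="text/csv",  # governs only the OUTPUT content type of the last container
)
transformer.transform(
    data="s3://my-bucket/transform-input/data.csv",
    content_type="text/csv",  # governs only the INPUT content type of the first container
    split_type="Line",
)

Neither setting controls the content type used on the hop from container 1 to container 2; per the observation above, that hop falls back to the first container's defaults, which is why SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT is set on feature_model and why container 2 still has to accept "text/csv; charset=utf-8".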