aws / sagemaker-sparkml-serving-container

This code is used to build & run a Docker container for performing predictions against a Spark ML Pipeline.
Apache License 2.0

How to pass multiple inputs into SparkML model in a single request #6

Open vincentnqt opened 5 years ago

vincentnqt commented 5 years ago

I am trying to create an Inference Pipeline model that combines a SparkML model with a BlazingText model for a text classification task. The SparkML model pre-processes the input texts. I configured SparkML as follows:

Accept: application/jsonlines;data=text

Schema:

{"input": [{"name": "description", "type": "string"}], 
"output": {"name": "tokenized_description", "type": "string", "struct": "array"}}

However, when I try to pass multiple inputs into the model, it only returns a single prediction output:

Sample input: {"data": ["this tone is catchy", "the lyrics are terrible"]}

Output: {"label": ["__label__1"], "prob": [1.0000100135803223]}

It should instead return two prediction outputs: [{"label": ["__label__1"], "prob": [1.0000100135803223]}, {"label": ["__label__0"], "prob": [1.0000100195819324]}]

I tried this with a stand-alone BlazingText model and it was able to return multiple prediction outputs, so I think it must be an issue with SparkML.

Does anyone know how I can pass multiple inputs into SparkML, or is it possible at all?

agodet commented 5 years ago

Hi Vincent, we are facing the same problem. How did you resolve it? Thanks.

orchidmajumder commented 5 years ago

Hi, at this point, there is no option to pass multiple rows as part of a single request. You can either pass them one by one or use SageMaker Batch Transform which takes care of automatically batching your requests.
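The one-by-one workaround can be sketched as a small client-side loop. A minimal sketch: the splitting helper takes any `invoke` callable, so it can be exercised without AWS; the commented-out production `invoke` uses boto3's `sagemaker-runtime` `invoke_endpoint` with a hypothetical endpoint name.

```python
import json

def predict_rows(rows, invoke):
    """Send each row to the endpoint separately and collect the
    per-row predictions, since the SparkML container accepts only
    one row per request.

    `invoke` is any callable that takes a single serialized row and
    returns the response body as a string; in production it would
    wrap a sagemaker-runtime invoke_endpoint call.
    """
    outputs = []
    for row in rows:
        outputs.append(invoke(row))  # one request per row
    return outputs

# Production invoke using boto3 (endpoint name is hypothetical):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# def invoke(body):
#     resp = runtime.invoke_endpoint(
#         EndpointName="sparkml-blazingtext-pipeline",  # hypothetical
#         ContentType="text/csv",
#         Accept="application/jsonlines;data=text",
#         Body=body,
#     )
#     return resp["Body"].read().decode("utf-8")
```

This trades one request for N, so latency grows linearly with the number of rows; for large inputs the Batch Transform route is the better fit.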

vincentnqt commented 5 years ago

> Hi, at this point, there is no option to pass multiple rows as part of a single request. You can either pass them one by one or use SageMaker Batch Transform which takes care of automatically batching your requests.

Hi, thank you for the information. Is this single-input limitation specific to SparkML, or does it apply to Inference Pipelines in general? I am trying to find out whether it's possible to pass multiple records into an Inference Pipeline at all.

orchidmajumder commented 5 years ago

This restriction applies only to the SparkML container, not to Inference Pipelines in general. As long as your Inference Pipeline does not contain this SparkML container, there is no single-datapoint-per-request restriction.

agodet commented 5 years ago

I pushed a new merge request to address this question. If you're interested in processing multiple records in a single request, you can now use https://github.com/agodet/sagemaker-sparkml-serving-container/tree/csv_and_json_multilines. It supports multi-line CSV as well as JSON Lines input.