aws / sagemaker-sparkml-serving-container

This code is used to build & run a Docker container for performing predictions against a Spark ML Pipeline.
Apache License 2.0

SparkMLModel SAGEMAKER_SPARKML_SCHEMA can only accept 16 features #12

Open rchazelle opened 4 years ago

rchazelle commented 4 years ago

Hello, I would like to understand why this limitation is in place. Presumably most machine learning models take in much more than 16 features.

I created a model with over 100 features. I tried to pass all of those features in my SAGEMAKER_SPARKML_SCHEMA but got the following error:

Traceback (most recent call last):
  File "<stdin>", line 46, in deploy_model
  File "/usr/local/lib/python2.7/site-packages/sagemaker/model.py", line 479, in deploy
    self._create_sagemaker_model(instance_type, accelerator_type, tags)
  File "/usr/local/lib/python2.7/site-packages/sagemaker/model.py", line 195, in _create_sagemaker_model
    tags=tags,
  File "/usr/local/lib/python2.7/site-packages/sagemaker/session.py", line 2125, in create_model
    self.sagemaker_client.create_model(**create_model_request)
  File "/usr/local/lib/python2.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: 1 validation error detected: Value '{SAGEMAKER_SPARKML_SCHEMA={"input": [list_of_column_names_and_types_omitted_due_to_privacy], "output": {"type": "double", "name": "prediction"}}}' at 'primaryContainer.environment' failed to satisfy constraint: Map value must satisfy constraint: [Member must have length less than or equal to 1024, Member must have length greater than or equal to 0, Member must satisfy regular expression pattern: [\S\s]*]

list_of_column_names_and_types_omitted_due_to_privacy stands in for the correctly formatted input; the names are nowhere near 1024 characters, all of them are under 50 characters.
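For reference, the model was created along these lines (a minimal sketch with hypothetical names and paths, not my exact code; the schema is passed through the env map of SparkMLModel from the SageMaker Python SDK):

```python
import json
from sagemaker.sparkml.model import SparkMLModel

# Hypothetical schema; the real column names and types are omitted for privacy.
schema = {
    "input": [
        {"name": "feature_1", "type": "double"},
        {"name": "feature_2", "type": "double"},
    ],
    "output": {"name": "prediction", "type": "double"},
}

model = SparkMLModel(
    model_data="s3://my-bucket/sparkml/model.tar.gz",  # hypothetical S3 path
    role="MySageMakerExecutionRole",                   # hypothetical IAM role
    env={"SAGEMAKER_SPARKML_SCHEMA": json.dumps(schema)},
)
model.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
```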

This led me to do some googling, and I found the following at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html under create_model:

Environment (dict) -- The environment variables to set in the Docker container. Each key and value in the Environment string to string map can have length of up to 1024. We support up to 16 entries in the map.

So I reduced the number of features to 15 and it works. How can I make this work with 100+ features? My pipeline includes a series of StringIndexers -> OneHotEncoderEstimators.

I tried increasing it to 17, and that worked. I tried 53 next, and that didn't work. 117 was what I first tried, and that also doesn't work. So the cutoff isn't exactly 16 entries; rereading the error, the constraint that actually fails is the 1024-character length limit on the environment variable's value (the whole schema is a single entry in the map), and the schema JSON grows with every feature.
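A quick way to see where the cutoff lands is to measure the serialized schema against that 1024-character limit (a minimal sketch with hypothetical column names, not my real schema):

```python
import json

# Hypothetical feature names; the real ones are omitted for privacy.
feature_names = ["feature_{}".format(i) for i in range(117)]

schema = {
    "input": [{"name": name, "type": "double"} for name in feature_names],
    "output": {"name": "prediction", "type": "double"},
}

schema_json = json.dumps(schema)
print(len(schema_json))          # grows by roughly 40 characters per feature
print(len(schema_json) <= 1024)  # False long before 117 features
```

With ~10-character feature names this stays under 1024 characters only up to roughly 20 features and blows past it well before 53, which matches what I'm seeing.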

orchidmajumder commented 4 years ago

For you, right now, I feel the best bet would be to build a Docker image using the code from this repository and then define the schema as an environment variable in your Dockerfile itself. The limitation you are facing comes from the SageMaker platform, not from this library per se.
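Something like this (a sketch; the base image tag and the schema are hypothetical, the point is just that the schema lives in the image rather than in the CreateModel request):

```dockerfile
# Hypothetical tag for an image built from this repository.
FROM sagemaker-sparkml-serving:2.4

# Baking the schema into the image sidesteps the 1024-character limit
# on values in the CreateModel environment map.
ENV SAGEMAKER_SPARKML_SCHEMA="{\"input\": [{\"name\": \"feature_1\", \"type\": \"double\"}, {\"name\": \"feature_2\", \"type\": \"double\"}], \"output\": {\"name\": \"prediction\", \"type\": \"double\"}}"
```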

rchazelle commented 4 years ago

Sweet, thanks for the response. Is there a GitHub for that, or should I reach out to AWS directly?

orchidmajumder commented 4 years ago

That's part of the standard AWS SDK for SageMaker. You probably need to reach out to AWS so they can pass the request on to the appropriate service team.

chelseacjole1 commented 4 years ago

Same issue here