aws / sagemaker-tensorflow-training-toolkit

Toolkit for running TensorFlow training scripts on SageMaker. Dockerfiles used for building SageMaker TensorFlow Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Python 3 : sagemaker-tensorflow-scriptmode:1.11.0-cpu-py3 fails #130

elangovana closed this issue 4 years ago

elangovana commented 5 years ago

Hi team, is Python 3 officially supported?

I have tried the following samples provided as part of the SageMaker sample notebooks:

a. tensorflow_abalone_age_predictor_using_keras

b. tensorflow_keras_CIFAR10

I have also tried with my own entry point script.

Steps to reproduce:

  1. Change the estimator call to use py3, then train and deploy the model for the two sample notebooks, tensorflow_abalone_age_predictor_using_keras and tensorflow_keras_CIFAR10:

    estimator = TensorFlow(entry_point='cifar10_cnn.py',
                           role=role,
                           framework_version='1.11.0',
                           py_version='py3',
                           hyperparameters={'learning_rate': 1e-4, 'decay': 1e-6},
                           train_instance_count=1,
                           train_instance_type='ml.c4.xlarge')
    estimator.fit(inputs)
    predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

The training completes successfully using the container sagemaker-tensorflow-scriptmode:1.11.0-cpu-py3, but the deploy fails with the error:

    ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://**mybucket...**/sagemaker-tensorflow-scriptmode-2018-12-01-02-08-33-766/output/model.tar.gz.
  2. I tried using my own entry point file (see below), where I save the model to the SageMaker model path, i.e. the value of os.environ.get('SM_MODEL_DIR', None). In this case estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge') is able to create the model, but the endpoint fails with the error:

       Traceback (most recent call last):
         File "/sagemaker/serve.py", line 178, in <module>
           ServiceManager().start()
         File "/sagemaker/serve.py", line 152, in start
           self._create_tfs_config()
         File "/sagemaker/serve.py", line 53, in _create_tfs_config
           raise ValueError('no SavedModel bundles found!')

Note: I have a Keras snapshot, and the serving container (sagemaker-tensorflow-serving:1.11.0-cpu) doesn't seem to take my model_fn into account; it uses its own loading logic, which looks for a file named "saved_model.pb". How can I override the default model loading?

import argparse
import glob
import json
import logging
import os
import sys

import numpy as np

from constants import SCALE_FACTOR

def model_fn(model_dir):
    from keras.models import load_model
    # locate model file from dir
    model_files = glob.glob(os.path.join(model_dir, '*.hdf5'), recursive=False)
    assert len(model_files) == 1, "Expecting just one model file but found {} files, {}".format(len(model_files),
                                                                                                model_files)

    # load model
    model = load_model(model_files[0])

    return model

def input_fn(input_data, content_type):
    """
    Deserialize the request body into a numpy array.
    :param input_data: the raw request payload (bytes)
    :param content_type: the request content type
    :return: a numpy array
    """
    if content_type == "application/json":
        data = np.array(json.loads(input_data.decode("utf-8")))
    else:
        raise Exception("The content_type {} is not supported".format(content_type))
    return data

def predict_fn(input_object, model):
    return model.predict(input_object) * SCALE_FACTOR

# Serialize the prediction result into the desired response content type
def output_fn(prediction, response_content_type):
    logger = logging.getLogger(__name__)
    logger.debug("Calling output_fn for content_type {}".format(response_content_type))
    prediction_list = prediction.tolist()
    if response_content_type in ("application/json", "*/*", ""):
        return json.dumps(prediction_list)

    raise Exception("The response content_type {} is not supported".format(response_content_type))

if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    parser.add_argument("--traindata", help="The input file wrt to the training directory", required=True)
    parser.add_argument('--traindata-dir',
                        help='The directory containing training artifacts such as training data',
                        default=os.environ.get('SM_CHANNEL_TRAIN', "."))
    parser.add_argument("--outputdir", help="The output dir to save results",
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', "result_data")
                        )

    parser.add_argument("--model_dir", help="Do not use this.. required by sagemaker", default=None)

    parser.add_argument("--snapshot_dir", help="The directory to save the snapshot to..",
                        default=os.environ.get('SM_MODEL_DIR', None))

    parser.add_argument("--restore-model-file", help="The model file", default=None)
    parser.add_argument("--restore-weights-file", help="The weights file", default=None)
    parser.add_argument("--restore-iter-file", help="The iter file which contains the number of epochs last run",
                        default=None)
    parser.add_argument("--epochs", help="The number of epochs", default=10, type=int)
    parser.add_argument("--batch-size", help="The mini batch size", default=30, type=int)
    parser.add_argument("--log-level", help="Log level", default="INFO", choices={"INFO", "WARN", "DEBUG", "ERROR"})

    args = parser.parse_args()

    # Set up logging
    logging.basicConfig(level=logging.getLevelName(args.log_level), handlers=[logging.StreamHandler(sys.stdout)],
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

    # set up dir
    if not os.path.isdir(args.outputdir):
        os.makedirs(args.outputdir)

    # If no snapshot dir is provided, save the snapshot to the output dir
    model_snapshot_dir = args.snapshot_dir if args.snapshot_dir is not None else args.outputdir

    logging.info("Running with args {}".format(args.__dict__))

    # Start process here
    from train import Train

    trainer = Train()

    trainfilepath = os.path.join(args.traindata_dir, args.traindata)

    trainer.train(inputfile=trainfilepath, outputdir=args.outputdir, model_snapshot_dir=model_snapshot_dir,
                  restore_model_file=args.restore_model_file, restore_weights_file=args.restore_weights_file,
                  restore_epoch_file=args.restore_iter_file,
                  epochs=args.epochs, batch_size=args.batch_size)

icywang86rui commented 5 years ago

Hi,

Python 3 is officially supported with Script Mode only. You can find the documentation here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow

Script Mode requires a different format of training script, so some of the Legacy Mode scripts might not work with Script Mode. Script Mode simply runs the training script; it does not save the model itself, so the training script has to export the model to '/opt/ml/model' at the end of training for SageMaker to package it as the model artifact and upload it to S3, as sketched below.
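
For example, a minimal sketch of such an export, assuming a compiled tf.keras model (the toy model and the 'model.hdf5' file name below are placeholders, not from this thread):

    import os
    import tensorflow as tf

    # Stand-in for the real training code.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')

    # SageMaker sets SM_MODEL_DIR to /opt/ml/model inside the training container;
    # everything written there is packaged into model.tar.gz and uploaded to S3
    # when the job completes.
    model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    model.save(os.path.join(model_dir, 'model.hdf5'))

Note that an HDF5 file alone is not servable by the TensorFlow Serving container, as the next paragraph explains.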

As for serving, Script Mode containers only support serving through the TensorFlow Serving-based containers, which expose a REST API. Since TensorFlow Serving can only serve SavedModel (saved_model.pb) bundles, the model saved to /opt/ml/model has to be in that format (see the sketch below). We are working on some new sample notebooks for Script Mode at the moment. They will be available sometime this week or early next week.
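
For TF 1.11, one way to produce such a bundle is tf.saved_model.simple_save; here is a minimal sketch, assuming a compiled tf.keras model and the export/Servo/1 layout used in AWS's Keras deployment examples (the toy model is a placeholder):

    import os
    import tensorflow as tf

    # Stand-in for the trained Keras model.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')

    # TensorFlow Serving expects a numbered version directory containing
    # saved_model.pb and a variables/ folder.
    model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    export_dir = os.path.join(model_dir, 'export', 'Servo', '1')

    # Export the Keras graph and weights from the backend session.
    tf.saved_model.simple_save(
        tf.keras.backend.get_session(),
        export_dir,
        inputs={'inputs': model.input},
        outputs={'outputs': model.output})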

Thanks for using Script Mode and providing feedback. Please let us know if you have more questions.

ryanpeach commented 5 years ago

Where is an example of the Dockerfile that runs Script Mode with py3 (for modification)? All I see here are Python 2 containers.

laurenyu commented 5 years ago

@ryanpeach the script mode Dockerfiles are in the script-mode branch of this repo: https://github.com/aws/sagemaker-tensorflow-container/tree/script-mode/docker

jzhang-gp commented 5 years ago

I ran into the same issue during deployment. Is there any update on serving models trained with Script Mode?

If it's not ready, can we use a model trained with Script Mode outside SageMaker to run predictions?

Thanks!

laurenyu commented 4 years ago

(I realize this is a very late response, but just in case someone else stumbles across this issue looking for the answer...)

Models trained using the Script Mode image can be hosted using the TensorFlow Serving images: https://github.com/aws/sagemaker-tensorflow-serving-container
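
For example, a minimal deployment sketch with the 1.x SageMaker Python SDK's TensorFlow Serving Model class, assuming the model.tar.gz contains a SavedModel bundle (the S3 path and role name are placeholders):

    from sagemaker.tensorflow.serving import Model

    # Hypothetical S3 path to the model.tar.gz produced by the training job;
    # the archive must contain a SavedModel bundle (saved_model.pb + variables/).
    model = Model(model_data='s3://my-bucket/my-training-job/output/model.tar.gz',
                  role='my-sagemaker-execution-role',
                  framework_version='1.11')

    predictor = model.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')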