aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

SageMaker script mode inference model init needed #773

OmriPi opened this issue 5 years ago

OmriPi commented 5 years ago

Hi, first post here, so please bear with me if it's not exactly according to the rules; I'll try to provide more info if needed.

I have successfully trained a Keras model that uses a TensorFlow Hub model (specifically bert-tensorflow) as its first layer, using SageMaker script mode. To make this model work, the training script has to call these 4 lines:

    sess.run(tf.compat.v1.local_variables_initializer())
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(tf.compat.v1.tables_initializer())
    tf.keras.backend.set_session(sess)

as they are required to initialise the TensorFlow Hub BERT model. However, when the model is deployed to an endpoint, I get the following error:

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{ "error": "Error while reading resource variable bert_layer_module/bert/encoder/layer_5/attention/self/key/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/bert_layer_module/bert/encoder/layer_5/attention/self/key/kernel/N10tensorflow3VarE does not exist.\n\t [[{{node bert_layer/bert_layer_module_apply_tokens/bert/encoder/layer_5/attention/self/key/MatMul/ReadVariableOp}} = ReadVariableOp[_output_shapes=[[768,768]], dtype=DT_FLOAT, _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"](bert_layer_module/bert/encoder/layer_5/attention/self/key/kernel)]]" }". See https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/Test in account 124774543455 for more information.

This basically means that those lines were not run on the endpoint machine, so the variables were never initialised and TensorFlow can't find them. How can I make these lines run on the endpoint machine before prediction, given that they must be run once before doing inference?

I tried putting these lines in a separate inference.py script and passing it as the entry_point when creating the TensorFlow model:

    tf_model = TensorFlowModel(model_data=model,
                               role=sagemaker.get_execution_role(),
                               entry_point='inference.py',
                               source_dir='.',
                               py_version='py3',
                               env={'SAGEMAKER_REQUIREMENTS': 'requirements.txt'})

However, this didn't help. Moreover, when doing inference locally I don't run those initializer lines; instead I run:

    tf.keras.backend.manual_variable_initialization(True)
    model = tf.keras.models.load_model(checkpoint_file_name, custom_objects={'BertLayer': BertLayer})

because load_model handles the initialisation itself, and if I ran the initializers afterwards they would reset the loaded variables and the model would lose its training. Since there is no load_model call in the way script mode serving works, I don't quite understand how to do this properly.

I would really appreciate any help you can provide, as I've been struggling with this issue for quite a while now, and the documentation is very confusing because most of it is outdated and doesn't apply to script mode.

Thank you

icywang86rui commented 5 years ago

Here is the documentation on how to prepare the inference script -

https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#providing-python-scripts-for-prepos-processing

https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing
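Both links describe the same mechanism: an inference.py with input_handler/output_handler functions that wrap the request and response around TF Serving. A minimal sketch following the handler signatures in the linked sagemaker-tensorflow-serving-container README (the JSON passthrough logic here is only an illustration, not code from this issue):

    # inference.py -- pre/post-processing sketch for the SageMaker TFS container.
    import json


    def input_handler(data, context):
        """Turn the incoming request into the JSON body TF Serving's REST API expects."""
        if context.request_content_type == 'application/json':
            payload = json.loads(data.read().decode('utf-8'))
            # Assumption for illustration: wrap a single record as a one-element batch.
            instances = payload if isinstance(payload, list) else [payload]
            return json.dumps({'instances': instances})
        raise ValueError('Unsupported content type: {}'.format(context.request_content_type))


    def output_handler(data, context):
        """Return TF Serving's response to the client unchanged."""
        if data.status_code != 200:
            raise ValueError(data.content.decode('utf-8'))
        return data.content, context.accept_header

Note that these handlers only transform requests and responses outside the TF Serving process, so they cannot run Session-level initializers like the ones in the original post; that is why the discussion below turns to how the model is saved.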

jfenc91 commented 5 years ago

I just ran into this problem as well following the tensorflow serving example. I believe the problem is how the model is being saved. For my use case, the code below worked. However, it looks like you would probably need to add a few more things to the legacy_init_op.

    # `sess`, `model_path`, `inputs`, and `output` come from the surrounding training code.
    # Group the table initializer so TF Serving runs it when the SavedModel is loaded.
    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    tf.saved_model.simple_save(
        sess,
        model_path,
        inputs=inputs,
        outputs=output,
        legacy_init_op=legacy_init_op)

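For the TF Hub BERT case from the original post, the "few more things" would presumably be the remaining initializers that still have to run at load time; the trained (global) variables are restored from the SavedModel checkpoint and must not be re-initialized. A hedged sketch of that extended group:

    # Sketch: everything grouped here runs when TF Serving loads the SavedModel.
    # Do NOT include tf.global_variables_initializer() -- the trained weights are
    # restored from the checkpoint, and re-initializing them would wipe the training.
    legacy_init_op = tf.group(
        tf.tables_initializer(),
        tf.local_variables_initializer(),
        name='legacy_init_op')
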
chuyang-deng commented 5 years ago

Hi @jfenc91!

Yes, the problem is how the model is being saved. Our TensorFlow Serving container assumes a pre-trained model and only takes care of serving requests; we haven't provided much information about model saving. You are welcome to make contributions to our SageMaker Python SDK.
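In practice that means the initializers have to be baked into the exported SavedModel itself. A sketch using the TF 1.x SavedModelBuilder, which attaches a main_op that TF Serving runs right after restoring variables (the export path, signature keys, and single input/output tensors are assumptions for illustration; adjust them to your own training script):

    import tensorflow as tf

    # Sketch only: `sess` and `model` come from the trained Keras model;
    # '/opt/ml/model/1' assumes the SageMaker convention of a numeric version
    # directory under the model output path.
    builder = tf.saved_model.builder.SavedModelBuilder('/opt/ml/model/1')

    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'inputs': model.input},
        outputs={'scores': model.output})

    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature},
        # main_op runs after the variables are restored, so lookup tables
        # (e.g. the BERT vocab table from TF Hub) get initialized at serving time.
        main_op=tf.tables_initializer())

    builder.save()

If the Keras model has several inputs (input ids, mask, segment ids), map each of them by name in the signature instead of the single `model.input` shown here.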

luc-kalaora commented 4 years ago

Hi @OmriPi, like you I'm trying to deploy a BERT Keras model with a custom object:

model = tf.keras.models.model_from_json(json.load(open("model.json")), 
                                        custom_objects={"BertLayer": BertLayer})

Did you finally manage to do it? If yes, could you help me? I'm failing to export the Keras model to the TensorFlow ProtoBuf format.
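For what it's worth, here is a rough sketch of one way to export a TF 1.x Keras model with a custom BertLayer to the SavedModel (ProtoBuf) format that the TF Serving container reads. The file names ('model.json', 'model_weights.h5', 'export/1') are placeholders, and the init/load order follows what was described earlier in this thread, so treat it as an untested starting point rather than a verified recipe:

    import tensorflow as tf

    # Placeholders: adjust the file names; BertLayer is the custom layer from this thread.
    model = tf.keras.models.model_from_json(open('model.json').read(),
                                            custom_objects={'BertLayer': BertLayer})

    sess = tf.keras.backend.get_session()
    # Initialize variables and tables (needed by the TF Hub BERT layer), then
    # restore the trained weights on top of them.
    sess.run([tf.local_variables_initializer(),
              tf.global_variables_initializer(),
              tf.tables_initializer()])
    model.load_weights('model_weights.h5')

    # Export as a SavedModel; the numeric version directory is what TF Serving expects.
    tf.saved_model.simple_save(
        sess,
        'export/1',
        inputs={inp.name.split(':')[0]: inp for inp in model.inputs},
        outputs={out.name.split(':')[0]: out for out in model.outputs},
        # Re-run the table initializer when the model is loaded for serving.
        legacy_init_op=tf.group(tf.tables_initializer(), name='legacy_init_op'))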