aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

TensorFlow model.save() fails in SM script mode #1478

Open kafka399 opened 4 years ago

kafka399 commented 4 years ago

Hello,

The following setup fails on model saving:

Calling script file:

tf_estimator = TensorFlow(entry_point='script_one.py',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p3.2xlarge',
                      framework_version='2.3.0',
                      py_version='py37',
                      script_mode=True,
                      train_use_spot_instances=True,
                      train_max_wait=36000,
                      train_max_run=36000,
                      hyperparameters={
                          'dropout': 0.22350414495308987,
                          'epochs': 33,
                          'batch-size': 657,
                          'learning-rate': 0.01}
                     )

Last part of the script file:

#tf.saved_model.save(model, os.path.join(model_dir, 'model/1'))
model.save(os.path.join(model_dir, '000000001'), 'my_model.h5')

Both methods fail in SageMaker with the following error, KeyError: 'callable_inputs':

2020-08-31 12:34:23 Uploading - Uploading generated training model/opt/ml/model/model/1
Traceback (most recent call last):
  File "script_one.py", line 210, in <module>
model.save(os.path.join(model_dir, '000000001'), 'my_model.h5')
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 2005, in save
  signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/save.py", line 134, in save_model
  signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save.py", line 80, in save
  save_lib.save(model, filepath, signatures, options)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 976, in save
  obj, export_dir, signatures, options, meta_graph_def)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 1047, in _build_meta_graph
  checkpoint_graph_view)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 75, in find_function_to_export
  functions = saveable_view.list_functions(saveable_view.root)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/saved_model/save.py", line 145, in list_functions
  self._serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 2616, in _list_functions_for_serialization
  Model, self)._list_functions_for_serialization(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 3019, in _list_functions_for_serialization
  .list_functions_for_serialization(serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/base_serialization.py", line 87, in list_functions_for_serialization
  fns = self.functions_to_serialize(serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 79, in functions_to_serialize
  serialization_cache).functions_to_serialize)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 95, in _get_serialized_attributes
  serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 57, in _get_serialized_attributes_internal
  serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 104, in _get_serialized_attributes_internal
  functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 155, in wrap_layer_functions
  original_fns = _replace_child_layer_functions(layer, serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 274, in _replace_child_layer_functions
  serialization_cache).functions)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 95, in _get_serialized_attributes
  serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/model_serialization.py", line 57, in _get_serialized_attributes_internal
  serialization_cache))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/layer_serialization.py", line 104, in _get_serialized_attributes_internal
  functions = save_impl.wrap_layer_functions(self.obj, serialization_cache)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 165, in wrap_layer_functions
  '{}_layer_call_and_return_conditional_losses'.format(layer.name))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 505, in add_function
  self.add_trace(*self._input_signature)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 420, in add_trace
  trace_with_training(True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 418, in trace_with_training
  fn.get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 549, in get_concrete_function
  return super(LayerCall, self).get_concrete_function(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 1167, in get_concrete_function
  concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 1073, in _get_concrete_function_garbage_collected
  self._initialize(args, kwargs, add_initializers_to=initializers)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 697, in _initialize
  *args, **kwds))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2855, in _get_concrete_function_internal_garbage_collected
  graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
  graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
  capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
  func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 600, in wrapped_fn
  return weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 515, in wrapper
  inputs = call_collection.get_input_arg_value(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/saving/saved_model/save_impl.py", line 455, in get_input_arg_value
  self._input_arg_name, args, kwargs, inputs_in_args=True)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 2536, in _get_call_arg_value
  return args_dict[arg_name]

KeyError: 'callable_inputs'
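
A side note on the save call itself, offered as a minimal sketch and not as a fix for the KeyError above: in Keras, the second positional parameter of model.save() is overwrite, so the 'my_model.h5' string in the call above is not treated as a file name. The sketch below only shows one way the versioned export directory could be built. The helper names parse_args and export_path and the --sm-model-dir argument name are hypothetical, and the '/opt/ml/model' fallback is an assumption so the code also runs outside a training container (inside it, SageMaker sets SM_MODEL_DIR).

```python
import argparse
import os


def parse_args(argv=None):
    """Read the local model directory (hypothetical argument name)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--sm-model-dir',
        # SM_MODEL_DIR is set by SageMaker in the training container;
        # the fallback is only an assumption for local runs.
        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'),
    )
    # parse_known_args ignores hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


def export_path(model_dir, version=1):
    """Build a zero-padded version directory such as <model_dir>/000000001."""
    return os.path.join(model_dir, '{:09d}'.format(version))


if __name__ == '__main__':
    args = parse_args()
    target = export_path(args.sm_model_dir)
    print(target)
    # With a trained Keras model in scope, a single path argument is enough:
    # model.save(target)
```

Passing only the directory keeps the call on the SavedModel code path that the traceback already shows, so it does not by itself change the failure reported here.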

kafka399 commented 4 years ago

If you are OK with using TF 2.1, here is a solution:

tf_estimator = TensorFlow(entry_point='script_one.py',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='ml.p3.2xlarge',
                      framework_version='2.1.0',
                      py_version='py3',
                      script_mode=True,
                      train_use_spot_instances=True,
                      train_max_wait=36000,
                      train_max_run=36000,
                      hyperparameters={
                          'dropout': 0.22350414495308987,
                          'epochs': 33,
                          'batch-size': 657,
                          'learning-rate': 0.01}
                     )

Now, the question remains why it doesn't work with TF 2.2 or 2.3.

hm-haitham commented 4 years ago

Are you using debugger callbacks? If so, try removing them.

svpino commented 4 years ago

Thanks, @hm-haitham, for the tip. I can confirm that disabling debugger callbacks fixes the issue. Using @kafka399's estimator as an example, here is what the fix looks like:

tf_estimator = TensorFlow(
    entry_point='script_one.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    framework_version='2.3.0',
    py_version='py37',
    script_mode=True,
    train_use_spot_instances=True,
    train_max_wait=36000,
    train_max_run=36000,
    debugger_hook_config=False,
    hyperparameters={
        'dropout': 0.22350414495308987,
        'epochs': 33,
        'batch-size': 657,
        'learning-rate': 0.01
    }
)

Notice the debugger_hook_config=False setting, which turns off the SageMaker Debugger hook for the training job.
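
If the same estimator needs to be created with and without the Debugger hook (for example, to compare TF 2.1 against 2.3), one option is to collect the arguments in a plain dict first. This is only a sketch: build_estimator_kwargs and its disable_debugger flag are hypothetical names, the role ARN is a placeholder, and the actual TensorFlow(**kwargs) call is left as a comment because it needs the sagemaker SDK and AWS credentials.

```python
def build_estimator_kwargs(role, framework_version='2.3.0',
                           py_version='py37', disable_debugger=True):
    """Return keyword arguments for the estimator used in this thread."""
    kwargs = {
        'entry_point': 'script_one.py',
        'role': role,
        'train_instance_count': 1,
        'train_instance_type': 'ml.p3.2xlarge',
        'framework_version': framework_version,
        'py_version': py_version,
        'script_mode': True,
        'train_use_spot_instances': True,
        'train_max_wait': 36000,
        'train_max_run': 36000,
        'hyperparameters': {
            'dropout': 0.22350414495308987,
            'epochs': 33,
            'batch-size': 657,
            'learning-rate': 0.01,
        },
    }
    if disable_debugger:
        # The workaround confirmed above: switch the Debugger hook off.
        kwargs['debugger_hook_config'] = False
    return kwargs


# Placeholder role ARN, for illustration only.
kwargs = build_estimator_kwargs(role='arn:aws:iam::123456789012:role/ExampleRole')
print(kwargs['debugger_hook_config'])  # False

# In a notebook with the SDK installed, the estimator would then be built as:
# from sagemaker.tensorflow import TensorFlow
# tf_estimator = TensorFlow(**kwargs)
```

Keeping the flag explicit makes it easy to remove the workaround later and check whether a newer container still raises the KeyError.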