cybertronai / gradient-checkpointing

Make huge neural nets fit in memory
MIT License
2.71k stars 270 forks source link

Rerunning from Checkpoint Gives Error #39

Open ghost opened 5 years ago

ghost commented 5 years ago

When attempting to restart learning from a checkpoint, I get the following error (please let me know what other information I can provide):

Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1519, in main
    'model_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 97, in _serialize_model
    weight_values = K.batch_get_value(symbolic_weights)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2420, in batch_get_value
    return get_session().run(ops)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv01/kernel':
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1390, in main
    train_model = load_model( args.checkpoint )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "C:\Program Files\Python36\lib\site-packages\keras\utils\generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\convolutional.py", line 141, in build
    constraint=self.kernel_constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 252, in add_weight
    constraint=constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 402, in variable
    v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 259, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 220, in _variable_v1_call
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 198, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2511, in default_variable_creator
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 263, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1568, in __init__
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1728, in _init_from_args
    name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\state_ops.py", line 79, in variable_op_v2
    shared_name=shared_name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 2024, in variable_v2
    shared_name=shared_name, name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
yaroslavvb commented 5 years ago

Hm, I'm assuming you ran tf.initialize_all_variables()?

ghost commented 5 years ago

If you initalized variables, wouldn't that overwrite the saved weights and biases in the loaded model?