Use CudnnGRU trained model on CPU

I am doing transfer learning on top of the speech2text model defined in ds2_large_8gpus_mp.py. The trained model works perfectly on a machine with a GPU, but it does not run on a CPU-only machine because CudnnGRU/LSTM support only GPU.

Is there any way to load the pretrained CudnnGRU layer weights into MultiRNNCell + CudnnCompatibleGRUCell? I tried setting the flag use_cudnn_rnn: False at inference time (on the assumption that the layers are compatible) and followed the Interactive_Infer_example.ipynb example, but I get a key error for the RNN-related layers in the checkpoint:

Caused by op 'save/RestoreV2', defined at:
  File "/usr/local/bin/gunicorn", line 11, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/wsgiapp.py", line 61, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/base.py", line 223, in run
    super(Application, self).run()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 203, in run
    self.manage_workers()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.5/dist-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/app/main.py", line 25, in <module>
    deepspeech = seq2seq.DeepSpeech()
  File "/app/api/models/seq2seq.py", line 27, in __init__
    saver = tf.train.Saver(vars_S2T)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 795, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key ForwardPass/ds2_encoder/bidirectional_rnn/bw/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/hidden_projection/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /app/api/models/seq2seq.py:27)  = RestoreV2[dtypes=[DT_HALF, DT_HALF, DT_HALF, DT_HALF, DT_HALF, ..., DT_HALF, DT_HALF, DT_HALF, DT_HALF, DT_HALF], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

You have to retrain with standard TensorFlow RNN cells and with dtype=tf.float32.

Hi Boris, thanks for the quick reply!

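From the TensorFlow docs here (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py#L73):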

  Cudnn RNNs have two major differences from other platform-independent RNNs tf
  provides:
  * Cudnn LSTM and GRU are mathematically different from their tf counterparts.
    (e.g. `tf.contrib.rnn.LSTMBlockCell` and `tf.nn.rnn_cell.GRUCell`).
  * Cudnn-trained checkpoints are not directly compatible with tf RNNs:
    * They use a single opaque parameter buffer for the entire (possibly)
      multi-layer multi-directional RNN; Whereas tf RNN weights are per-cell and
      layer.
    * The size and layout of the parameter buffers may change between
      CUDA/CuDNN/GPU generations. Because of that, the opaque parameter variable
      does not have a static shape and is not partitionable. Instead of using
      partitioning to alleviate the PS's traffic load, try building a
      multi-tower model and do gradient aggregation locally within the host
      before updating the PS. See https://www.tensorflow.org/performance/performance_models#parameter_server_variables
      for a detailed performance guide.
  Consequently, if one plans to use Cudnn trained models on both GPU and CPU
  for inference and training, one needs to:
  * Create a CudnnOpaqueParamsSaveable subclass object to save RNN params in
    canonical format. (This is done for you automatically during layer building
    process.)
  * When not using a Cudnn RNN class, use CudnnCompatibleRNN classes to load the
    checkpoints. These classes are platform-independent and perform the same
    computation as Cudnn for training and inference.
  Similarly, CudnnCompatibleRNN-trained checkpoints can be loaded by CudnnRNN
  classes seamlessly.
  Below is a typical workflow (using LSTM as an example):
  # Use Cudnn-trained checkpoints with CudnnCompatibleRNNs
  ```python
  with tf.Graph().as_default():
    lstm = CudnnLSTM(num_layers, num_units, direction, ...)
    outputs, output_states = lstm(inputs, initial_states, training=True)
    # If user plans to delay calling the cell with inputs, one can do
    # lstm.build(input_shape)
    saver = Saver()
    # training subgraph
    ...
    # Once in a while save the model.
    saver.save(save_path)
  # Inference subgraph for unidirectional RNN on, e.g., CPU or mobile.
  with tf.Graph().as_default():
    single_cell = lambda: tf.contrib.cudnn_rnn.CudnnCompatibleLSTM(num_units)
    # NOTE: Even if there's only one layer, the cell needs to be wrapped in
    # MultiRNNCell.
    cell = tf.nn.rnn_cell.MultiRNNCell(
      [single_cell() for _ in range(num_layers)])
    # Leave the scope arg unset.
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state, ...)
    saver = Saver()
    # Create session
    sess = ...
    # Restores
    saver.restore(sess, save_path)
  # Inference subgraph for bidirectional RNN
  with tf.Graph().as_default():
    single_cell = lambda: tf.contrib.cudnn_rnn.CudnnCompatibleLSTM(num_units)
    cells_fw = [single_cell() for _ in range(num_layers)]
    cells_bw = [single_cell() for _ in range(num_layers)]
    # Leave the scope arg unset.
    (outputs, output_state_fw,
     output_state_bw) = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
         cells_fw, cells_bw, inputs, ...)
    saver = Saver()
    # Create session
    sess = ...
    # Restores
    saver.restore(sess, save_path)
  ```

CudnnCompatibleGRUCell is used in ds2_encoder.py when "use_cudnn_rnn": False. Based on the docs quoted above, it seems possible to load the weights from a checkpoint trained with "use_cudnn_rnn": True. However, I am not sure that all the components used are compatible.

Can you confirm that retraining is the only viable solution?

Maybe there are other ways, but I am not aware of them :).


Understood! Thanks for looking into it. Kudos for the library!

As a follow-up, I was able to load the CudnnGRU-trained weights on CPU by:

1) changing use_cudnn_rnn to False in the config;
2) replacing bidirectional_dynamic_rnn and MultiRNNCell with stack_bidirectional_dynamic_rnn inside ds2_encoder.py;
3) adding tf.variable_scope("cudnn_gru") before constructing the RNN layers in ds2_encoder.py.

The steps above generate a TensorFlow graph that is compatible with CudnnGRU-trained checkpoints.

Relevant gist that was helpful: https://gist.github.com/melgor/41e7d9367410b71dfddc33db34cba85f
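For readers wondering why the scope name in step 3 matters: restoring only works if the variable names in the new graph match the keys stored in the checkpoint. Below is a rough, TensorFlow-free sketch of the names the modified graph would be expected to produce. The scope pieces (`cudnn_gru`, `stack_bidirectional_rnn`, `cell_N`, `bidirectional_rnn`, `fw`/`bw`) and the six per-cell variable names are my assumptions based on the default scopes of the functions named in the steps and on the key in the error message. The helper `expected_gru_variable_names` and the outer scope passed to it are hypothetical, so please verify against your own checkpoint.

```python
# Per-cell variables assumed for CudnnCompatibleGRUCell. The last entry
# matches the key from the error message; the rest are assumptions.
GRU_CELL_VARIABLES = (
    "gates/kernel",
    "gates/bias",
    "candidate/input_projection/kernel",
    "candidate/input_projection/bias",
    "candidate/hidden_projection/kernel",
    "candidate/hidden_projection/bias",
)


def expected_gru_variable_names(num_layers, outer_scope=""):
    """List the variable names a stacked bidirectional GRU graph built
    under tf.variable_scope("cudnn_gru") is assumed to create."""
    names = []
    for layer in range(num_layers):
        for direction in ("fw", "bw"):
            cell_scope = (
                "cudnn_gru/stack_bidirectional_rnn/cell_{}/"
                "bidirectional_rnn/{}/cudnn_compatible_gru_cell"
            ).format(layer, direction)
            for variable in GRU_CELL_VARIABLES:
                parts = [outer_scope, cell_scope, variable]
                # Skip the outer scope when it is empty.
                names.append("/".join(p for p in parts if p))
    return names


# "ForwardPass/ds2_encoder" is taken from the error message; whether the
# checkpoint uses the same outer scope is an assumption.
for name in expected_gru_variable_names(1, "ForwardPass/ds2_encoder"):
    print(name)
```

If the printed names differ from the keys in the checkpoint only by a prefix, that suggests the variable scope was opened at the wrong nesting level.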

@asydorchuk: Thank you for sharing your workaround. Could you share a bit more information on your step #3?

adding tf.variable_scope("cudnn_gru") before constructing RNN layers in ds2_encoder.py

Where exactly in ds2_encoder.py do you add tf.variable_scope("cudnn_gru")? Would you mind sharing your working code?

To be more specific, I got the following error after following your 3 steps:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key ForwardPass/ds2_encoder/bidirectional_rnn/bw/multi_rnn_cell/cell_0/cudnn_compatible_gru_cell/candidate/hidden_projection/bias not found in checkpoint
     [[node save/RestoreV2 (defined at <ipython-input-5-4444675cc8c1>:15) ]]
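One way to narrow down an error like this is to compare the variable names the graph asks for with the keys actually stored in the checkpoint. The sketch below is only illustrative: `diff_variable_names` is a hypothetical helper, the checkpoint-side name is an assumed example rather than something read from a real file, and the TensorFlow calls in the comments (`tf.train.list_variables`, `tf.global_variables`) are left unexecuted so the snippet runs without TensorFlow.

```python
def diff_variable_names(graph_names, checkpoint_names):
    """Return (missing, unused): names the graph wants but the checkpoint
    lacks, and names the checkpoint has but the graph never asks for."""
    graph_set, ckpt_set = set(graph_names), set(checkpoint_names)
    return sorted(graph_set - ckpt_set), sorted(ckpt_set - graph_set)


# In TensorFlow 1.x the two lists could be gathered roughly like this:
#   checkpoint_names = [n for n, _ in tf.train.list_variables(ckpt_path)]
#   graph_names = [v.op.name for v in tf.global_variables()]

# Illustrative values only. The graph-side name is the key from the error
# above; the checkpoint-side name is an assumption for demonstration.
graph_names = [
    "ForwardPass/ds2_encoder/bidirectional_rnn/bw/multi_rnn_cell/cell_0/"
    "cudnn_compatible_gru_cell/candidate/hidden_projection/bias",
]
checkpoint_names = [
    "ForwardPass/ds2_encoder/cudnn_gru/stack_bidirectional_rnn/cell_0/"
    "bidirectional_rnn/bw/cudnn_compatible_gru_cell/candidate/"
    "hidden_projection/bias",
]

missing, unused = diff_variable_names(graph_names, checkpoint_names)
for name in missing:
    print("graph expects, checkpoint lacks:", name)
for name in unused:
    print("checkpoint has, graph ignores:", name)
```

Note that the missing key in the error still contains bidirectional_rnn/bw/multi_rnn_cell, which suggests the graph being restored is still built with bidirectional_dynamic_rnn and MultiRNNCell, so step 2 may not have taken effect in the code that is actually running.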

NVIDIA / OpenSeq2Seq

Use CudnnGRU trained model on CPU #335
