coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Bug: Batch size check breaks training with `--train_cudnn true` due to leftover Tensors in the check graph #2110

Status: Closed (reuben closed this issue 2 years ago)

reuben commented 2 years ago
D Session closed.
I Dummy run finished without problems, now starting real training process.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/code/training/coqui_stt_training/train.py", line 723, in <module>
    main()
  File "/code/training/coqui_stt_training/train.py", line 693, in main
    train()
  File "/code/training/coqui_stt_training/train.py", line 332, in train
    train_impl(epochs=Config.epochs, silent_load=True)
  File "/code/training/coqui_stt_training/train.py", line 390, in train_impl
    iterator, optimizer, dropout_rates
  File "/code/training/coqui_stt_training/train.py", line 172, in get_tower_results
    iterator, dropout_rates, reuse=i > 0
  File "/code/training/coqui_stt_training/train.py", line 90, in calculate_mean_edit_distance_and_loss
    batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 232, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 135, in rnn_impl_cudnn_rnn
    inputs=x, sequence_lengths=seq_length
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:
    relative to /usr/local/lib/python3.6/dist-packages/tensorflow_core:

    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:440 call
        training)
    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:518 _forward
        seed=self._seed)
    contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1132 _cudnn_rnn
        outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
    python/ops/gen_cudnn_rnn_ops.py:2051 cudnn_rnnv3
        time_major=time_major, name=name)
    python/framework/op_def_library.py:367 _apply_op_helper
        g = ops._get_graph_from_inputs(_Flatten(keywords.values()))
    python/framework/ops.py:5979 _get_graph_from_inputs
        _assert_same_graph(original_graph_element, graph_element)
    python/framework/ops.py:5914 _assert_same_graph
        (item, original_item))

    ValueError: Tensor("cudnn_lstm/opaque_kernel:0", dtype=float32_ref, device=/device:GPU:0) must be from the same graph as Tensor("tower_0/Reshape_2:0", shape=(?, ?, 2048), dtype=float32, device=/device:GPU:0).
reuben commented 2 years ago

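Workaround: comment out the batch size check in `training/coqui_stt_training/train.py` (lines 324-331 at the commit linked here), so that only the real training graph is built: https://github.com/coqui-ai/STT/blob/49beaf51ebade3dc2f9d0985ee0041d389bd2dfd/training/coqui_stt_training/train.py#L324-L331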
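
For readers unfamiliar with this class of error, here is a toy, pure-Python sketch of the failure mode that the title and the `must be from the same graph` message point to: an object that creates its weights on first use is built during the check run, then reused when the real training graph is constructed. Nothing below is TensorFlow or STT code. `Graph`, `Tensor`, `CachedLayer`, `build_model`, `reset_cache` and `run_demo` are all hypothetical names made up for illustration, and the assumption that the cuDNN layer is cached across graph builds is inferred from the error, not confirmed here.

```python
class Graph:
    """Hypothetical stand-in for a TF1-style graph; it only carries a name."""

    def __init__(self, name):
        self.name = name


class Tensor:
    """Hypothetical stand-in for a tensor that remembers its owning graph."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph


class CachedLayer:
    """Toy layer that creates its kernel on the first call and keeps it."""

    def __init__(self):
        self.kernel = None

    def __call__(self, inputs):
        if self.kernel is None:
            # The kernel is bound to whichever graph is used first.
            self.kernel = Tensor("opaque_kernel", inputs.graph)
        if self.kernel.graph is not inputs.graph:
            raise ValueError(
                f"{self.kernel.name} must be from the same graph as {inputs.name}"
            )
        return Tensor("output", inputs.graph)


_layer = None  # assumed module-level cache that survives across graph builds


def build_model(graph):
    """Build the toy model in `graph`, reusing the cached layer if present."""
    global _layer
    if _layer is None:
        _layer = CachedLayer()
    return _layer(Tensor("inputs", graph))


def reset_cache():
    """Drop the cached layer so the next build creates a fresh one."""
    global _layer
    _layer = None


def run_demo(reset_between_graphs):
    """Build a check graph, then a training graph, and report the outcome."""
    reset_cache()
    build_model(Graph("check"))  # dummy run used for the batch size check
    if reset_between_graphs:
        reset_cache()
    try:
        build_model(Graph("train"))  # real training graph
    except ValueError as err:
        return f"failed: {err}"
    return "ok"


if __name__ == "__main__":
    print(run_demo(reset_between_graphs=False))
    print(run_demo(reset_between_graphs=True))
```

In this toy model, skipping the check run (the workaround above) and clearing the cached object between the two builds both avoid the mismatch. Whether the second option applies to the real code depends on how the cuDNN layer is actually stored, which this sketch does not establish.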