TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost state-of-the-art automatic speech recognition in TensorFlow 2. Supports languages that can be written with characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

Multi-GPU training with MirroredStrategy waits forever after loading cuDNN #267

Closed liuyibox closed 2 years ago

liuyibox commented 2 years ago

TF-GPU 2.9, Ubuntu 22.04, NVIDIA A30 x 3, Python 3.8

I was trying to train on LibriSpeech with MirroredStrategy, but training hangs forever after cuDNN is loaded three times, once per GPU card. Shortly after that point the progress bar should appear, yet it never does. Perhaps this is related to the NCCL library, but I did not see any NCCL-related error or warning messages in the log. What could be causing this?
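
To help narrow this down, I will rerun with NCCL's own debug logging enabled so that any stalled collective shows up in stderr. This is only a sketch of what I plan to set before the strategy is created (assuming NCCL reads these variables before the first all-reduce; the right place may differ with this repo's env_util.setup_environment):

import os

# NCCL prints its ring/tree setup and any peer-to-peer failures to stderr.
# If the hang is inside NCCL, the output usually stops right after the
# topology/setup messages for the three GPUs.
os.environ["NCCL_DEBUG"] = "INFO"
# Optional and much more verbose; helps pin down which subsystem stalls.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"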

Below are my training log, train.py, and config, respectively.

2022-09-08 21:45:04.762548: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-09-08 21:45:06.786738: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-08 21:45:09.329075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22190 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-09-08 21:45:09.330450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22190 MB memory:  -> device: 1, name: NVIDIA A30, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-09-08 21:45:09.331723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22190 MB memory:  -> device: 2, name: NVIDIA A30, pci bus id: 0000:ca:00.0, compute capability: 8.0
INFO:tensorflow:Use RNNT loss in TensorFlow
WARNING:tensorflow:Please provide a TPU Name to connect to.
INFO:tensorflow:Run on 3 Physical GPUs
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2')
INFO:tensorflow:Loading subwords ...
INFO:tensorflow:Reading /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv ...
2022-09-08 21:45:15.598126: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
WARNING:tensorflow:Using a while_loop for converting IO>AudioResample
INFO:tensorflow:Reading /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv ...
WARNING:tensorflow:Using a while_loop for converting IO>AudioResample
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Model: "conformer_encoder"
____________________________________________________________________________________________________
 Layer (type)                                Output Shape                            Param #        
====================================================================================================
 conformer_encoder_subsampling (Conv2dSubsam  multiple                               188208         
 pling)                                                                                             

 conformer_encoder_pe (PositionalEncoding)   multiple                                0              

 conformer_encoder_linear (Dense)            multiple                                414864         

 conformer_encoder_dropout (Dropout)         multiple                                0              

 conformer_encoder_block_0 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_1 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_2 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_3 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_4 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_5 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_6 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_7 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_8 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_9 (ConformerBlock)  multiple                                506736         

 conformer_encoder_block_10 (ConformerBlock)  multiple                               506736         

 conformer_encoder_block_11 (ConformerBlock)  multiple                               506736         

 conformer_encoder_block_12 (ConformerBlock)  multiple                               506736         

 conformer_encoder_block_13 (ConformerBlock)  multiple                               506736         

 conformer_encoder_block_14 (ConformerBlock)  multiple                               506736         

 conformer_encoder_block_15 (ConformerBlock)  multiple                               506736         

====================================================================================================
Total params: 8,710,848
Trainable params: 8,706,240
Non-trainable params: 4,608
____________________________________________________________________________________________________
Model: "conformer_prediction"
____________________________________________________________________________________________________
 Layer (type)                                Output Shape                            Param #        
====================================================================================================
 conformer_prediction_embedding (Embedding)  multiple                                329600         

 conformer_prediction_dropout (Dropout)      multiple                                0              

 conformer_prediction_ln_0 (LayerNormalizati  multiple                               640            
 on)                                                                                                

 conformer_prediction_lstm_0 (LSTM)          multiple                                820480         

====================================================================================================
Total params: 1,150,720
Trainable params: 1,150,720
Non-trainable params: 0
____________________________________________________________________________________________________
Model: "conformer_joint"
____________________________________________________________________________________________________
 Layer (type)                                Output Shape                            Param #        
====================================================================================================
 conformer_joint_tanh (Activation)           multiple                                0              

 conformer_joint_enc (Dense)                 multiple                                46400          

 conformer_joint_pred (Dense)                multiple                                102400         

 conformer_joint_enc_reshape (TransducerJoin  multiple                               0              
 tReshape)                                                                                          

 conformer_joint_pred_reshape (TransducerJoi  multiple                               0              
 ntReshape)                                                                                         

 conformer_joint_add (Add)                   multiple                                0              

 conformer_joint_vocab (Dense)               multiple                                330630         

====================================================================================================
Total params: 479,430
Trainable params: 479,430
Non-trainable params: 0
____________________________________________________________________________________________________
Model: "conformer"
____________________________________________________________________________________________________
 Layer (type)                                Output Shape                            Param #        
====================================================================================================
 conformer_encoder (ConformerEncoder)        multiple                                8710848        

 conformer_prediction (TransducerPrediction)  multiple                               1150720        

 conformer_joint (TransducerJoint)           multiple                                479430         

====================================================================================================
Total params: 10,340,998
Trainable params: 10,336,390
Non-trainable params: 4,608
____________________________________________________________________________________________________
WARNING:tensorflow:The argument `steps_per_execution` is no longer experimental. Pass `steps_per_execution` instead of `experimental_steps_per_execution`.
WARNING:tensorflow:`tf.keras.callbacks.experimental.BackupAndRestore` endpoint is deprecated and will be removed in a future release. Please use `tf.keras.callbacks.BackupAndRestore`.
2022-09-08 21:45:23.949596: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-09-08 21:45:23.949617: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
2022-09-08 21:45:23.949649: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1665] Profiler found 3 GPUs
2022-09-08 21:45:24.331748: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-09-08 21:45:24.331912: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
2022-09-08 21:45:24.380570: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 28539
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\024TensorSliceDataset:0"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 3
        }
      }
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_STRING
        }
      }
    }
  }
}

Epoch 1/5
INFO:tensorflow:batch_all_reduce: 560 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:batch_all_reduce: 560 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
2022-09-08 21:47:40.630612: W tensorflow/core/common_runtime/forward_type_inference.cc:231] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_LEGACY_VARIANT
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_FLOAT
    }
  }
}

    while inferring type of node 'cond_21/output/_19'
2022-09-08 21:47:41.384646: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:43.155226: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:44.345331: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:47.313726: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

# Copyright 2020 Huy Le Nguyen (@usimarit)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import fire
import math
from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()
import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")

def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0,1,2],
    mxp: bool = False,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})
    strategy = env_util.setup_strategy(devices)

    config = Config(config)

    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )

    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )

    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()

    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        strategy=strategy,
        batch_size=bs,
    )

    with strategy.scope():
        conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
        conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
        if pretrained:
            conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
        conformer.summary(line_length=100)
        optimizer = tf.keras.optimizers.Adam(
            TransformerSchedule(
                d_model=conformer.dmodel,
                warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
                max_lr=(0.05 / math.sqrt(conformer.dmodel)),
            ),
            **config.learning_config.optimizer_config
        )
        conformer.compile(
            optimizer=optimizer,
            experimental_steps_per_execution=spx,
            global_batch_size=global_batch_size,
            blank=text_featurizer.blank,
        )

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]

    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )

if __name__ == "__main__":
    main()  # runs with the defaults above; the fire CLI entry point below is disabled
#    fire.Fire(main)
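
Note that fire.Fire(main) is commented out, so running the script executes main() with the defaults shown (the config.yml next to the script, devices=[0, 1, 2], no mixed precision). To change arguments without the fire CLI, main can also be called directly; a hypothetical sketch (module name assumed from the train.py filename mentioned earlier, config path assumed):

# hypothetical direct invocation, assuming the script above is saved as train.py
from train import main

main(
    config="config.yml",   # assumed path to the YAML shown below
    devices=[0, 1],        # e.g. restrict training to two of the three A30s
    mxp=False,
)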

# Copyright 2020 Huy Le Nguyen (@usimarit)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    num_epochs: 5
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
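
For reference on the effective batch size: running_config.batch_size above is the per-replica batch, and if prepare_training_data_loaders follows the usual tf.distribute convention of multiplying it by the number of replicas (an assumption about that helper, not verified here), the global batch size with three GPUs works out as:

per_replica_batch = 2                      # running_config.batch_size above
num_replicas = 3                           # one per NVIDIA A30
global_batch_size = per_replica_batch * num_replicas
print(global_batch_size)                   # 6 samples per optimizer step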
liuyibox commented 2 years ago

I can now train with multiple GPUs after referring to this issue, so I'm closing it.
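
For anyone who hits the same hang: one workaround that is commonly suggested when MirroredStrategy stalls at its first NCCL all-reduce is to switch the cross-device op away from NCCL. In this script that would mean replacing the env_util.setup_strategy(devices) call with an explicit strategy, roughly as below (a sketch only, not necessarily the change that resolved this particular issue):

import tensorflow as tf

# HierarchicalCopyAllReduce aggregates gradients without NCCL, which helps
# isolate (or work around) a hang in the NCCL collectives.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1", "/gpu:2"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
)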