I was trying to train a Conformer model on LibriSpeech with MirroredStrategy, but training hangs right after cuDNN is loaded three times, once per GPU. The horizontal progress bar should appear soon after that, but it never does and the process waits forever. This may be related to the NCCL library, but I do not see any NCCL-related error or warning in the log. What could be causing this?
Below are my training log, train.py, and config, in that order.
2022-09-08 21:45:04.762548: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-09-08 21:45:06.786738: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-08 21:45:09.329075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22190 MB memory: -> device: 0, name: NVIDIA A30, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-09-08 21:45:09.330450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22190 MB memory: -> device: 1, name: NVIDIA A30, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-09-08 21:45:09.331723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22190 MB memory: -> device: 2, name: NVIDIA A30, pci bus id: 0000:ca:00.0, compute capability: 8.0
INFO:tensorflow:Use RNNT loss in TensorFlow
WARNING:tensorflow:Please provide a TPU Name to connect to.
INFO:tensorflow:Run on 3 Physical GPUs
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2')
INFO:tensorflow:Loading subwords ...
INFO:tensorflow:Reading /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv ...
2022-09-08 21:45:15.598126: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
WARNING:tensorflow:Using a while_loop for converting IO>AudioResample
INFO:tensorflow:Reading /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv ...
WARNING:tensorflow:Using a while_loop for converting IO>AudioResample
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Model: "conformer_encoder"
____________________________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================================
conformer_encoder_subsampling (Conv2dSubsam multiple 188208
pling)
conformer_encoder_pe (PositionalEncoding) multiple 0
conformer_encoder_linear (Dense) multiple 414864
conformer_encoder_dropout (Dropout) multiple 0
conformer_encoder_block_0 (ConformerBlock) multiple 506736
conformer_encoder_block_1 (ConformerBlock) multiple 506736
conformer_encoder_block_2 (ConformerBlock) multiple 506736
conformer_encoder_block_3 (ConformerBlock) multiple 506736
conformer_encoder_block_4 (ConformerBlock) multiple 506736
conformer_encoder_block_5 (ConformerBlock) multiple 506736
conformer_encoder_block_6 (ConformerBlock) multiple 506736
conformer_encoder_block_7 (ConformerBlock) multiple 506736
conformer_encoder_block_8 (ConformerBlock) multiple 506736
conformer_encoder_block_9 (ConformerBlock) multiple 506736
conformer_encoder_block_10 (ConformerBlock) multiple 506736
conformer_encoder_block_11 (ConformerBlock) multiple 506736
conformer_encoder_block_12 (ConformerBlock) multiple 506736
conformer_encoder_block_13 (ConformerBlock) multiple 506736
conformer_encoder_block_14 (ConformerBlock) multiple 506736
conformer_encoder_block_15 (ConformerBlock) multiple 506736
====================================================================================================
Total params: 8,710,848
Trainable params: 8,706,240
Non-trainable params: 4,608
____________________________________________________________________________________________________
Model: "conformer_prediction"
____________________________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================================
conformer_prediction_embedding (Embedding) multiple 329600
conformer_prediction_dropout (Dropout) multiple 0
conformer_prediction_ln_0 (LayerNormalizati multiple 640
on)
conformer_prediction_lstm_0 (LSTM) multiple 820480
====================================================================================================
Total params: 1,150,720
Trainable params: 1,150,720
Non-trainable params: 0
____________________________________________________________________________________________________
Model: "conformer_joint"
____________________________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================================
conformer_joint_tanh (Activation) multiple 0
conformer_joint_enc (Dense) multiple 46400
conformer_joint_pred (Dense) multiple 102400
conformer_joint_enc_reshape (TransducerJoin multiple 0
tReshape)
conformer_joint_pred_reshape (TransducerJoi multiple 0
ntReshape)
conformer_joint_add (Add) multiple 0
conformer_joint_vocab (Dense) multiple 330630
====================================================================================================
Total params: 479,430
Trainable params: 479,430
Non-trainable params: 0
____________________________________________________________________________________________________
Model: "conformer"
____________________________________________________________________________________________________
Layer (type) Output Shape Param #
====================================================================================================
conformer_encoder (ConformerEncoder) multiple 8710848
conformer_prediction (TransducerPrediction) multiple 1150720
conformer_joint (TransducerJoint) multiple 479430
====================================================================================================
Total params: 10,340,998
Trainable params: 10,336,390
Non-trainable params: 4,608
____________________________________________________________________________________________________
WARNING:tensorflow:The argument `steps_per_execution` is no longer experimental. Pass `steps_per_execution` instead of `experimental_steps_per_execution`.
WARNING:tensorflow:`tf.keras.callbacks.experimental.BackupAndRestore` endpoint is deprecated and will be removed in a future release. Please use `tf.keras.callbacks.BackupAndRestore`.
2022-09-08 21:45:23.949596: I tensorflow/core/profiler/lib/profiler_session.cc:99] Profiler session initializing.
2022-09-08 21:45:23.949617: I tensorflow/core/profiler/lib/profiler_session.cc:114] Profiler session started.
2022-09-08 21:45:23.949649: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1665] Profiler found 3 GPUs
2022-09-08 21:45:24.331748: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session tear down.
2022-09-08 21:45:24.331912: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1799] CUPTI activity buffer flushed
2022-09-08 21:45:24.380570: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
key: "Toutput_types"
value {
list {
type: DT_STRING
}
}
}
attr {
key: "_cardinality"
value {
i: 28539
}
}
attr {
key: "is_files"
value {
b: false
}
}
attr {
key: "metadata"
value {
s: "\n\024TensorSliceDataset:0"
}
}
attr {
key: "output_shapes"
value {
list {
shape {
dim {
size: 3
}
}
}
}
}
experimental_type {
type_id: TFT_PRODUCT
args {
type_id: TFT_DATASET
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_STRING
}
}
}
}
}
Epoch 1/5
INFO:tensorflow:batch_all_reduce: 560 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:batch_all_reduce: 560 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
2022-09-08 21:47:40.630612: W tensorflow/core/common_runtime/forward_type_inference.cc:231] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_LEGACY_VARIANT
}
}
}
is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
type_id: TFT_PRODUCT
args {
type_id: TFT_TENSOR
args {
type_id: TFT_FLOAT
}
}
}
while inferring type of node 'cond_21/output/_19'
2022-09-08 21:47:41.384646: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:43.155226: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:44.345331: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201
2022-09-08 21:47:47.313726: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
# Copyright 2020 Huy Le Nguyen (@usimarit)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import fire
import math
from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()
import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0,1,2],
    mxp: bool = False,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})
    strategy = env_util.setup_strategy(devices)

    config = Config(config)

    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )

    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )

    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()

    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        strategy=strategy,
        batch_size=bs,
    )

    with strategy.scope():
        conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
        conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
        if pretrained:
            conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
        conformer.summary(line_length=100)
        optimizer = tf.keras.optimizers.Adam(
            TransformerSchedule(
                d_model=conformer.dmodel,
                warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
                max_lr=(0.05 / math.sqrt(conformer.dmodel)),
            ),
            **config.learning_config.optimizer_config
        )
        conformer.compile(
            optimizer=optimizer,
            experimental_steps_per_execution=spx,
            global_batch_size=global_batch_size,
            blank=text_featurizer.blank,
        )

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]
    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    main()
    # fire.Fire(main)
# Copyright 2020 Huy Le Nguyen (@usimarit)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    num_epochs: 5
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
Environment: TF-GPU 2.9, Ubuntu 22.04, 3 x NVIDIA A30, Python 3.8
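One way to check whether NCCL is really involved is sketched below. This is only a diagnostic idea under stated assumptions, not a confirmed fix: `nccl_debug_env` and `make_strategy` are hypothetical helper names that are not part of TensorFlowASR, and `make_strategy` assumes the strategy returned by `env_util.setup_strategy(devices)` in train.py can be swapped for a plain `tf.distribute.MirroredStrategy`. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are NCCL environment variables, and `tf.distribute.HierarchicalCopyAllReduce` is a TensorFlow all-reduce implementation that does not go through NCCL.

```python
import os


def nccl_debug_env(base=None, disable_p2p=False):
    """Return a copy of an environment with NCCL logging switched on.

    Hypothetical helper for diagnosis only. NCCL reads these variables when
    it initializes, so they must be set before the training process starts.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # make NCCL print its setup and any errors
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # turn off direct GPU-to-GPU transfers
    return env


def make_strategy(devices, use_nccl=True):
    """Build a MirroredStrategy with or without NCCL all-reduce.

    Hypothetical replacement for env_util.setup_strategy(devices); if the
    hang disappears with use_nccl=False, NCCL is a likely suspect.
    """
    import tensorflow as tf  # imported lazily so the helper above works without TF

    names = [f"GPU:{i}" for i in devices]
    if use_nccl:
        ops = tf.distribute.NcclAllReduce()
    else:
        ops = tf.distribute.HierarchicalCopyAllReduce()
    return tf.distribute.MirroredStrategy(devices=names, cross_device_ops=ops)
```

For example, train.py could be launched with `subprocess.run([sys.executable, "train.py"], env=nccl_debug_env(disable_p2p=True))` to see NCCL's own log output, or `strategy = make_strategy([0, 1, 2], use_nccl=False)` could be used in place of the `setup_strategy` call to see whether the first step then completes.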