@liuyibox Can you share the config?
Below is my config. By the way, I removed the mirrorStrategy from the code because I could not get it to run with the mirrorStrategy, so that may be the cause of the error, since the original code compiles the graph inside `strategy.scope()`. I also removed the strategy parts from the train.py file.
```yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 1
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}
      save_best_only: False
      save_weights_only: False
      save_freq: epoch
      verbose: 1
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
```
@liuyibox There are still some issues when using `save_weights_only: False` (I'm working on this), so you should use `save_weights_only: True` to store only the weights in the checkpoints.

The mirrorStrategy can work with 1 GPU. If you have multiple GPUs, you can pass `--devices=[0,1]` to use only GPUs 0 and 1, or `--devices=[0]` / `--devices=[1]` to use the single corresponding GPU.
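In terms of the Keras callback that your `checkpoint` config section ends up building, the change amounts to the following (a minimal sketch, assuming the config keys map one-to-one onto `tf.keras.callbacks.ModelCheckpoint` arguments; the filepath is copied from your config above):

```python
import tensorflow as tf

# Minimal sketch: same checkpoint settings as the config above,
# with only save_weights_only flipped to True.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="/home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}",
    save_best_only=False,
    save_weights_only=True,  # store only weights, avoiding full-model serialization
    save_freq="epoch",
    verbose=1,
)
```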
Thanks @usimarit. Here is my train.py. With the mirrorStrategy, the training process waits forever after loading the cuDNN library three times (once for each of the three GPU cards) and never reaches the horizontal progress bar. While it waits, GPU memory and utilization are fully saturated even though I installed NCCL, so I think the GPUs are busy with something else. So I had to remove the strategy. Any hints on why the mirrorStrategy keeps waiting and never proceeds with training? The current train.py can run on only the first GPU card, i.e., device [0]. (A sketch of the strategy wrapping I removed follows the script below.)
```python
import os
import fire
import math

from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()

import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0, 1, 2],
    mxp: bool = True,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
    # tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})
    config = Config(config)
    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )
    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )
    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()
    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        batch_size=bs,
    )
    # Model creation and compilation (originally wrapped in strategy.scope(), removed here)
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    if pretrained:
        conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config,
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
        run_eagerly=True,  # see the reply below about eager mode
    )
    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]
    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    # Note: set after TensorFlow is imported; depending on when TF initializes
    # the GPUs, this may have no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    main()
```
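The strategy parts I removed wrapped the model creation and compilation in `strategy.scope()`. A minimal sketch of that standard `tf.distribute` pattern (not the exact original code) would look like this, using the same names as in the script above:

```python
# Minimal sketch, assuming the standard tf.distribute pattern:
# model creation, optimizer, and compile() all run inside strategy.scope().
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(
        speech_featurizer.shape,
        prediction_shape=text_featurizer.prepand_shape,
        batch_size=global_batch_size,
    )
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config,
    )
    conformer.compile(
        optimizer=optimizer,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
    )
```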
@liuyibox Did the training get past the point where the model's summary is printed (when the mirror strategy is applied)?
The `run_eagerly=True` means the model training is not wrapped in `tf.function`, which slows down training; eager execution should be used for debugging only.
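So for normal training you would compile without eager execution, e.g. (the same `compile()` call as in the script above, with only the flag changed):

```python
conformer.compile(
    optimizer=optimizer,
    experimental_steps_per_execution=spx,
    global_batch_size=global_batch_size,
    blank=text_featurizer.blank,
    run_eagerly=False,  # default; lets Keras wrap the train step in tf.function
)
```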
This issue is solved when I use `save_weights_only: True`. I will open another issue for the mirrorStrategy training problem. Thank you.
TF-GPU 2.9, Ubuntu 22.04, Nvidia A30 x 3, Python 3.8
When I was trying to run through the pipeline of training the Conformer, I cannot save the model at the end of the first epoch. It reports the error:

`TypeError: Unable to serialize 144.0 to JSON. Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.`

Does this mean I should compile the TensorFlow model with `run_eagerly` set to True? How can I save the model when it holds a TensorFlow eager tensor variable? The training log is below, thank you. There is also another minor error saying `Type inference failed. This indicates an invalid graph that escaped type checking`. Is this minor error related to the saving error?
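For what it's worth, the error seems to boil down to JSON-serializing an EagerTensor. A tiny standalone illustration of the same kind of failure (the value 144.0 is just the number from my log; the plain `json` module's wording differs slightly from TF's):

```python
import json

import tensorflow as tf

value = tf.constant(144.0)  # an EagerTensor, like the 144.0 in the error above

try:
    json.dumps({"dmodel": value})  # fails: EagerTensor is not JSON serializable
except TypeError as e:
    print(e)

# Converting to a plain Python float serializes fine:
print(json.dumps({"dmodel": float(value)}))
```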