@liuyibox Can you share the config?
Below is my config. By the way, I removed the mirrorStrategy from the code because I could not get it to run with the mirrorStrategy, so that may be the cause of the error, since the original code compiles the graph inside `strategy.scope()`. I also removed the strategy parts from the train.py file.
```yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 1
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}
      save_best_only: False
      save_weights_only: False
      save_freq: epoch
      verbose: 1
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
```
@liuyibox There are still some issues when using `save_weights_only: False` (I'm working on this), so you should use `save_weights_only: True` to store only the weights in the checkpoints.

The mirrorStrategy can work with 1 GPU. If you have multiple GPUs, you can pass `--devices=[0,1]` to use only GPUs 0 and 1, or `--devices=[0]` / `--devices=[1]` to use the single corresponding GPU.
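In terms of the Keras callback that your `checkpoint` config section ends up building, the change amounts to the following (a minimal sketch, assuming the config keys map one-to-one onto `tf.keras.callbacks.ModelCheckpoint` arguments; the filepath is copied from your config above):

```python
import tensorflow as tf

# Minimal sketch: same checkpoint settings as the config above,
# with only save_weights_only flipped to True.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="/home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}",
    save_best_only=False,
    save_weights_only=True,  # store only weights, avoiding full-model serialization
    save_freq="epoch",
    verbose=1,
)
```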
Thanks @usimarit. Here is my train.py. With the mirrorStrategy, the training process waits forever after loading the cuDNN library three times (once for each of the three GPU cards) and never reaches the horizontal progress bar. While it waits, GPU memory and utilization are fully saturated even though I installed NCCL, so I think the GPUs are busy with something else. So I had to remove the strategy. Any hints on why the mirrorStrategy keeps waiting and never proceeds with training? The current train.py can run on only the first GPU card, i.e., device [0]. (A sketch of the strategy wrapping I removed follows the script below.)
```python
import os
import fire
import math

from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()

import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0, 1, 2],
    mxp: bool = True,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
    # tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})
    config = Config(config)
    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )
    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )
    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()
    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        batch_size=bs,
    )
    # Model creation and compilation (originally wrapped in strategy.scope(), removed here)
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    if pretrained:
        conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config,
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
        run_eagerly=True,  # see the reply below about eager mode
    )
    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]
    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    # Note: set after TensorFlow is imported; depending on when TF initializes
    # the GPUs, this may have no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    main()
```
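The strategy parts I removed wrapped the model creation and compilation in `strategy.scope()`. A minimal sketch of that standard `tf.distribute` pattern (not the exact original code) would look like this, using the same names as in the script above:

```python
# Minimal sketch, assuming the standard tf.distribute pattern:
# model creation, optimizer, and compile() all run inside strategy.scope().
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(
        speech_featurizer.shape,
        prediction_shape=text_featurizer.prepand_shape,
        batch_size=global_batch_size,
    )
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config,
    )
    conformer.compile(
        optimizer=optimizer,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
    )
```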
@liuyibox Did the training get past the point where the model's summary is printed (when the mirror strategy is applied)?
The `run_eagerly=True` means the model training is not wrapped in `tf.function`, which slows down training; eager execution should be used for debugging only.
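So for normal training you would compile without eager execution, e.g. (the same `compile()` call as in the script above, with only the flag changed):

```python
conformer.compile(
    optimizer=optimizer,
    experimental_steps_per_execution=spx,
    global_batch_size=global_batch_size,
    blank=text_featurizer.blank,
    run_eagerly=False,  # default; lets Keras wrap the train step in tf.function
)
```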
This issue is solved when I use `save_weights_only: True`. I will open another issue for the mirrorStrategy training problem. Thank you.
TF-GPU 2.9, Ubuntu 22.04, Nvidia A30 x 3, Python 3.8
When I was trying to run through the pipeline of training the Conformer, I cannot save the model at the end of the first epoch. It reports the error:

`TypeError: Unable to serialize 144.0 to JSON. Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.`

Does this mean I should compile the TensorFlow model with `run_eagerly` set to True? How can I save the model when it holds a TensorFlow eager tensor variable? The training log is below, thank you. There is also another minor error saying `Type inference failed. This indicates an invalid graph that escaped type checking`. Is this minor error related to the saving error?
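For what it's worth, the error seems to boil down to JSON-serializing an EagerTensor. A tiny standalone illustration of the same kind of failure (the value 144.0 is just the number from my log; the plain `json` module's wording differs slightly from TF's):

```python
import json

import tensorflow as tf

value = tf.constant(144.0)  # an EagerTensor, like the 144.0 in the error above

try:
    json.dumps({"dmodel": value})  # fails: EagerTensor is not JSON serializable
except TypeError as e:
    print(e)

# Converting to a plain Python float serializes fine:
print(json.dumps({"dmodel": float(value)}))
```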