TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Error while training the Tacotron-2 model! #578

Closed: 3rang closed this issue 3 years ago

3rang commented 3 years ago

I followed the commands as per the README, but I'm still facing issues. (I don't have a graphics card in my system, and I have installed all dependencies as per the documentation.) Please find the error log below.

The same issue was reported at this link; I tried the suggestions there as well, but they didn't work for me.

Error log

2021-05-28 10:43:11.222119: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-05-28 10:43:11.222256: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-28 10:43:14.295663: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-05-28 10:43:14.456815: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-05-28 10:43:14.456918: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (cmtcleu60845047): /proc/driver/nvidia/version does not exist
/lhome/tarpate/.local/lib/python3.6/site-packages/numba/errors.py:137: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9
  warnings.warn(msg)
2021-05-28 10:43:17.625769: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-28 10:43:17.656911: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2099940000 Hz
2021-05-28 10:43:17.657843: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d71920 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-28 10:43:17.657875: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-05-28 10:43:52,920 (train_tacotron2:422) INFO: hop_size = 256
2021-05-28 10:43:52,920 (train_tacotron2:422) INFO: format = npy
2021-05-28 10:43:52,920 (train_tacotron2:422) INFO: model_type = tacotron2
2021-05-28 10:43:52,920 (train_tacotron2:422) INFO: tacotron2_params = {'dataset': 'ljspeech', 'embedding_hidden_size': 512, 'initializer_range': 0.02, 'embedding_dropout_prob': 0.1, 'n_speakers': 1, 'n_conv_encoder': 5, 'encoder_conv_filters': 512, 'encoder_conv_kernel_sizes': 5, 'encoder_conv_activation': 'relu', 'encoder_conv_dropout_rate': 0.5, 'encoder_lstm_units': 256, 'n_prenet_layers': 2, 'prenet_units': 256, 'prenet_activation': 'relu', 'prenet_dropout_rate': 0.5, 'n_lstm_decoder': 1, 'reduction_factor': 1, 'decoder_lstm_units': 1024, 'attention_dim': 128, 'attention_filters': 32, 'attention_kernel': 31, 'n_mels': 80, 'n_conv_postnet': 5, 'postnet_conv_filters': 512, 'postnet_conv_kernel_sizes': 5, 'postnet_dropout_rate': 0.1, 'attention_type': 'lsa'}
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: batch_size = 32
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: remove_short_samples = True
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: allow_cache = True
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: mel_length_threshold = 32
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: is_shuffle = True
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: use_fixed_shapes = True
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 1e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: gradient_accumulation_steps = 1
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: var_train_expr = None
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: train_max_steps = 200000
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: save_interval_steps = 2000
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: eval_interval_steps = 500
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: log_interval_steps = 200
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: start_schedule_teacher_forcing = 200001
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: start_ratio_value = 0.5
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: schedule_decay_steps = 50000
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: end_ratio_value = 0.0
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: num_save_intermediate_results = 1
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: train_dir = ./dump_ljspeech/train/
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: dev_dir = ./dump_ljspeech/valid/
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: use_norm = True
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: outdir = ./examples/tacotron2/exp/train.tacotron2.v1/
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: config = ./examples/tacotron2/conf/tacotron2.v1.yaml
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: resume = 
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: verbose = 1
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: mixed_precision = False
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: pretrained = 
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: version = 0.0
2021-05-28 10:43:52,921 (train_tacotron2:422) INFO: max_mel_length = 870
2021-05-28 10:43:52,922 (train_tacotron2:422) INFO: max_char_length = 188
Model: "tacotron2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
encoder (TFTacotronEncoder)  multiple                  8218624   
_________________________________________________________________
decoder_cell (TFTacotronDeco multiple                  18246402  
_________________________________________________________________
post_net (TFTacotronPostnet) multiple                  5460480   
_________________________________________________________________
residual_projection (Dense)  multiple                  41040     
=================================================================
Total params: 31,966,546
Trainable params: 31,956,306
Non-trainable params: 10,240
_________________________________________________________________
[train]:   0%|                                                                                                                | 0/200000 [00:00<?, ?it/s]2021-05-28 10:43:58.743720: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:906] Skipping loop optimization for Merge node with control input: cond/branch_executed/_8
2021-05-28 10:44:08.686238: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 9887 of 12445
2021-05-28 10:44:11.700000: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
Traceback (most recent call last):
  File "examples/tacotron2/train_tacotron2.py", line 514, in <module>
    main()
  File "examples/tacotron2/train_tacotron2.py", line 506, in main
    resume=args.resume,
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "examples/tacotron2/train_tacotron2.py", line 109, in _train_step
    self.one_step_forward(batch)
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/lhome/tarpate/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.CancelledError:    [_Derived_]Loop execution was cancelled.
     [[{{node while_19/LoopCond/_54}}]]
     [[tacotron2/encoder/bilstm/backward_lstm/PartitionedCall]] [Op:__inference__one_step_forward_23971]

Function call stack:
_one_step_forward -> _one_step_forward -> _one_step_forward

[train]:   0%|                                                                                                                | 0/200000 [00:27<?, ?it/s]
ZDisket commented 3 years ago

The log says:

2021-05-28 10:43:11.222119: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory

Looks like it's failing to find some CUDA libs. I always had this error when training on a V100 instance unless I pasted this into the terminal:

export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

You might have to adjust the path depending on where your libs are located, but normally it should work like that. Paste and execute the command, then try training again.
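For reference, here is a minimal sketch (assuming TF 2.3, which links against CUDA 10.1's libcudart.so.10.1) to check whether the runtime library resolves after setting LD_LIBRARY_PATH:

```python
import ctypes

# Try to load the same CUDA runtime library TensorFlow dlopens at startup.
# If this raises OSError, LD_LIBRARY_PATH still doesn't point at the libs.
try:
    ctypes.CDLL("libcudart.so.10.1")
    print("libcudart.so.10.1 loaded OK")
except OSError as e:
    print("still missing:", e)
```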

dathudeptrai commented 3 years ago

@3rang I saw this bug in your log:

Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory

Can you fix this and run it again?
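For context, a quick way to confirm what TensorFlow actually sees (on a CPU-only machine like the reporter's, this prints an empty GPU list, in which case the libcudart warning is expected and harmless):

```python
import tensorflow as tf

# Report whether this TensorFlow build has CUDA support compiled in,
# and which physical GPUs are visible to the runtime.
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```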

ruianfrp commented 3 years ago

When I use the GPU for training, it reports an error. Is my graphics card too low-end? The error is in the screenshot below.

[screenshot: error message]

My setup: GTX 1650, Python 3.7, CUDA 10.1, cuDNN 7.6.5, tensorflow-gpu 2.3.

After that, I searched online for solutions, but none of them resolved the problem. So I set the "batch_size" parameter as small as possible; training then starts to run, but after a while it reports an error and stops. The error is in the screenshot below.

[screenshot: error message]
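If the second failure is a GPU out-of-memory abort (the screenshots are not legible here, but that is the usual failure mode on a 4 GB card like the GTX 1650), a common mitigation is to enable memory growth before the model is built, e.g. near the top of train_tacotron2.py. A minimal sketch:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front;
# this often avoids out-of-memory aborts on small cards.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```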

dathudeptrai commented 3 years ago

@ruianfrp what is the total memory of your GPU?
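A quick way to answer that question, assuming the NVIDIA driver and its nvidia-smi tool are installed:

```python
import subprocess

# Print the name and total memory of each visible NVIDIA GPU.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"],
    text=True,
))
```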

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.