NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.54k stars 372 forks

Choppy Voice with Tacotron - GST pretrained checkpoint #512

Open gurpreet395 opened 4 years ago

gurpreet395 commented 4 years ago

Hi, I am trying to do inference using the pretrained model on a single-GPU machine. It gets the style right; however, the generated voice is of very bad quality. Please find attached the generated mel diagram and a sample of the audio. I am not able to figure out what the issue is. Do I need to finetune it? https://github.com/gurpreet395/downloads/blob/master/Output_step0_0_infer.png https://github.com/gurpreet395/downloads/blob/master/sample_step0_0_infer.wav

In the config file, I am using "mel" instead of "both". Here is a link to the model: https://drive.google.com/file/d/1IdWnUIwV9NMe-1JSvcv4Ti4HZ8wPEvQr/view
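For context, the setting being discussed is the spectrogram output type in the OpenSeq2Seq Tacotron config. The fragment below is a sketch, not a copy of the actual config file; the exact parameter nesting and surrounding keys may differ between releases:

```python
# Sketch of the relevant piece of an OpenSeq2Seq Tacotron-GST config
# (hypothetical fragment; real configs contain many more parameters).
# "output_type" controls what the decoder is trained to predict:
#   "mel"  -> mel spectrogram only; magnitude must be recovered afterwards
#   "both" -> mel plus a learned mel -> magnitude mapping, which usually
#             yields cleaner audio than inverting the mel filterbank
base_params = {
    # ... model, optimizer, and vocoder parameters ...
    "data_layer_params": {
        "output_type": "both",
        # ...
    },
}
```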

Following is the log from the inference job.

```
Inference Mode. Loss part of graph isn't built.
2019-11-05 00:40:05.085872: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2019-11-05 00:40:05.086239: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x6d63b30 executing computations on platform Host. Devices:
2019-11-05 00:40:05.086271: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-11-05 00:40:05.186390: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-05 00:40:05.187310: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x6d65c00 executing computations on platform CUDA. Devices:
2019-11-05 00:40:05.187336: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-11-05 00:40:05.187479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59 pciBusID: 0000:00:1e.0 totalMemory: 14.75GiB freeMemory: 14.65GiB
2019-11-05 00:40:05.187504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-05 00:40:05.510047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 00:40:05.510091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-11-05 00:40:05.510100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-11-05 00:40:05.510211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14164 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/util/decorator_utils.py:145: GraphKeys.VARIABLES (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.GraphKeys.GLOBAL_VARIABLES instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix.
WARNING: Can't compute number of objects per step, since train model does not define get_num_objects_per_step method.
2019-11-05 00:40:06.334238: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
Processed 1/1 batches
Processed 1/1 batches
Not enough steps for benchmarking
output_file is ignored for tts
results are logged to the logdir
Finished inference
```

blisc commented 4 years ago

Why not use "both"? The learned mel -> mag mapping would be better than mapping mel -> mag by using the transposed mel filter. It is also possible that the style wav is so different from the training data that the model cannot generate a good style token from it.