espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Error running 'egs2/kss/tts1' recipe in ESPnet2 with KSS dataset #5118

Open wisev3 opened 1 year ago

wisev3 commented 1 year ago

Describe the bug

I followed the official instructions to install ESPnet2 and attempted to run the 'egs2/kss/tts1' recipe using the provided KSS dataset as an example. However, I encountered an error due to a size mismatch between the input and target tensors. Please refer to the error log file at the bottom of this report. I would appreciate any assistance in resolving this issue.

Additionally, I am looking for a beginner-friendly tutorial on fine-tuning TTS tasks using ESPnet2. Do you have any recommendations?

Basic environments:

 - OS information: Ubuntu 18.04
 - python version: 3.8.16
 - espnet version: espnet 202301
 - Git hash: 424b79100e107e028155af318521f5ece0b30497
   - Commit date:  Fri Apr 14 11:16:57 2023 -0400
 - pytorch version: 1.10.1

Environments from torch.utils.collect_env:

PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8.16 (default, Mar  2 2023, 03:21:46)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-99-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000

Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.10.1
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               h2bc3f7f_2
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py38h7f8727e_0
[conda] mkl_fft                   1.3.1            py38hd3c417c_0
[conda] mkl_random                1.2.2            py38h51133e4_0
[conda] numpy                     1.23.5           py38h14f4228_0
[conda] numpy-base                1.23.5           py38h31eccc5_0
[conda] pytorch                   1.10.1          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] pytorch-wpe               0.0.1                    pypi_0    pypi
[conda] torch-complex             0.4.3                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.1               py38_cu113    pytorch

Task information:

 - Task: TTS
 - Recipe: kss/tts1
 - ESPnet2

To Reproduce

Steps to reproduce the behavior:

  1. Move to the recipe directory, e.g., cd egs2/kss/tts1
  2. Execute the following command, which is provided in the README.
    --tts_task gan_tts \
    --fs 24000 \
    --fmin 0 \
    --fmax null \
    --n_fft 1024 \
    --n_shift 256 \
    --win_length null \
    --train_config conf/tuning/train_jets.yaml \
    --token_type phn \
    --g2p g2pk \
    --cleaner null \
    --ngpu 4

Error logs

train.2.log

sw005320 commented 1 year ago

Thanks for the report. Can you tell me whether this error happens with a single GPU?

As for TTS fine-tuning, you can refer to https://github.com/espnet/espnet/tree/master/egs2/qasr_tts/tts1

wisev3 commented 1 year ago

Thank you for your reply. I confirmed that the error also occurs with a single GPU.

sw005320 commented 1 year ago

OK, thanks. @kan-bayashi, do you have any comments?

wisev3 commented 1 year ago

@kan-bayashi, I would be grateful for any comments on my issue.

kan-bayashi commented 1 year ago

Sorry for the late reply. The dataset may include audio clips that are too short. In the random windowed discriminator, we extract a segment from the entire sequence using the following segment_size: https://github.com/espnet/espnet/blob/fc37f80ff96070107c61a2a020f9627f82d646c5/egs2/kss/tts1/conf/tuning/train_jets.yaml#L77 We therefore assume that the audio length is greater than shift size * segment size (in samples). You can remove clips that are too short by passing --min_wav_duration 0.75 to run.sh.
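
The length constraint above comes down to simple arithmetic, sketched here in Python. This is only an illustration and not part of ESPnet: the function names and utterance IDs are hypothetical, and segment_size = 64 is an assumed value (check the linked train_jets.yaml line for the actual setting). n_shift = 256 and fs = 24000 are taken from the command in the report.

```python
from typing import Dict, List


def min_required_duration(segment_size: int, n_shift: int, fs: int) -> float:
    """Length in seconds of one training segment (segment_size frames of n_shift samples)."""
    return segment_size * n_shift / fs


def find_too_short(durations: Dict[str, float], limit: float) -> List[str]:
    """Return IDs of utterances that are not strictly longer than `limit` seconds."""
    return [utt_id for utt_id, dur in durations.items() if dur <= limit]


# segment_size=64 is an ASSUMPTION; n_shift and fs come from the reported command.
limit = min_required_duration(segment_size=64, n_shift=256, fs=24000)
print(f"minimum duration: {limit:.3f} s")  # minimum duration: 0.683 s

# Hypothetical utterance durations in seconds, for illustration only.
example = {"utt_a": 0.52, "utt_b": 0.70, "utt_c": 3.10}
print(find_too_short(example, limit))  # ['utt_a']
```

Under these assumptions the segment length is about 0.683 s, so a threshold of 0.75 s would leave a small safety margin above it.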

wisev3 commented 1 year ago

I sincerely appreciate your answer. Training successfully started, and it seems to work. How long do I have to train the model for the KSS example? Do you have any suggestions?