X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"
https://cantabile-kwok.github.io/VoiceFlow/

Error occurred while extracting mel-spectrogram #11

wlsdbtjr commented 4 months ago

Thank you for your interesting and valuable research. I'm having trouble running the following command in the terminal:

`bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 16`

The sampling rate of the original LJSpeech dataset is 22050 Hz, but an error seems to have occurred in the process of downsampling it to 16 kHz.

This is the error message written in `exp/make_fbank/ljspeech/train/make_fbank_train.*.log`:

```
Traceback (most recent call last):
  File "path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 105, in <module>
    main()
  File "/path/to/VoiceFlow-TTS/utils/compute-fbank-feats.py", line 86, in main
    assert rate == args.fs
AssertionError
# Accounting: time=2 threads=1
# Ended (code 1) at Tue 09 Apr 2024 02:01:15 AM UTC, elapsed time 2 seconds
```

Thank you.

cantabile-kwok commented 4 months ago

In this case, please change the following line in `extract_fbank.sh` to match your sampling rate (22050 Hz).

https://github.com/X-LANCE/VoiceFlow-TTS/blob/248c822fd34270b44d4664a68ce2f6a177980f27/extract_fbank.sh#L5C1-L5C48

wlsdbtjr commented 4 months ago

Thank you for your response. However, even after modifying the code as suggested, I encountered an issue where the duration and mel shape did not match during training.

The solution was simply to convert all the data to 16 kHz before training, as described in your paper. Thank you.

cantabile-kwok commented 4 months ago

Oh, this is because changing the sampling rate proportionally changes the frame shift and frame length (which are specified in samples, so their durations in milliseconds change with the rate). Sorry I forgot about that earlier. The provided durations correspond to the current sampling rate, frame shift, and frame length settings, so they cannot be used directly with a different configuration.
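
For intuition, here is a quick numeric sketch (the 256-sample hop is an assumption for illustration, not necessarily this repo's exact setting) of how the same hop size maps to different frame durations at different sampling rates:

```python
# Same hop size in samples -> different frame durations at different rates,
# so phoneme durations counted in frames stop matching the mel-spectrogram.
hop = 256  # frame shift in samples (illustrative value)

for sr in (16000, 22050):
    frame_ms = 1000 * hop / sr
    frames_per_sec = sr / hop
    print(f"{sr} Hz: {frame_ms:.2f} ms per frame, ~{frames_per_sec:.0f} frames/s")

# 16000 Hz: 16.00 ms per frame, ~62 frames/s
# 22050 Hz: 11.61 ms per frame, ~86 frames/s
```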

Glad to hear that you solved the problem by downsampling. If you have any other problems, feel free to open another issue.

kelvinqin commented 4 months ago

Thank you for your interesting and valuable research.

In my experiment, I also downsampled the data (22050 Hz -> 16000 Hz).

Then I ran: `bash extract_fbank.sh --stage 0 --stop_stage 2 --nj 4`

And then: `python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur`

But then I got the following error: `AssertionError: Frame length mismatch: utt LJ012-0035, dur: 443, mel: 447`

The only solution I found was to skip line 187 of `data_loader.py`. I am not sure if this is fine? Thanks!

cantabile-kwok commented 4 months ago

@kelvinqin I believe your process is correct. I have also seen this mismatch in my other experiments. Since the difference is only 4 frames (i.e., 64 ms in this setting), we can still tolerate it: the durations and the mel-spectrograms come from different programs, and their framing algorithms may differ slightly. In this case, a common approach is to truncate the mel-spectrogram to the total length of the durations. You can add a tolerance threshold and check whether the mel length <= duration sum + tolerance; if so, just discard the last several frames of the mel.

But just skipping that line might still be unsafe, because in training the upsampled text conditions still need to match the length of the mel sequence. So adding the truncation step above would be better.
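
A minimal sketch of that check, assuming mel-spectrograms of shape (n_mels, T) and integer per-phoneme durations (the function name and tolerance value are illustrative, not the actual code in `data_loader.py`):

```python
import numpy as np

TOLERANCE = 5  # max acceptable mismatch in frames; tune to your data

def align_mel_to_durations(utt_id, durations, mel, tol=TOLERANCE):
    """Truncate trailing mel frames so the mel length matches the duration sum."""
    dur_sum = int(np.sum(durations))
    n_frames = mel.shape[1]  # mel is assumed to be (n_mels, T)
    assert n_frames <= dur_sum + tol, (
        f"Frame length mismatch: utt {utt_id}, dur: {dur_sum}, mel: {n_frames}"
    )
    if n_frames > dur_sum:
        # Drop the extra frames so the text conditions, once upsampled by
        # the durations, line up with the mel sequence during training.
        mel = mel[:, :dur_sum]
    return mel
```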

kelvinqin commented 4 months ago

@cantabile-kwok Thanks so much for your suggestion; I will follow it in my experiments. :-) Kelvin

NathanWalt commented 1 month ago

Thank you for explaining the causes of this problem, which I encountered myself. I'm trying to train and test your model on 22.05 kHz data for comparison with other models, so I'm afraid the mismatch could affect the model's performance. Is there a neat solution to the mismatch problem, like adjusting the parameters of MFA?

cantabile-kwok commented 1 month ago

@NathanWalt Yes, a neat solution is to adjust the parameters in MFA alignment extraction. The workflow of the whole thing should be:

1. Determine the audio processing parameters (in your case, a 22050 Hz sampling rate, a certain frame shift in sample points, the window length, and fmax/fmin for mel extraction).
2. Use these parameters to extract the audio features, train MFA, and obtain the alignments with respect to that frame shift.
3. Have a vocoder that matches this set of features.
4. Train the TTS acoustic model.

Usually, for 22050 Hz speech data, I remember there are some publicly available HiFi-GAN checkpoints with a 256-point frame shift and a 1024-point window length. If you have a vocoder ready, you can use the corresponding parameters for MFA.
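
As a concrete illustration, a self-consistent parameter set might look like the sketch below (the n_mels, fmin, and fmax values are assumptions; the key point is only that MFA, feature extraction, and the vocoder all share the same sampling rate and frame shift):

```python
# Hypothetical feature configuration for 22.05 kHz data.
feature_cfg = {
    "sampling_rate": 22050,
    "hop_length": 256,    # frame shift in samples: 256 / 22050 ~= 11.6 ms
    "win_length": 1024,   # window length in samples: ~46.4 ms
    "n_mels": 80,         # illustrative value
    "fmin": 0,            # illustrative value
    "fmax": 8000,         # illustrative value
}

# MFA alignments come in seconds, so phoneme boundaries must be quantized
# with the same frame shift before they can be used as durations:
frame_shift = feature_cfg["hop_length"] / feature_cfg["sampling_rate"]

def interval_to_frames(t_start, t_end, shift=frame_shift):
    """Convert a phoneme interval in seconds to an integer frame count."""
    return round(t_end / shift) - round(t_start / shift)
```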

NathanWalt commented 1 month ago

Thank you for your advice. I've set the parameters in `extract_fbank.sh` as you suggested and used the `english_us_arpa` pretrained model for MFA (the process is similar to the one in https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e, except that I use the transcripts in the `metadata.csv` file and the original audio at 22.05 kHz). However, there is still some weird mismatch: the total duration of the phonemes obtained from MFA is about 3 to 8 frames longer than the mel-spectrogram generated by `extract_fbank.sh`. I've adopted truncation for the moment. I wonder whether you've encountered such a problem before.

cantabile-kwok commented 1 month ago

@NathanWalt Hmm, I've experienced the length mismatch, but the mismatch was not as large as 8 frames (in my case, usually 2-3 frames). If your parameters are correctly set, then I guess truncation might still work in your case.

NathanWalt commented 1 month ago

@cantabile-kwok Thanks for your patience and advice! I'll adopt truncation and see what happens after training the model.