facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec

The test results are different from those in the paper #12

Closed WzyCrush closed 4 months ago

WzyCrush commented 11 months ago

I tested the clean_testset_wav audio from the Valentini dataset according to the description in your paper, using the vctk_v1 model. The test results are as follows (screenshot attached). They are quite different from those in your paper. Could you release the calculation code?

bigpon commented 11 months ago

Hi, the evaluation code is mostly from the sprocket-vc repo.

We used their feature extractor to extract f0, U/V segments, and mcep, and their melcd function to calculate MCD. Since the WORLD vocoder is sensitive to speaker differences, we set different f0 search ranges for the two test speakers; if they are not set carefully, f0 extraction errors usually increase. Please refer to their example to correctly extract the features.

Moreover, for F0RMSE, we only calculate over the voiced segments.
Based on your results, I assume you calculated the F0RMSE over the whole utterance (including both voiced and unvoiced parts).

For the MCD part, we first conduct VAD to remove the silent parts, and we calculate the MCD excluding the energy term, which is the first dimension of mcep. If you directly calculate the MCD over the whole utterance (including the silent parts), it tends to result in a much lower MCD.
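For concreteness, the two distinctions above (voiced-only F0RMSE and energy-excluded MCD) can be sketched in plain NumPy. The arrays below are hypothetical, and the actual evaluation extracts features with sprocket/WORLD, so treat this as an illustrative assumption, not the paper's exact code:

```python
import numpy as np

# Hypothetical per-frame features; in practice they come from WORLD / sprocket.
f0_ref = np.array([0.0, 200.0, 210.0, 0.0, 190.0])  # 0.0 marks unvoiced frames
f0_gen = np.array([0.0, 195.0, 205.0, 0.0, 0.0])
mcep_ref = np.random.RandomState(0).randn(5, 50)    # dim 0 is the energy term
mcep_gen = mcep_ref + 0.1

# F0RMSE over frames voiced in BOTH contours -- including unvoiced (0 Hz)
# frames would wrongly inflate the error.
voiced = (f0_ref > 0) & (f0_gen > 0)
f0_rmse = np.sqrt(np.mean((f0_ref[voiced] - f0_gen[voiced]) ** 2))  # → 5.0

# MCD excluding the energy term (first mcep dimension); frames should already
# have the silent parts removed by VAD before this step.
diff = mcep_ref[:, 1:] - mcep_gen[:, 1:]
mcd = np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

With these toy contours only frames 1 and 2 are voiced in both sequences, so the unvoiced mismatch at frame 4 does not enter the F0RMSE.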

WzyCrush commented 11 months ago

Hi, thank you very much for your response. I followed the instructions you provided to perform the following operations:

  1. Draw histograms for p232 and p257 according to the example, and determine the F0min and F0max of each speaker from the histograms. I set F0min/F0max of p232 to 40/240 and of p257 to 40/390. The histograms are shown below.
  2. Use webrtcvad to detect the unvoiced segments in the audio files (both the original and the reconstructed audio), and then remove them. The example below shows an instance of removing unvoiced segments (audio waveform diagram).
  3. Calculate the F0 of the original audio and the reconstructed audio (with unvoiced segments removed) based on F0min and F0max, and then calculate the F0RMSE.

My calculated results are still significantly different from those in the paper. I used webrtcvad to implement the VAD. Is there any problem with this step? Could you tell me how you implemented VAD? If there are any other problems in the calculation process, please let me know. (Histogram and waveform images attached.)

bigpon commented 11 months ago

Hi, according to the figures, the correct f0 search range of P232 should be around 70-240 Hz, and that of P257 should be around 140-340 Hz. (The underlying theory is that the f0 extractor is not perfect, and most f0 values should follow a single Gaussian distribution; the other peaks in the figures are therefore likely erroneous f0s caused by extraction errors.) Please check the tutorial in the sprocket repo carefully. In addition, since speech also includes unvoiced parts, it is incorrect to use VAD to get the voiced/unvoiced indexes. The U/V segments should be determined by the extracted f0s.
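Deriving the U/V decision from the extracted f0, as suggested above, reduces to comparing the zero/non-zero patterns of the two contours. A minimal sketch with hypothetical contours:

```python
import numpy as np

# Hypothetical f0 contours (Hz) from the WORLD extractor; 0.0 = unvoiced.
f0_ref = np.array([0.0, 110.0, 115.0, 0.0, 120.0, 0.0])
f0_gen = np.array([0.0, 108.0, 0.0, 0.0, 118.0, 125.0])

uv_ref = f0_ref > 0  # voiced/unvoiced decision taken straight from f0, not VAD
uv_gen = f0_gen > 0
uv_error = 100.0 * np.mean(uv_ref != uv_gen)  # percent of mismatched frames
```

Here frames 2 and 5 disagree, giving a U/V error of 2/6 ≈ 33.3%.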

The VAD function is only for the spectral distortion measurements (MCD, etc.). I used the VAD provided by the sprocket repo, which requires hand-tuning the power threshold. Please refer to their tutorial, too.

WzyCrush commented 10 months ago

Hi, thank you very much for your reply. I set the F0 search range of P232 to 70-240 Hz and of P257 to 140-340 Hz. I regard frames where F0 equals 0 as unvoiced and non-zero frames as voiced. The U/V error results are as follows.

The F0RMSE is calculated over the frames that are voiced in both the original audio and the reconstructed audio. The results are as follows.

I use the get_alignment() function in the sprocket repo to calculate the MCD, with the power threshold left at the function's default of -20. The results are as follows (screenshot attached).

There may be some errors in my calculation process, and the calculation results are still different from those shown in the paper. Thanks again for your reply.

bigpon commented 10 months ago

Hi, I just checked my code; sorry, I forgot to mention some details.

  1. I extract “mcep” and “f0” features before doing any VAD and DTW.

The settings of the f0 extraction are

p232: f0_max: 240, f0_min: 65, pow_th: -22
p257: f0_max: 390, f0_min: 120, pow_th: -15

Please note that the power threshold should be adjusted according to the statistical results (The tutorial has mentioned how to do it.)

The settings of the mcep extraction are

sampling_rate: 48000 # Sampling rate
fft_size: 2048 # FFT size
hop_size: 240 # Hop size
win_length: 2048 # Window length; if set to null, it will be the same as fft_size
window: "hann" # Window function
shiftms: 5 # Frame shift (ms)
mcep_dim: 49 # Mcep dimension
highpass_cutoff: 70 # Cutoff frequency of the preprocessing highpass filter
mcepshift: 1

  2. After that, for the MCD calculation, I first obtain the speech and silence indexes of the reference (natural) utterance via VAD. I apply the VAD indexes to both the generated and the reference mceps to get the speech mceps, then perform DTW to align the generated frames with the reference frames. The first dimension of mcep (the power) is excluded from the MCD calculation.

  3. For the f0RMSE, I first use the VAD and time-warping indexes from the MCD step to get the aligned f0s. Then I calculate the f0RMSE over the overlapped voiced parts (f0 != 0.0) of the two aligned f0 sequences.
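The alignment step above could look like the following from-scratch DTW sketch. The thread actually uses sprocket's get_alignment(), so this plain dynamic-programming version is only an assumed stand-in:

```python
import numpy as np

def dtw_path(ref, gen):
    """Plain DTW over Euclidean frame distances; returns (ref_idx, gen_idx)
    pairs. A simplified stand-in for sprocket's get_alignment()."""
    n, m = len(ref), len(gen)
    dist = np.linalg.norm(ref[:, None, :] - gen[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack through the cheapest predecessor
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
    return path[::-1]

# Toy mceps (energy dim already dropped, silence removed by VAD): the
# generated frames are aligned against the reference before computing MCD.
ref = np.array([[0.0], [1.0], [2.0]])
gen = np.array([[0.0], [1.1], [1.9]])
path = dtw_path(ref, gen)  # → [(0, 0), (1, 1), (2, 2)]
```

The same index pairs can then be applied to the f0 sequences, which is what the f0RMSE step relies on.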

WzyCrush commented 10 months ago

Hi, thank you very much for your reply.

I followed the settings you mentioned to extract “f0” and “mcep”.

For the MCD calculation, I first get the speech indexes via VAD and then the speech mceps via those indexes. I then align the mceps through DTW and finally calculate the MCD (with the first dimension of mcep excluded). My MCD result is 4.76, which is still different from the result in the paper.

In sprocket-vc, I did not find the code for time-warping the f0 via the speech indexes. Does the f0 alignment also use DTW?

I determined the U/V segments from the extracted f0, following the settings you mentioned, but the calculated U/V error is 10.56%, which differs from the result in the paper.

There may still be errors in some of the details of my calculations. It would be greatly appreciated if you could release your evaluation codes.

bigpon commented 10 months ago

Hi, as I mentioned above, I used the time-warping results of the mcc for the f0 alignment. (Directly aligning f0 doesn't make sense; you have to find the alignment according to spectral similarities.) WORLD feature extraction also includes some randomness, so the results may differ slightly across devices.
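Reusing the mcc time-warping for f0, as described here, might look like the following sketch (the path pairs are hypothetical; in practice they come from the mcep DTW):

```python
import numpy as np

# Hypothetical (ref_frame, gen_frame) pairs from the mcep (mcc) DTW alignment.
path = [(0, 0), (1, 1), (2, 1), (3, 2)]
f0_ref = np.array([0.0, 200.0, 210.0, 0.0])
f0_gen = np.array([0.0, 205.0, 190.0])

ri = np.array([p[0] for p in path])
gi = np.array([p[1] for p in path])
a, b = f0_ref[ri], f0_gen[gi]   # f0s warped with the spectral alignment
voiced = (a > 0) & (b > 0)      # keep only the overlapped voiced frames
f0_rmse = np.sqrt(np.mean((a[voiced] - b[voiced]) ** 2))  # → 5.0
```

Only the aligned pairs (200, 205) and (210, 205) are voiced on both sides, so the f0RMSE is computed over those two frames.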

P.S. We might not be able to open-source code that was not mostly written by ourselves, because Meta has some open-source policies.