Question regarding paper

hash2430 commented 4 years ago

Hello. First, I apologize if this is not the proper channel to ask about your paper, "Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens" (Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro; NVIDIA Corporation).

According to Table 1 of this paper, GPE is 0.08% for both the single-speaker and multi-speaker cases. I tried to replicate this but could not: I reproduced VDE and FFE, but not GPE. My question is, what did you use as the denominator in your equation for GPE? According to the "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron" paper, GPE uses the number of frames that are voiced in both the generated and the reference signal. However, the other metrics in that paper, VDE and FFE, use the number of all frames.

The numbers would make sense to me if you used the number of all frames to calculate GPE in Table 1 of the Mellotron paper. I am sorry if this is a silly question.
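To make the two candidate denominators concrete, here is a minimal sketch of the three metrics, following the definitions in the prosody-transfer paper as I read them (a gross pitch error is a deviation of more than 20% from the reference pitch). The function name `pitch_metrics`, the `tol` parameter, and the convention that a frame is voiced when its F0 is greater than zero are assumptions for illustration, not taken from the Mellotron code.

```python
import numpy as np


def pitch_metrics(f0_ref, f0_gen, tol=0.2):
    """Return GPE (both denominators), VDE and FFE for two F0 tracks.

    Assumption: a frame is voiced when its F0 value is > 0.
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_gen = np.asarray(f0_gen, dtype=float)
    if f0_ref.shape != f0_gen.shape or f0_ref.size == 0:
        raise ValueError("F0 tracks must be non-empty and the same length")

    n_frames = f0_ref.size
    v_ref = f0_ref > 0            # voicing decision, reference
    v_gen = f0_gen > 0            # voicing decision, generated
    both = v_ref & v_gen          # frames voiced in both signals

    # Gross pitch errors: only counted where both signals are voiced.
    gross = both & (np.abs(f0_gen - f0_ref) > tol * f0_ref)
    # Voicing decision errors: the two voicing decisions disagree.
    voicing_err = v_ref != v_gen

    n_gross = int(gross.sum())
    n_both = int(both.sum())
    n_voicing_err = int(voicing_err.sum())

    return {
        # Denominator from the prosody-transfer paper: frames voiced in both.
        "gpe": n_gross / n_both if n_both > 0 else float("nan"),
        # Alternative denominator discussed in this thread: all frames.
        "gpe_all_frames": n_gross / n_frames,
        "vde": n_voicing_err / n_frames,
        "ffe": (n_gross + n_voicing_err) / n_frames,
    }


if __name__ == "__main__":
    # Toy tracks, made up purely for illustration (0 marks an unvoiced frame).
    ref = [100, 100, 0, 200, 0, 150, 150, 0, 0, 0]
    gen = [100, 130, 0, 0, 90, 150, 151, 0, 0, 0]
    print(pitch_metrics(ref, gen))
```

On these toy tracks one of the four frames voiced in both signals is a gross error, so the two GPE variants differ (0.25 versus 0.1) even though VDE and FFE are unaffected by the choice.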

Thanks!

hash2430 commented 4 years ago

Now I am pretty sure you used the number of all frames as the denominator of the GPE equation, because your GPE and VDE sum to FFE. But the paper you referenced for GPE, "A method for fundamental frequency estimation and voicing decision: Application to infant utterances recorded in real acoustical environments", also uses the number of voiced frames as the denominator.
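The reasoning behind that observation can be written out. This is a sketch assuming the definitions in the prosody-transfer paper, where $p_t, p'_t$ are the reference and generated pitch, $v_t, v'_t$ the corresponding voicing decisions, and $T$ the total number of frames; the shorthand $G$ and $N_{vv}$ is introduced here only for illustration.

```latex
\begin{align*}
G &= \sum_t \mathbb{1}\left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\mathbb{1}[v_t]\,\mathbb{1}[v'_t],
\qquad
N_{vv} = \sum_t \mathbb{1}[v_t]\,\mathbb{1}[v'_t] \\
\mathrm{GPE} &= \frac{G}{N_{vv}},
\qquad
\mathrm{VDE} = \frac{\sum_t \mathbb{1}[v_t \neq v'_t]}{T},
\qquad
\mathrm{FFE} = \frac{G + \sum_t \mathbb{1}[v_t \neq v'_t]}{T} \\
\mathrm{FFE} - \mathrm{VDE} &= \frac{G}{T}
\end{align*}
```

So FFE minus VDE equals the gross-error count divided by all frames. A reported GPE that satisfies GPE + VDE = FFE is therefore consistent with using $T$ as the denominator; with the $N_{vv}$ denominator the same count gives $\mathrm{GPE} = \frac{T}{N_{vv}}(\mathrm{FFE} - \mathrm{VDE})$, which is larger whenever some frame is not voiced in both signals.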

rafaelvalle commented 4 years ago

We use the equations described in "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron"

hash2430 commented 4 years ago

But that one also has the number of voiced frames as denominator

blisc commented 4 years ago

Thanks for bringing this to our attention. You are correct: we used the wrong denominator in calculating GPE. Our new GPE numbers are as follows: Mellotron LJS-Sally, 0.08% -> 0.26%; Mellotron LibriTTS, 0.08% -> 0.42%.

We will update the paper shortly.

NVIDIA / mellotron

Question regarding paper #43