auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

confusion with speaker encoder and loss func #29

Open andylida opened 4 years ago

andylida commented 4 years ago

Thanks for this code. I couldn't find any implementation of the speaker encoder in the demo. Is it left out of this demo?

Also, in the loss function figure I can't figure out the difference between L_recon and L_recon0.

Thanks a lot for any guidance.

auspicious3000 commented 4 years ago

Please refer to #24 for the speaker encoder. You don't need the speaker encoder if you don't do zero-shot conversion. As for the losses: "During training, reconstruction loss is applied to both the initial and final reconstruction results."

andylida commented 4 years ago

So the initial one is the output right after the 3-layer LSTM, and the final one is the output refined by the residual blocks?

Also, in the code, what are emb_org and emb_trg? I thought emb_trg comes from the speaker encoder, i.e. it is the encoder's output given emb_org as input. If so, when converting, why does the code concatenate both the source and target speaker embeddings onto the input of the content encoder? The target speaker doesn't contribute any content.

Thanks for the guidance.
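For later readers, here is a minimal sketch of how the demo appears to wire these embeddings up, as I read model_vc.py and the conversion notebook. The hyperparameters (32, 256, 512, 32) and the three-tuple return are assumptions from that reading:

```python
import torch
from model_vc import Generator  # the repo's AutoVC model

# Assumed demo hyperparameters: dim_neck=32, dim_emb=256, dim_pre=512, freq=32
G = Generator(32, 256, 512, 32).eval()

uttr_org = torch.randn(1, 128, 80)  # (batch, frames, n_mels) source mel
emb_org  = torch.randn(1, 256)      # source speaker embedding
emb_trg  = torch.randn(1, 256)      # target speaker embedding

with torch.no_grad():
    # emb_org conditions the content encoder so it can strip the source
    # speaker's identity from the content codes; emb_trg conditions only
    # the decoder, which is why the target speaker contributes no content.
    mel_pre, mel_post, _ = G(uttr_org, emb_org, emb_trg)

converted = mel_post.squeeze(1)  # post-postnet mel (squeeze in case the
                                 # model returns an extra channel dim)
```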

arunasank commented 4 years ago

@andylida did you understand the difference between L_recon and L_recon0?

CODEJIN commented 4 years ago

@arunasank As I understand it, L_recon is computed on the mel after the postnet and L_recon0 on the mel before the postnet. Tacotron 2 uses a similar postnet structure and pair of losses.
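In code, that would look roughly like this (a minimal sketch; the MSE choice and variable names are assumptions based on my reading of the repo's solver):

```python
import torch
import torch.nn.functional as F
from model_vc import Generator  # the repo's AutoVC model

G = Generator(32, 256, 512, 32)    # assumed demo hyperparameters
x_real  = torch.randn(2, 128, 80)  # ground-truth mels
emb_org = torch.randn(2, 256)      # embeddings of the same speakers

# Autoencoding pass: reconstruct the input conditioned on its own speaker.
mel_pre, mel_post, _ = G(x_real, emb_org, emb_org)
mel_pre, mel_post = mel_pre.squeeze(1), mel_post.squeeze(1)

loss_recon0 = F.mse_loss(x_real, mel_pre)   # "initial" (pre-postnet) mel
loss_recon  = F.mse_loss(x_real, mel_post)  # "final" (post-postnet) mel
loss = loss_recon + loss_recon0             # both applied during training
```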

Trebolium commented 3 years ago

Hi @CODEJIN. I have read the AutoVC and Tacotron papers, but neither seems to say much about why a postnet is used in the first place. Where can I learn more about this? I am wondering why it is necessary, because when I train my AutoVC models the postnet very quickly learns to output nothing but small, near-zero-mean values, while the prenet output generates all of the visible detail in the mel spectrograms.

Currently, with the AutoVC models I have trained, the postnet provides only a very faint output. In the figure below, the 1st row shows the original x_input data, the 2nd is the L_recon0 output, the 3rd is the L_recon output (if the images were scaled between 0 and 1, these would be almost totally black), and the 4th is the combined prenet and postnet output, which looks identical to the 2nd row.

[figure: 500000 iterations]

The postnet seems to output nothing but negligible values after about 10k iterations (the figure shown is actually from 584k iterations). Does anyone have any thoughts on this? I would love to know where I can learn more about the use of these prenets and postnets. Thanks for taking the time to read this far! 👍
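One quick way to quantify "the postnet outputs almost nothing" is to compare the residual's magnitude to the pre-postnet mel. A diagnostic sketch, reusing G, x_real, and emb_org from the sketch above:

```python
import torch

with torch.no_grad():
    mel_pre, mel_post, _ = G(x_real, emb_org, emb_org)
    # mel_post - mel_pre is exactly the postnet's contribution, since the
    # final mel is formed as mel_pre + postnet(mel_pre).
    residual = mel_post - mel_pre
    ratio = residual.abs().mean() / mel_pre.abs().mean()
print(f"mean |postnet residual| / mean |pre-postnet mel| = {ratio:.4f}")
# A ratio near zero confirms the postnet has collapsed to a tiny correction.
```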

CODEJIN commented 3 years ago

@Trebolium The Tacotron 2 paper does not specifically state the purpose of the postnet. My personal guess is that it increases the detail of the mel. The postnet is a residual structure (output = f(x) + x), so it only adds a small correction on top of the pre-postnet mel. As a result, the postnet sharpens the detail of the mel, and in a trained model the post-postnet loss is usually lower than the pre-postnet loss.
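Schematically (a self-contained sketch, with a single convolution standing in for the real postnet stack):

```python
import torch
import torch.nn as nn

postnet = nn.Conv1d(80, 80, kernel_size=5, padding=2)  # stand-in for f(x)
mel_pre = torch.randn(1, 80, 128)                      # (batch, n_mels, frames)
mel_final = mel_pre + postnet(mel_pre)                 # out = f(x) + x
# If the learned f(x) is ~0 everywhere, mel_final is ~mel_pre, which is
# consistent with the faint postnet output reported above.
```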

Trebolium commented 3 years ago

Do you know where I could learn more about postnet implementations? It's a tricky thing to just Google. Thanks for replying so quickly!


ruclion commented 3 years ago

> Hi @CODEJIN. I have read the AutoVC and Tacotron papers, but neither seems to say much about why a postnet is used in the first place. […]

Tacotron needs a postnet because it generates mels autoregressively: each frame is conditioned only on the frames before it. Put another way, the pre-postnet mel has to do two jobs at once: (1) model the mel content and (2) serve as the autoregressive input for the next step. But ideally each mel frame (1) should depend on both past and future context and (2) should only have to model the content. So a CNN is used as a postnet, which sees the whole sequence at once, to make the mels better.

However, the author's network may not need a postnet if the LSTM is bidirectional. Also, I find that this code is not the same as the paper's postnet: this code has an LSTM in the postnet.
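For reference, a Tacotron-2-style postnet is just a small stack of 1-D convolutions applied to the already generated mel, so every frame can draw on both past and future context (a sketch; the 5-layer/512-channel/kernel-5 sizes follow the Tacotron 2 paper):

```python
import torch
import torch.nn as nn

class ConvPostnet(nn.Module):
    """Tacotron-2-style postnet sketch: 5 x (Conv1d -> BatchNorm -> tanh),
    kernel size 5, so every frame sees both past and future context."""
    def __init__(self, n_mels=80, channels=512, layers=5, kernel=5):
        super().__init__()
        convs = []
        dims = [n_mels] + [channels] * (layers - 1) + [n_mels]
        for i in range(layers):
            convs += [nn.Conv1d(dims[i], dims[i + 1], kernel, padding=kernel // 2),
                      nn.BatchNorm1d(dims[i + 1])]
            if i < layers - 1:
                convs.append(nn.Tanh())  # tanh on all but the last layer
        self.net = nn.Sequential(*convs)

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        return mel + self.net(mel)   # residual correction over the whole mel

post = ConvPostnet()
mel = torch.randn(1, 80, 128)
print(post(mel).shape)  # torch.Size([1, 80, 128])
```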

CODEJIN commented 3 years ago

@ruclion Interesting. Now I don't know the clear reason either. I think it would be better to ask the author of the paper (the owner of this repository).... :) And please let me know where the LSTM is in this code.... When I checked this repo, the postnet was here, and it contains only several convolution layers.

ruclion commented 3 years ago

> And please let me know where the LSTM is in this code.... […]

Yeah, you are right. The author's postnet is CNN only, no LSTM. Haha, thank you!