as-ideas / TransformerTTS

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.
https://as-ideas.github.io/TransformerTTS/

Regarding mel start and end token #72

bkumardevan07 opened this issue 3 years ago

bkumardevan07 commented 3 years ago

Hey, I saw you were taking the mel start/end token values as 4/-4; now I think they are 0.5/-0.5. Since you are normalizing the mel spectrogram into the range -4 to +4, don't you think using those token values will cause problems?

I trained on VCTK today with r=1 from the start, but my predictions are mostly the pad values. Did you ever run into a similar issue? I am attaching some images in case they help.

[attached: test and train prediction plots]

I padded the wav files with 12.5 ms of silence (value 0).
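To make the question concrete, here is a minimal numpy sketch of the setup being described; the [-4, 4] normalization range and the 0.5/-0.5 start/end values come from the discussion above, while the function names and shapes are just for illustration.

```python
import numpy as np

def normalize_mel(mel, clip=4.0):
    # Clip log-mel values into roughly [-clip, clip] (illustrative only).
    return np.clip(mel, -clip, clip)

def add_start_end_tokens(mel, start_val=0.5, end_val=-0.5):
    # Prepend/append constant frames used as start/end "tokens".
    # If start_val/end_val fall inside the normalized mel range, they are
    # indistinguishable from real spectrogram values, which is the concern above.
    n_mels = mel.shape[1]
    start = np.full((1, n_mels), start_val, dtype=mel.dtype)
    end = np.full((1, n_mels), end_val, dtype=mel.dtype)
    return np.concatenate([start, mel, end], axis=0)

mel = normalize_mel(np.random.randn(100, 80).astype(np.float32) * 4)
mel_with_tokens = add_start_end_tokens(mel)
print(mel_with_tokens.shape)  # (102, 80)
```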

sanghuynh1501 commented 3 years ago

Hello @bkumardevan07, did you solve the problem?

bkumardevan07 commented 3 years ago

No, but by increasing the batch size I have seen this problem fade away, though I am still trying to get reasonable output. Please note that I modified the architecture to support multiple speakers, and the pad value may not necessarily be the reason for the plots above. Hope it helps.

sanghuynh1501 commented 3 years ago

> No, but by increasing the batch size I have seen this problem fade away, though I am still trying to get reasonable output. Please note that I modified the architecture to support multiple speakers, and the pad value may not necessarily be the reason for the plots above. Hope it helps.

Thanks for your reply. Someone told me guided attention can help; have you tried it yet?

cfrancesco commented 3 years ago

Hi @bkumardevan07, if you start with r=1 you will most likely not get the alignment between text and audio. You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, the model will fail to produce any reasonable output. You should see some initial alignments within the first 6K to 30K steps or so, depending on the config.

Hi @sanghuynh1501, yes, it helps massively. When training on a new dataset (very clean, well curated) I wasn't getting any alignments until I forced a diagonal loss. You can find the code for this under the dev branch. I'm still experimenting with a lot of new things, so it might take a while before I bring this to master, but I definitely recommend trying it out.
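For anyone who wants to try something similar before it lands in master, here is a rough sketch of a DCTTS-style guided attention (diagonal) loss. This is only an illustration; the width parameter g, the shapes, and the weighting are my assumptions, not necessarily what is in the dev branch.

```python
import numpy as np

def guided_attention_weights(text_len, mel_len, g=0.2):
    # Penalty matrix: ~0 near the text/mel diagonal, growing off-diagonal (DCTTS-style).
    n = np.arange(text_len)[:, None] / text_len   # (T_text, 1)
    t = np.arange(mel_len)[None, :] / mel_len     # (1, T_mel)
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))

def diagonal_loss(attention):
    # attention: (heads, T_text, T_mel) cross-attention weights of one layer.
    heads, text_len, mel_len = attention.shape
    w = guided_attention_weights(text_len, mel_len)
    # Penalize attention mass that sits far away from the diagonal.
    return float(np.mean(attention * w[None]))

# Toy usage: 4 heads, 60 text tokens, 400 mel frames (softmax over the text axis).
attn = np.random.dirichlet(np.ones(60), size=(4, 400)).transpose(0, 2, 1)
print(diagonal_loss(attn))
```

In training this term would simply be added to the reconstruction loss, optionally annealed away once the alignment becomes diagonal.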

Generally, to both of you, I recommend using the autoregressive model to extract the durations you need for the forward model. In the next version of the repo the autoregressive prediction will likely be removed entirely.
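As an illustration of the duration-extraction idea (not the repo's actual extraction code), one simple approach is to take a well-aligned cross-attention head of the autoregressive model and count, for each text token, how many mel frames attend to it most strongly:

```python
import numpy as np

def durations_from_attention(attention):
    # attention: (T_mel, T_text), softmax over the text axis for every mel frame.
    # Assign each mel frame to the text position it attends to most strongly;
    # a token's duration is then the number of frames assigned to it.
    t_text = attention.shape[1]
    best_text_idx = attention.argmax(axis=1)              # (T_mel,)
    return np.bincount(best_text_idx, minlength=t_text)   # (T_text,)

# Toy usage: the extracted durations sum to the number of mel frames.
attn = np.random.dirichlet(np.ones(60), size=400)  # (400 frames, 60 tokens)
print(durations_from_attention(attn).sum())        # 400
```

These per-token durations are what the forward (non-autoregressive) model is then trained to predict.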

I hope this helps! If you have any other questions feel free to ask, or open a new discussion. @bkumardevan07 I am also about to include multispeaker support (in dev there is pitch prediction too, forward model only); maybe it would be useful to open a topic on this under the discussions panel.

iclementine commented 3 years ago

> You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, the model will fail to produce any reasonable output.

That's a really interesting finding. But I also observe that when training an autoregressive TTS model with multiple encoder-decoder attentions (multi-layer, multi-headed, or both), diagonal alignment tends to appear in the shallower layers first; the deeper layers may not learn any diagonal alignment at all. Though with other methods, like head dropout, we can induce more diagonal alignments (sometimes up to 12/16 of the attention heads are diagonal).

But do you have some intuitive idea why the last layer of attention is "critical"? Maybe an attention layer that learns no alignment at all brings messy information from the encoder outputs to the decoder and hurts its performance?

And I am curious: since a stable alignment is desired, and more heads or more layers of attention introduce uncertainty, would it be a better idea to use a single attention in the model, as in DCTTS or Tacotron? In those models the diagonal alignment is always learned in the last attention layer, since there is only one attention.
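For what it's worth, here is a rough sketch of how one could quantify which heads look diagonal, e.g. when counting the 12/16 diagonal heads mentioned above. The band width and the 0.5 threshold are arbitrary choices, just for illustration.

```python
import numpy as np

def diagonality(attention, band=0.1):
    # attention: (T_mel, T_text), each row sums to 1.
    # Returns the fraction of attention mass within a band around the ideal diagonal.
    t_mel, t_text = attention.shape
    mel_pos = np.arange(t_mel)[:, None] / t_mel
    text_pos = np.arange(t_text)[None, :] / t_text
    mask = np.abs(mel_pos - text_pos) < band
    return float((attention * mask).sum() / t_mel)

# Score every head of every layer and count the roughly diagonal ones.
layers, heads = 6, 4
attn_all = np.random.dirichlet(np.ones(60), size=(layers, heads, 400))  # (6, 4, 400, 60)
scores = np.array([[diagonality(attn_all[l, h]) for h in range(heads)]
                   for l in range(layers)])
print((scores > 0.5).sum(), "of", scores.size, "heads look roughly diagonal")
```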

Thank you!

bkumardevan07 commented 3 years ago

@iclementine

In the "Neural Speech ..." paper, the ablation study mentions that, because of the residual connections, the layers act like successive terms of a Taylor expansion approximating the final function. Hence the initial layers learn the low-frequency information/structure, and as the layers go deeper the model tries to learn the higher-order/finer details of the function. In the recent literature it has been shown that some models find it difficult to learn high-frequency information, or that the model initially learns the low-frequency information and, as it starts converging, learns the high-frequency information. In that context, I think the model failing to learn alignments in the last layer could be due to convergence. It might also be difficult to learn because of noisy data.

Regarding your question on why the last layer is critical, I am unsure too, because we are anyway using skip connections that should take care of this. Also, I have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.

About your question on the number of attention heads, I think you are right (based on the experimental results); adding more heads might bring uncertainty. In recent papers, authors have started using only two attention heads (because of the additional speaker vector concatenated with the spectrogram).

Looking forward to more insights from someone.

iclementine commented 3 years ago

@bkumardevan07

> Also, I have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.

In the Transformer architecture, multi-head attention is wrapped in a residual connection and layer normalization, so theoretically the query and the attention output are already fused together by addition.

Concatenating the query with (Attention · Values) before applying the output affine layer is somewhat more expressive than simply adding. As soobinseo/Transformer-TTS mentions, concatenating the query is very important; maybe that is based on experimental results. In Deep Voice 3, however, the attention output and the query are fused by simple addition.
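As a concrete (illustrative) sketch of the two fusion styles being compared, with made-up shapes and random projection matrices:

```python
import numpy as np

d_model, t_mel = 256, 10
rng = np.random.default_rng(0)

query = rng.standard_normal((t_mel, d_model))
attn_out = rng.standard_normal((t_mel, d_model))  # stands in for softmax(QK^T/sqrt(d)) @ V

# (a) Residual-style fusion (Deep Voice 3 / vanilla Transformer):
# project the attention output, then add the query back.
w_out_add = rng.standard_normal((d_model, d_model))
fused_add = attn_out @ w_out_add + query

# (b) Concatenation-style fusion (as in soobinseo/Transformer-TTS):
# concatenate [attention output ; query] and project back to d_model.
w_out_cat = rng.standard_normal((2 * d_model, d_model))
fused_cat = np.concatenate([attn_out, query], axis=-1) @ w_out_cat

print(fused_add.shape, fused_cat.shape)  # (10, 256) (10, 256)
```

The concatenated version lets the output projection weight the query and the attention output differently per dimension, which is one way to read the claim that it is more expressive than plain addition.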

The analogy of shallower layers learning low-frequency information as a Taylor expansion is intuitive. Thank you.

NVIDIA has a model called Centaur, which removes self-attention and uses only cross-attention (with a single cross-attention head). https://github.com/cpuimage/Transformer-TTS