A-Jacobson / tacotron2

pytorch tacotron2 https://arxiv.org/pdf/1712.05884.pdf

Information sharing #1

Open Rayhane-mamah opened 6 years ago

Rayhane-mamah commented 6 years ago

Hello @A-Jacobson.

Great work with your implementation and, more importantly, with your clear representation of the model in your README (100% better than the one presented in the paper x) ).

So I am actually also working on a Tacotron 2 implementation (in TensorFlow) and there are a few things I wanted to check with you; maybe we could help each other out. (implementation here)

Again, impressive work.

A-Jacobson commented 6 years ago

Hey @Rayhane-mamah,

Thanks for the compliment. Though this is still very much a work in progress. I'll take a look at your work soon!

The attention mechanism doesn't seem to be working correctly as it is. The alignments are all over the place, though the generated spectrograms look quite good after only 1 epoch. I believe this could be due to the use of teacher forcing. The paper just mentions that they use it without giving a ratio, so I have had it set to "always on". I know in NMT it's common to have a teacher forcing ratio of 0.5.
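For illustration, a minimal sketch of how a teacher-forcing ratio could be wired into a decoder loop (`decoder_step`, `memory`, and the frame shapes are placeholders, not names from this repo):

```python
import random
import torch

def decode(decoder_step, memory, targets, teacher_forcing_ratio=0.5):
    """Run the decoder for targets.size(0) steps, feeding back either the
    ground-truth frame (teacher forcing) or the model's own prediction."""
    outputs = []
    prev_frame = torch.zeros_like(targets[0])          # all-zero <GO> frame
    for t in range(targets.size(0)):
        frame = decoder_step(prev_frame, memory)       # one decoder time step
        outputs.append(frame)
        use_teacher = random.random() < teacher_forcing_ratio
        prev_frame = targets[t] if use_teacher else frame.detach()
    return torch.stack(outputs)
```

With `teacher_forcing_ratio=1.0` this reduces to the "always on" setting; 0.5 is the NMT-style compromise mentioned above.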

I'm planning to tackle the attention problems today; some of the potential issues could be:

Best of luck with your work!

Rayhane-mamah commented 6 years ago

Thanks for your quick reply @A-Jacobson.

About the teacher forcing, that's actually a nice perspective I hadn't thought about, since I had only considered using the "always on" teacher forcing.

As for the attention mechanism, as far as I could understand from the paper, they extended Bahdanau's "sum style" attention to use cumulative location features as an extra input (extracting them with a convolution and so on). As far as I know, this requires the use of key, query, and previous alignments, and if I'm not mistaken, this is the "hybrid" (content + location based) attention, not the purely location-based one (just my point of view).
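For concreteness, a rough PyTorch sketch of that hybrid (content + location) energy, e_t = v^T tanh(W query + V keys + U f_t), where f_t comes from convolving the cumulative alignments; the class name and layer sizes here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim, key_dim, attn_dim=128, loc_filters=32, loc_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_layer = nn.Linear(key_dim, attn_dim, bias=False)
        # location features: convolve the cumulative alignments from previous steps
        self.location_conv = nn.Conv1d(1, loc_filters, loc_kernel,
                                       padding=(loc_kernel - 1) // 2, bias=False)
        self.location_layer = nn.Linear(loc_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, cumulative_alignments):
        # query: (B, query_dim), keys: (B, T, key_dim), cumulative_alignments: (B, T)
        loc = self.location_conv(cumulative_alignments.unsqueeze(1))    # (B, F, T)
        loc = self.location_layer(loc.transpose(1, 2))                  # (B, T, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) + self.key_layer(keys) + loc)).squeeze(-1)
        alignments = F.softmax(energies, dim=-1)                        # (B, T)
        context = torch.bmm(alignments.unsqueeze(1), keys).squeeze(1)   # (B, key_dim)
        return context, alignments
```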

Best of luck to you too; if I ever find the right way to use the attention, you'll be the first to know!

A-Jacobson commented 6 years ago

Hmm, that's what I did.

previous context --> conv1d layers --> add.

To me, "cumulative" would be some weighted sum over previous context vectors, though I suppose that information is implicitly carried forward since each context vector is computed with information from all of the previous vectors.

Are you using zoneout and LSTMs in your project and still running into this problem?

Rayhane-mamah commented 6 years ago

Yeah, I'm assuming that information is implicitly being carried forward, since each context vector is computed using the previous one.

I am indeed using zoneout LSTMs (unless they're not working correctly) and still running into many problems, in fact. Even when changing the attention to a more basic one like Luong's or Bahdanau's, mel outputs tend to be blurry, and at some stage in training the "before post-net" loss explodes and attention is completely lost (tends to 0). I'm still not really sure about the reason. However, I should point out that I am using a separate LSTM for attention (with 128 units) and concatenating its output with the context vector before sending them to the decoder LSTMs (based on the original Tacotron approach; until now that gave the most "normal" results compared to Luong attention, before exploding of course).

A-Jacobson commented 6 years ago

My implementation appears to be working now. I ended up using the last decoder rnn hidden state as the query vector rather than the output of the prenet and fixed a malicious typo related to updating my decoder hidden state. These changes alone are giving reasonable results (The model learns to ignore the padding tokens at 50 steps).

It's odd that your attention is zero, or close to zero, since each frame is being passed through a softmax layer. I would make sure you don't have zeros as input to the attention layer; the most likely culprit would be your hidden state (query vector). I'm also not sure a separate LSTM layer to generate the query hidden state is necessary, since your hidden state is already the output of your LSTM at the previous step in the loop. With regards to exploding gradients, try gradient clipping.
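For reference, gradient clipping in PyTorch is a single call between `backward()` and the optimizer step; a self-contained toy example (the stand-in model and the max_norm of 1.0 are just illustrative, not from the paper):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the full Tacotron model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# rescale gradients so their global norm is at most 1.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```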

As for when we can expect to see alignments, it's supposedly around 15k steps see: https://github.com/keithito/tacotron/issues/90.

Once things appeared to be working, I also rewrote my attention mechanism to exactly replicate the one from the Bengio paper and switched from GRUs to LSTMs to more closely match Tacotron 2. The only thing I haven't added yet is zoneout.

Just out of curiosity, how are you padding your text/spectrograms and how much gpu memory does your implementation take per training batch?

Rayhane-mamah commented 6 years ago

Hello again @A-Jacobson, sorry for the late reply.

If your attention works, I will definitely switch to yours too; it seems cleaner (and let's face it, fewer layers = faster computation = happy me :tada: ).

That said, I managed to find the source of my problem. It was 100% related to my weight initialization. I went through all my layers and initialized the weights with Xavier initialization (to keep the same stddev across layers, preventing vanishing or exploding gradients), and now, after visualizing the gradient norms, I can see that all signs of explosion are gone.
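For reference, a minimal PyTorch sketch of applying Xavier (Glorot) initialization across a model's layers (which layer types to touch, and whether to include recurrent weights, is a judgment call):

```python
import torch.nn as nn

def xavier_init(module):
    """Apply Xavier (Glorot) initialization to linear and conv layers."""
    if isinstance(module, (nn.Linear, nn.Conv1d)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(xavier_init)
```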

On the other hand, attention is working properly and I sometimes see it starting to form the right alignment. I am still using a separate LSTM, however, simply because the first Tacotron used a 256-unit GRU for attention, which led me to interpret "The encoder output is consumed by an attention network which summarizes the full encoded sequence as a fixed-length context vector for each decoder output step." as meaning they used a separate RNN in Tacotron 2 as well. I might be wrong; that's why, if your approach works fine, I won't really care whether they used a separate LSTM or not, since it would make a smaller network that yields the same results.

To answer your question, I am padding the inputs (texts) with "0" tokens and padding the outputs (spectrograms) with "0.0" frames, from which the model is expected to learn to predict the true stop token (using a linear transform to a scalar + sigmoid).

Finally, to answer your GPU memory question, I should point out that I recently added the reduction factor (originally used in the first Tacotron), which consists of predicting "r" (reduction factor) frames simultaneously at each decoding step. This means the model makes fewer decoding steps in training, reducing computation and freeing memory, and it seems to let the model capture alignment faster.
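For illustration, a minimal sketch of the reduction-factor idea in PyTorch (the class and argument names are placeholders, not from either repo):

```python
import torch
import torch.nn as nn

class FrameProjection(nn.Module):
    """Project one decoder output into r mel frames at once (reduction factor)."""
    def __init__(self, decoder_dim, n_mels=80, r=5):
        super().__init__()
        self.n_mels, self.r = n_mels, r
        self.proj = nn.Linear(decoder_dim, n_mels * r)

    def forward(self, decoder_output):
        # decoder_output: (B, decoder_dim) -> (B, r, n_mels)
        frames = self.proj(decoder_output)
        return frames.view(-1, self.r, self.n_mels)

# With r=5, a 500-frame target only needs 100 decoder steps.
```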

Since my main (more powerful) GPU is busy on another project, I'm only using a 920MX (2 GB of VRAM) for training Tacotron 2, and it only supports a batch size of 12 at most (using the reduction factor r=5), but I suspect a 1080 Ti would easily train the model with a batch size of 64. I hope this answers your question.

The comment is getting pretty long, but just to make sure there isn't something wrong with my loss function: I saw that you are only using the MSE of the decoder outputs (with no post-net?) and the cross-entropy of the stop prediction in your loss. I am doing the same, except that I also add the "after post-net" error (it seems to speed up convergence) along with L2 regularization on all network weights.
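For comparison, a rough PyTorch sketch of that combined objective (the function and argument names are mine, not from either repo; the L2 term is typically handled via the optimizer's weight_decay):

```python
import torch
import torch.nn.functional as F

def tacotron2_loss(mel_before, mel_after, mel_target, stop_logits, stop_target):
    """MSE on the decoder (pre-postnet) and postnet outputs plus stop-token cross-entropy."""
    before_loss = F.mse_loss(mel_before, mel_target)
    after_loss = F.mse_loss(mel_after, mel_target)
    stop_loss = F.binary_cross_entropy_with_logits(stop_logits, stop_target.float())
    return before_loss + after_loss + stop_loss

# L2 regularization on the weights can be added with, e.g.
# torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```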

The thing is, my loss decreases amazingly fast (which seems odd) and then plateaus in only 600 steps, while mel spectrogram quality continues to improve (along with audio quality, checked using a simple Griffin-Lim just to monitor linguistic improvement without paying much attention to audio quality in general). Is that supposed to be normal? I will try to share some TensorBoard plots later (just waiting for the alignment to appear first).

A-Jacobson commented 6 years ago

@Rayhane-mamah, I'm glad you got your project working. I was also going to suggest tuning your learning rate or using cyclical learning rates. Since the paper didn't give weight-init details, and we aren't using the same batch size or dataset as the paper, their hyperparameters aren't going to be that relevant to us. I got the quickest alignments using the techniques from the stochastic gradient descent with warm restarts paper. My code for that has been pushed to this repo.
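For anyone wanting to try the same kind of schedule, a minimal sketch using the SGDR-style scheduler that ships with recent PyTorch versions (the stand-in model and the T_0/T_mult values are purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# cosine-anneal the LR, restarting every 1000 steps and doubling the cycle length each restart
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000, T_mult=2)

for step in range(3000):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                           # one scheduler tick per training step
```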

As for the loss, there's a screenshot of mine in the README. The starting point is cut off, but it usually starts at about 300.0. An exponential decrease like that isn't surprising to me, since we are using MSE and our targets aren't z-normalized (at least mine aren't). If you think about the magnitude of the loss between a random 120 x 700 matrix and a true spectrogram, it's obvious that the value would start very high. Once the model starts outputting things in the correct range (blurry spectrograms) and has to make small adjustments (vocal patterns), the loss will naturally decrease much more slowly.

With regards to the attention mechanism, I seem to remember Tacotron 1 using a two-layer attention, but as I've been focused on faithfully reproducing Tacotron 2, I don't know much about it. My attention works the same way it would in an NMT model; I have an example using it for NMT here (https://github.com/A-Jacobson/minimal-nmt) which produces quite good alignments after about 10 epochs (15 mins).

It seems that I need to do some profiling or look into that reduction factor though; my model is using about 6x the memory you're reporting.

Rayhane-mamah commented 6 years ago

About the size of the model: I just had a quick look at your code and might have found some causes of the big difference in memory usage I'm seeing.

I saw you're using 1024 units in each of your decoder RNN layers; I am using 1024 for the two layers combined (512 each) and 128 in each prenet layer (256 for the two layers). They may well have meant 1024 units per layer, but if the model gives nice results with fewer units I'll avoid adding complexity. Then again, this is a parameter we choose depending on the situation. On top of that, the reduction factor shrinks the model even more.

On reflection, your attention seems to reproduce Tacotron 2 much better; I will definitely try it out later. There's just one thing I wanted to check with you, since I don't have much experience with PyTorch: are you applying your post-net at each decoder step? Isn't it supposed to refine the decoder output after all frames have been predicted?

As for hyperparameters, I also noticed how our setup differs from the paper's. I toyed a little with the optimizer's parameters to reduce the loss oscillations and will probably tune them more at a later stage.

Did your model generate any good sounding samples yet?

A-Jacobson commented 6 years ago

That's an interesting interpretation, I didn't think about that and without the authors code to refer to we can truly only guess.

With regards to results, I haven't written up a Griffin-Lim, so I've just been comparing the output spectrograms to the targets, and all I can say is that they look quite similar. I was planning to wait until my wavenet is done to listen to them.

fatchord commented 6 years ago

Hi guys, hope you don't mind me chiming in. Regarding this from the paper: "cumulative attention weights from previous decoder time steps" - my initial interpretation of that was to make a tensor of size [Batch, 1, EncoderTimeSteps] and cumulatively add the attention weights from each step to it. So the attention convolution would be looking at all attention locations it had previously contributed to. What do you think?

Looking at the Bengio paper - if I'm not mistaken, they only convolved the attention weights of the previous time step - that sounds very different from "cumulative attention weights".

A-Jacobson commented 6 years ago

Welcome @fatchord. I originally thought the same, but ended up following the equations in the Bengio paper. I then realized that the attention weights are effectively cumulative, since we add the previous weights during the calculation.

Ignoring the other information, if:

weights[t] = weights[t] + weights[t-1]

# and

weights[t-1] = weights[t-1] + weights[t-2]

That fits the definition of cumulative to me. Does that make sense?

What still isn’t clear to me is if we backprop through the attention weights from the previous step or if we detach the weights from the graph. I’ve tried both and haven’t noticed much difference.
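To make that concrete, a rough sketch of the accumulation inside a decoder loop, with the detach question left as a flag (`attention` and `decoder_step` are hypothetical callables, e.g. an attention module shaped like the one sketched earlier in this thread):

```python
import torch

def run_decoder(attention, decoder_step, keys, n_steps, detach_alignments=False):
    """Accumulate alignments across decoder steps and feed the running sum
    back into the location-sensitive attention."""
    batch, enc_steps, key_dim = keys.size()
    cumulative = keys.new_zeros(batch, enc_steps)     # running sum of alignments
    query = keys.new_zeros(batch, key_dim)            # placeholder initial query
    outputs = []
    for _ in range(n_steps):
        context, alignments = attention(query, keys, cumulative)
        # either keep the previous alignments in the graph or cut them off here
        cumulative = cumulative + (alignments.detach() if detach_alignments else alignments)
        query, frame = decoder_step(context)           # hypothetical decoder step
        outputs.append(frame)
    return torch.stack(outputs, dim=1), cumulative
```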

fatchord commented 6 years ago

Ah I think I see what you mean now - if the current weights are calculated from the last then this is cumulative, but it's a kind of transformed cumulation. Is that what you mean?

Regarding detaching the attention weights - my initial thinking would be to leave it in the graph as the attention is kinda like a recurrent net in itself right? I'm new to attention models so I'm not sure.

In my own PyTorch implementation I too have got very slow training times. I trained a big wavenet recently (took forever), but this part of Tacotron 2 is even slower - that's in direct conflict with what's stated in the paper; they wrapped up training in a day. Are they predicting multiple spectrogram frames per decoder time step? Surely they would have mentioned that?

A-Jacobson commented 6 years ago

They predict multiple frames in the first paper, I believe. Though it was my understanding that this one is meant to be faster. Granted, this is a Google paper that's full of rather Googley things, such as the use of an internal dataset and unknown hardware (what kind of GPU can run this with a batch size of 64?). They also mention that they trained their wavenet "with a batch size of 128 distributed across 32 GPUs".

Honestly, unless I'm doing something terribly wrong I'm not likely to fully train this as it's supposed to take a few hundred thousand steps to fully converge. (I can only get a few thousand steps per hour with a smaller batch size). But it has been a decent deep dive into encoder --> attention --> decoder architectures.

As for the wavenet, do you have a link to your implementation? I just managed to crank out what I think is a basic version, but the details in the paper are rather sparse. I'm still not clear on how to get it to generate audio based on the input spectrogram. Do we add the full spectrogram as a conditional input?

Rayhane-mamah commented 6 years ago

Hello again fellas.

About the wavenet: based on the r9y9 implementation (https://github.com/r9y9/wavenet_vocoder/), he upsampled the mel spectrograms to use them as local conditioning.

It's true that in the T2 paper things are left quite unexplained, especially the mixture of logistic distributions part. Because of that, one has to follow the references to understand what's going on.

PS: I too am wondering what kind of GPU can run T2 with a batch size of 64 and NO reduction factor (they explicitly said they predict one frame at a time).


A-Jacobson commented 6 years ago

Right, the mixture of logistic encoding is from pixelcnn++.

I just built a baby wavenet that’s generating sine waves right now. It seems I have to add local conditioning.

My understanding right now is:

Training:

Wavenet(full_audio, full_spectrogram) --> Output = full_decoded_audio

- The spectrogram is upsampled to the same length as the audio (undoing the FFT hops).

- The upsampled spectrogram is used as local conditioning for each wavenet block.

Inference:

Wavenet(audio_start_token, spec_frame?) --> Output = single audio frame?

As you can see, I'm not too clear on the behavior just yet. To me, the original wavenet paper left a lot of details out as well, including the number of filters in all their layers! Though I believe I found some of that info on one of the authors' Twitter.. haha.

Rayhane-mamah commented 6 years ago

You work fast x)

I am not really sure about inference time, as I am still writing the training part, so I won't misguide you. I should be able to finish the entire thing this weekend (right after I finish those exams). As for the training part, I share the same understanding.

I will keep you informed if I find anything useful.

fatchord commented 6 years ago

I've had decent enough results with my wavenet. I haven't implemented the mixture of logistics because I wanted to replicate the sound quality of the original wavenet paper first. Actually, it'd be great to get your thoughts on the sound quality: testset.tar.gz

Obviously there's noise from the 8-bit encoding, but besides that, all I can hear is a little bit of phasey/flangey noise around the top end.

My implementation is basically a gigantic Jupyter notebook right now, so it badly needs refactoring. Once I get around to that (I'm mainly busy with WaveRNN right now), I'll upload it to GitHub.

fatchord commented 6 years ago

Oh, I almost forgot - for wavenet hyperparameters, have a look towards the end of the distilled wavenet paper - in Section 5 (Experiments) they give details.

A-Jacobson commented 6 years ago

Thanks for the pointers with regards to wavenet; the diagrams from the original paper led me to believe that the kernel size should always be 2! Ironically, the parallel wavenet paper was the only wavenet paper I hadn't read in depth, since I thought it was just about speedups.

I'm having trouble playing that sound file on this computer - it says the format isn't supported - though I'm not sure my ears would notice anything about it that yours couldn't anyway.

I also don't mind giant Jupyter notebooks too much, since I'm just looking for hyperparameters and small details, so please feel free to share your implementation. That being said, is the audio you generated conditioned on spectrograms, or is it using the linguistic features from the original paper? Also, are you using fast generation queues or parallel wavenet? I've found that naive generation of really anything (even a sine wave) is prohibitively slow.

fatchord commented 6 years ago

Yeah, I should really write a proper wav-saving function - librosa does something strange sometimes when saving - yet another reason why I need to refactor the entire thing. In the meantime, I recommend you check out the r9y9 and kan-bayashi repos; both have legit implementations.

As for conditioning - I'm using mel spectrograms. Be extremely careful with mel/sample alignment - that's something that tripped me up initially. I'm also using fast queues - they're not that fast, but they do cut down on naive generation time by a factor of around 4 in my experience. That's why I'm so interested in WaveRNN right now - it took around 20 mins to generate the little sample I uploaded earlier - totally impractical.

A-Jacobson commented 6 years ago

Yeah, the fast queues only reduce generation time in proportion to the number of layers in your network, and they mentioned it isn't much faster (2x maybe) unless you're using more than 10 layers or so. The original is O(2^L), fast queues are O(L), but parallel wavenet claims real-time performance; perhaps that's worth a look. I don't really want to wait 2 hours to generate a decent-sized eval clip!

I have checked out both of those repos and a few Chainer repos. But at this point I feel like it's worth my time to build my own in as clear a way as possible, since I believe the concept of this pseudo-recurrent generative convnet with a wide receptive field could be adapted to other domains. Basically, I'd like to understand it well enough to pull out the ideas where appropriate.

Rayhane-mamah commented 6 years ago

Hello it's me, not Mario! (that wasn't funny..)

@A-Jacobson, I have tried implementing your attention here and I'm using it in the decoder here.

Just to make sure I haven't made any silly mistakes:

Am I correct?

Now, I remember you saying that your model learned to ignore the input padding at an early stage; well, mine only seems to look at the padding at such early stages. (In the following plots, I am actually using a concatenation of the two LSTM cells' hidden states as the query vector, even though in the repository I am only using one of them.)

[three attention alignment plots attached]

Could you also provide some alignment plots from your model while it's learning attention? It would really help to have an idea of what it's supposed to output so I know when it's working properly (e.g. what the alignments look like in the first few thousand steps, before full alignment appears).

Thanks a lot!

A-Jacobson commented 6 years ago

In concept it looks correct, except that I use only the last layer's hidden state as the query vector, as is common in NMT. As a warning, I'm not familiar with the base attention class you're inheriting from in tf.contrib, and it's been a while since I've touched TF, so I'm not likely to catch subtle bugs in your code!

As for the padding, I'm explicitly shutting off the gradient to the padding embedding in the decoder, so perhaps that is a difference. It's hard to say after only 300 steps though. Many of the plots from other repos, like the one I referenced in this thread, didn't show any kind of alignment at all until ~20k steps. Most of my plots are only from ~3k, so I mostly have checkerboards with the padding as a blur as well.
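(For reference, a minimal sketch of one way to keep the pad embedding at zero and gradient-free in PyTorch, assuming index 0 is the pad symbol; the repo may do this differently:)

```python
import torch.nn as nn

num_symbols = 100  # size of the text vocabulary, illustrative
# index 0 is reserved for padding: its embedding stays at zero and receives no gradient
embedding = nn.Embedding(num_embeddings=num_symbols, embedding_dim=512, padding_idx=0)
```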

Rayhane-mamah commented 6 years ago

Yes, I am aware of the basic concept of attention in NMT; I mostly wanted some key signs to look for before the attention appears.

As for _BaseAttentionMechanism, I am basically inheriting from it to use its memory (encoder outputs) saving feature and its initial alignments initialization.

I tried masking the decoder outputs where the target frames are done and the model is only going through padding, but it didn't seem to affect the overall performance (audio) much, so I stopped imputing the paddings to make training go a bit faster.

Also, don't you think that leaving the padding in the target frames and explicitly setting the corresponding token targets to 1 helps the model better predict when to stop dynamically, since it might reduce the imbalance between 0s and 1s? (Just a guess; I haven't had the chance to test it explicitly yet.)

I will probably run your NMT and look for things that can give signs of improvement. It's not practical to wait for hours/days just to learn that the attention isn't working.


A-Jacobson commented 6 years ago

I think it's better to ignore the padding, since it won't always be available as a feature; it's just an artifact of wanting to batch inputs. Learning to use padding as a feature makes the results dependent on the length of the other items in your batch!

Your point about the hours/days is actually the reason I made that NMT repo! It should train in ~15-30 mins. Though again, I would refer you to the thread in the Tacotron repo I posted above; they have a bunch of attention plots posted there. The checkerboard-type alignments and magnitudes at least seem to match what they were getting at the start. When I was using the output of the post-net as my query vector I was getting entirely different attention masks.


A-Jacobson commented 6 years ago

Plots at 4k steps. You can see that for each frame it doesn't put weight on anything past the end token, except in cases where there's silence (it doesn't seem to understand commas or periods yet; I also padded my spectrogram with zeros and perhaps should have used -80, which is the value of an all-zero audio window before spectrogram extraction). On the right-hand side you can see it is completely uncertain where to look when outputting zero-valued spectrogram frames, since 1.0 / ~140 (length of the sequence) = 0.0067. That's what I meant when I said it was learning to ignore padding.

[attached plots: attention_4k, output_4k (model output), target_4k (target spectrogram)]

Rayhane-mamah commented 6 years ago

Now I see what you meant! I thought you were talking about the encoder padding at first (in the link you referred to earlier, you can see that his model ignores the upper band of the encoder timesteps).

By the way, you could group your data by audio length and always pad to the longest audio in a batch to reduce the amount of padding. It's a trick I saw keithito using, and it should save you some memory and reduce the padding size a lot; I rarely notice long paddings in the spectrograms.
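A minimal sketch of that trick as a PyTorch collate_fn, padding each batch only up to its own longest item (the pad values and tensor shapes are assumptions, not taken from either repo; grouping the dataset by length before batching keeps the per-batch padding small):

```python
import torch

def pad_collate(batch, text_pad=0, mel_pad=-80.0):
    """Pad a batch of (text, mel) pairs only up to the longest item in this batch.

    text: LongTensor of shape (T_text,), mel: FloatTensor of shape (T_mel, n_mels)."""
    texts, mels = zip(*batch)
    max_text = max(t.size(0) for t in texts)
    max_mel = max(m.size(0) for m in mels)
    text_batch = torch.full((len(batch), max_text), text_pad, dtype=torch.long)
    mel_batch = torch.full((len(batch), max_mel, mels[0].size(1)), mel_pad)
    for i, (t, m) in enumerate(zip(texts, mels)):
        text_batch[i, :t.size(0)] = t
        mel_batch[i, :m.size(0)] = m
    return text_batch, mel_batch

# usage: DataLoader(dataset, batch_size=32, collate_fn=pad_collate)
```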

Finally, considering how your output values range, yes you probably should set the padding to -80. (That or normalize your data).

But the overall spectrogram looks great. Well done


A-Jacobson commented 6 years ago

You're absolutely right about all of those things! Unfortunately, I'm reluctant to change a hyperparameter and reset training yet again, which is why I have held off. I suppose I could sort the data between training sessions, but I'm really more interested in learning the concepts and checking correctness than in training efficiency at this point. When you started this thread I was just starting my second day working on this thing.

fatchord commented 6 years ago

@A-Jacobson I noticed you're using embeddings for the input to your wavenet - I tried that and it didn't work so well; you're better off with scalars. One-hot inputs work too, but in my experience a simple scalar is the best of all.

Btw - WaveRNN is coming along nicely - check out this unconditioned output - it's early enough in training too:
12k_steps.wav.tar.gz

A-Jacobson commented 6 years ago

Hah, is that what random phonemes strung together sound like? How long does generation from a WaveRNN take vs. a normal wavenet?

A-Jacobson commented 6 years ago

Hey @fatchord, I started to add conditioning to my wavenet but realized the Tacotron 2 paper asks for a 12.5 ms frame hop, which I used. Unfortunately, that means the spectrogram features have to be upsampled by ~275. Minor differences (like frames generated from an incomplete audio window) can be handled by clipping, but they claim they did the upsampling with two transposed convolutions and, of course, didn't share the parameters they used in those layers. Did you follow this same recipe, or do you use a friendlier hop size with your spectrograms, or perhaps use the feature-repeating strategy instead of an upsampling network?

A-Jacobson commented 6 years ago

Now that I look, they're generating audio at 24 kHz. Even at that rate you'd have to upsample by a factor of 300. Seems odd to try to do that in two layers - maybe stride 10 then stride 30?
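If anyone wants to experiment, a rough sketch of a two-layer transposed-convolution upsampler with those guessed strides (10 then 30, i.e. 300x overall); setting kernel_size equal to the stride keeps the output length exact, but none of these numbers come from the paper:

```python
import torch
import torch.nn as nn

class SpecUpsampler(nn.Module):
    """Upsample mel frames to the audio sample rate with two transposed convolutions."""
    def __init__(self, n_mels=80, strides=(10, 30)):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose1d(n_mels, n_mels, kernel_size=strides[0], stride=strides[0]),
            nn.ReLU(),
            nn.ConvTranspose1d(n_mels, n_mels, kernel_size=strides[1], stride=strides[1]),
        )

    def forward(self, mel):
        # mel: (B, n_mels, T_frames) -> (B, n_mels, T_frames * 300)
        return self.layers(mel)

# e.g. SpecUpsampler()(torch.randn(1, 80, 100)).shape == (1, 80, 30000)
```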

fatchord commented 6 years ago

@A-Jacobson re: WaveRNN - generation is around 1100 samples per second. The paper mentioned 1600 for their regular TensorFlow implementation, so I guess it's not too far off - the dynamic graph might be slowing it down a little. I've uploaded a public repo if you wanna check it out.

Re: spectrograms - I sampled at 22050 Hz with a hop size of 256 and an FFT size of 1024. That's roughly in the same ballpark as T2's settings but at a reduced sample rate. I recommend checking out r9y9's wavenet vocoder spectrogram preprocessing if you are unsure of anything.
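For reference, a sketch of that preprocessing with librosa (the 80 mel bands and the dB conversion are additions, roughly following a T2-style mel setup rather than any particular repo's exact code):

```python
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load audio and compute a log-mel spectrogram with the settings discussed above."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # roughly in [-80, 0] dB

# each mel frame then corresponds to hop_length == 256 audio samples,
# which is what has to line up with the wavenet conditioning.
```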

A-Jacobson commented 6 years ago

I'll check out the WaveRNN; I've heard good things! It's great that you've got it going. I've been looking into parallel wavenet as well, since it seems you never actually have to sample from the vanilla wavenet to train it. The mechanics of the training look a little tricky though.

As for the r9y9 wavenet, I've gone through it and in general it's great stuff! But he diverges from the paper a bit with the spectrogram preprocessing and the conditioning layers. As you said, he opts for a 256 hop size. He also uses 4 upsampling layers with strides [4, 4, 4, 4] rather than the two they mentioned in the paper. Mine should have everything exactly as in the paper, except that librosa doesn't have a preemphasis function. It leaves me with the awkward 275 hop size, which I'm upsampling with a stride-16 then a stride-17 layer right now...


fatchord commented 6 years ago

Well, if you're going to follow the paper religiously, then you're going to be stuck with those awkward upsampling scales. Still, there must be a good reason why they picked them.

fatchord commented 6 years ago

Hey guys, what did you make of the latest tacotron papers? I think they're pretty amazing, the style tokens idea is great. Also this opens up the opportunity to use noisy datasets.

One of the co-authors popped up here https://www.reddit.com/r/MachineLearning/comments/87klvo/r_expressive_speech_synthesis_with_tacotron/ - definitely worth checking out. No 'tricks' held back apparently!

Rayhane-mamah commented 6 years ago

They never stop, do they :p I can however confirm that no tricks are held back. I just got my implementation working with the exact same architecture as in the T2 paper. I was just silly not to realize there was a second version of the paper, and I had to correct my understanding of the attention mechanism a little bit (then again, this is why we do such projects: to learn :p). The relevant commit will be pushed tonight.

As for the style tokens I will definitely read it in depth tonight. After just looking at the graphs, this sounds very exciting!


fatchord commented 6 years ago

Congrats on getting T2 to work! I didn't know there was a revised paper either - must have a look now.

Rayhane-mamah commented 6 years ago

Thank you @fatchord; the revised version can be found here. I can't seem to find the original paper anywhere anymore, but luckily I kept it in PDF format here.

The difference might seem minimal between the two papers, but I really find the second version clearer. Weird.

A-Jacobson commented 6 years ago

I like the idea of being able to generate high-quality speech from noisy speech. Definitely worth a look when I (and my GPU) get some free time again. I saw that Reddit thread! They were really getting grilled. Though, other than the wavenet parts, I think the architecture descriptions are pretty clear. The problem, of course, is the use of internal data, which makes it impossible to completely validate our implementations. It would be nice if they at least let us know the distribution of utterance lengths in their internal data!

fatchord commented 6 years ago

I think the dataset problem may be solvable with crowdsourcing. I mean, there's nothing stopping a bunch of random people on the internet from picking a high-quality commercial audiobook and manually segmenting it while logging all the time-stamps of the start/end of utterances, then creating a script that segments according to those time-stamps. If enough people got involved it might only amount to a morning's worth of work per person.

That way anyone can buy the audiobook, run the script and have 20+ hours of high-quality, noise-free TTS data. All legal problems regarding distribution are avoided since the dataset contains no audio, just metadata.

I was thinking of creating a dedicated subreddit for models like wavenet, tacotron, samplernn etc called r/AudioModels and this might be a nice project to start it all off. What do you guys think?

Rayhane-mamah commented 6 years ago

@fatchord, I think it's an awesome idea if it works!

In the meantime, you can check out this newly released open-source speech dataset that can be used for TTS, speech recognition (with the add-noise feature), and audio cleaning (extracting speech from noisy audio). It contains several languages with multiple readers (eng-US, eng-UK, German...), and each reader always has more than 24 hours of speech. I find it very well done, and one should definitely have a look at it!

fatchord commented 6 years ago

@Rayhane-mamah What a find, thanks! I just downloaded the eng_UK and the quality is really good.

fatchord commented 6 years ago

@Rayhane-mamah @A-Jacobson Hey guys, I just created https://old.reddit.com/r/AudioModels/ today if you wanna check it out.

Cheers!