KdaiP / StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
MIT License

Is there a trick to stabilize training? #7

Closed Flux9665 closed 3 months ago

Flux9665 commented 3 months ago

Hi! Thanks for open-sourcing your work! I like the idea, so I tried copying the CFM decoder and using it in my TTS setup. At first I had issues with NaN values at all of the padded positions after the attention in the estimator was computed. I fixed this by using masked_fill with the x_mask instead of just multiplying by the x_mask, although I'm not sure why that was necessary for me.
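
For what it's worth, here is a minimal sketch (with made-up shapes and variable names, not the actual StableTTS code) of why masked_fill can behave differently from multiplying by the mask once NaNs have already appeared: NaN * 0 is still NaN, while masked_fill simply overwrites the padded positions.

```python
import torch

# x: (batch, time, channels), x_mask: (batch, time, 1), 1 = real frame, 0 = padding
x = torch.randn(2, 4, 8)
x[0, 3] = float("nan")  # pretend attention produced NaN at a padded frame
x_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]]).unsqueeze(-1)

# Multiplying does NOT clear the NaN, because NaN * 0 is still NaN:
print((x * x_mask)[0, 3, 0])                     # tensor(nan)

# masked_fill overwrites padded positions regardless of their current value:
print(x.masked_fill(x_mask == 0, 0.0)[0, 3, 0])  # tensor(0.)
```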

But once I could run a forward pass and get a CFM loss that was not NaN, I thought it was good to go. However, more problems came up. No matter what I tried, the CFM decoder would always produce a couple of NaN values after its first update during training. Do you have any idea what could cause this? Is there anything specific I need to do to stabilize the training? The setup I am using works very well for lots of other architectures, including e.g. the normalizing flow decoder of PortaSpeech, which seems pretty similar.

KdaiP commented 3 months ago

Thank you for your interest in StableTTS. In x_mask, "1" indicates that the element is included in the attention calculation, whereas "0" means that the element does not participate in it, so padded elements are set to "0" in x_mask. Could your mask be reversed, i.e. with "1" representing a padded element?
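
A minimal sketch of that convention, assuming x_mask is built from per-utterance lengths (illustrative only, not the StableTTS code itself):

```python
import torch

lengths = torch.tensor([3, 5])  # number of real frames per utterance
max_len = int(lengths.max())
x_mask = (torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)).float()
print(x_mask)
# tensor([[1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 1.]])
```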

Flux9665 commented 3 months ago

The mask is not reversed; I am, however, using a boolean mask instead of a float mask. But I could solve it with the masked_fill method, so that's no longer the problem, just something I found a bit strange. The problem is that after one training update, everything in the CFM decoder turns to NaN, even with a very small learning rate and gradient clipping. Do you have any idea what could cause this?

KdaiP commented 3 months ago

> The mask is not reversed; I am, however, using a boolean mask instead of a float mask. But I could solve it with the masked_fill method, so that's no longer the problem, just something I found a bit strange. The problem is that after one training update, everything in the CFM decoder turns to NaN, even with a very small learning rate and gradient clipping. Do you have any idea what could cause this?

Try converting the boolean mask into a floating-point mask, for example, attn_mask.to(x.dtype). This might solve the issue.

PyTorch's attention calculations can be confusing here, because the same mask in boolean format and in floating-point format can have dramatically different effects. For instance, in StableTTS, converting the attention_mask to boolean could lead to NaN issues because it makes the attention focus only on the padding values.
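
If it helps to see the difference concretely, here is a rough illustration (toy shapes, not the StableTTS code) of how torch.nn.functional.scaled_dot_product_attention interprets the two formats: a boolean mask selects which key positions may be attended to (True = attend, False = masked out; a query row with no True entries yields NaN), whereas a float mask is simply added to the attention logits.

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 1, 4, 8)                   # (batch, heads, seq, dim)
keep = torch.tensor([[[[True, True, True, False]]]])  # last key position is padding

# Boolean mask: False positions are excluded from the softmax entirely.
out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)

# The same 1/0 values as a float mask are merely *added* to the logits
# (+1 for valid keys, +0 for padding), so the padded key still participates.
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=keep.float())

print(torch.allclose(out_bool, out_float))            # False
```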

Flux9665 commented 3 months ago

You're right, it works now, thank you!

I thought a boolean tensor and a float tensor should be equivalent when used like this, but apparently there is a difference.