KdaiP / StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
MIT License

TensorBoard to visualize the mel and audio? #25

Open lpscr opened 2 months ago

lpscr commented 2 months ago

Hi @KdaiP

I’m trying to add TensorBoard logging to visualize the mel spectrograms and audio, as shown below, so that the audio generated at each epoch can be played back.

I managed to get it working, but there is a lot of noise at the end of the audio (the part I marked with a red rectangle), which makes it very difficult to listen to. How can I remove this noise? Is it related to the reference and original mel?

Here is the code I’m using for training:

for epoch in range(current_epoch, train_config.num_epochs):  # loop over the train_dataset multiple times
    ...
    mels = model.module.synthesise(x, x_lengths, 25, 1.0, y, 1.0, "euler", 3.0)['decoder_outputs']

image

KdaiP commented 2 months ago

Hi, could it be the padded part of the audio?

If the validation data is taken from a batch, note that StableTTS does not sort the sequences in a batch by descending length, as the official VITS source code does. The sample you pick may therefore be shorter than the longest sequence in its batch, so its mel can end in padding.
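
If padding is the cause, one option is to cut the padded frames off before vocoding or logging. The following is only a minimal sketch of that idea, not code from StableTTS: `trim_mel`, `trim_audio`, `lengths` and `hop_length` are hypothetical names, and it assumes you already know the number of valid mel frames for each item in the batch. NumPy arrays stand in for tensors here; the same slicing works on PyTorch tensors.

```python
import numpy as np


def trim_mel(mel_batch, lengths, index=0):
    """Return one mel from a padded batch with its padding frames removed.

    mel_batch: array of shape (batch, n_mels, max_frames)
    lengths:   valid frame count for each item (assumed to be known)
    """
    n_frames = int(lengths[index])
    return mel_batch[index, :, :n_frames]


def trim_audio(audio, n_frames, hop_length):
    """Cut a 1-D waveform to the samples covered by n_frames mel frames."""
    return audio[: n_frames * hop_length]


# Toy batch: two "mels" with 80 bins, padded to 6 frames.
# The first item has only 4 real frames; frames 4 and 5 are padding.
batch = np.zeros((2, 80, 6), dtype=np.float32)
batch[0, :, :4] = 1.0
batch[1, :, :6] = 1.0
lengths = np.array([4, 6])

mel = trim_mel(batch, lengths, index=0)
print(mel.shape)  # (80, 4)
```

Another way to avoid the problem is to synthesise a single utterance at a time for the TensorBoard preview, since a batch of one has nothing to pad against.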