[Bug] Fine tuned XTTS v2 produces strange sounds for short text

ukemamaster commented 7 months ago

Describe the bug

I have fine tuned XTTS v2 model on my own data containing both long and short audios (with the following histogram showing duration in seconds on x-axis. Labels 'old' and 'new' represent 2 datasets with long and short audios respectively.)

data_es_mix_hist

But the model produces strange sounds in case of 1-2 words text, like the following 2 examples for text='hola':

https://github.com/coqui-ai/TTS/assets/59258087/9e734e4b-3954-4adf-9919-7af42c8a28ad

https://github.com/coqui-ai/TTS/assets/59258087/f2f4b964-e1cd-4986-9f4c-d082a0a53d10

It seems like the model tries to produce at least 3 seconds audio even if the text is very short. And thus it adds some meaningless sounds to the sound of the original word in text.

@erogol Is there any way to avoid this behavior? or any parameter (may be in model args) to control this? There are gpt_start_audio_token and gpt_stop_audio_token parameters in TTS.tts.models.xtts.XttsArgs() class but i am not sure what is the impact of these parameters?

To Reproduce

N/A

Expected behavior

Should produce short audio for short text.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.23.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

ukemamaster commented 7 months ago

I tried several times to re-cut the data into ranges from 0.5s to 20s, guaranteeing alignment with the corresponding text. But nothing improves. There might be a difference between model args in the training recipe and in the already trained model provided.

@erogol Can you please make sure the model args provided in the training recipe are the same as your own trained model?

bensonbs commented 7 months ago

Same Issues

ukemamaster commented 7 months ago

@bensonbs Have you fine tuned the xtts-v2 model on your own dataset? Can you share a histogram of the audio lengths of your dataset? Have you tried to modify the training code or model args to avoid this?

insomnia777 commented 7 months ago

Same Issues

kaveenkumar commented 6 months ago

Same issue. Pre-trained XTTSv2 produces extra speech after the intended "text", 10-20% of the time

peterliu2023 commented 5 months ago

Same issue. The pretrained Xtts v2 generate extra speech randomly.

bensonbs commented 4 months ago

I have implemented the Diversified Perturbation Optimized (DPO) loss in TTS/tts/layers/xtts/gpt.py to improve the model's generalization ability and robustness. This implementation aims to address the issue of strange sounds occurring for short text inputs. By introducing the DPO loss, the model is expected to generate more consistent and natural-sounding audio output, even for shorter text sequences.

Code Snippet: TTS/tts/layers/xtts/gpt.py

text_logits, mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

reject_text_logits, reject_mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

text_probs = F.softmax(text_logits, dim=-1)
mel_probs = F.softmax(mel_logits, dim=-1)

loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs)
loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs)

TTS/tts/layers/xtts/trainer/gpt_trainer.py

        loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss"] = loss_dict["loss_text_ce"] + loss_dict["loss_mel_ce"] + loss_dict["loss_text_dpo"] + loss_dict["loss_mel_dpo"]

VRAM Usage and Training Time Comparison:
- Without DPO loss: VRAM usage: X GB Training time per epoch: Y minutes
- With DPO loss: VRAM usage: 2X GB Training time per epoch: 2Y minutes

insomnia777 commented 4 months ago

can you give me an explanation? and how to try it?

bensonbs commented 4 months ago

can you give me an explanation? and how to try it?

When the GPT-2 model generates shorter sentences, it sometimes fails to accurately produce the [STOP] token, resulting in the inclusion of peculiar sounds in the generated content. These sounds may be inconsistent as they are not explicitly guided, meaning that each generation might differ. To address this issue, during training, I compare the outputs of two generations produced under the same conditions to detect any peculiar sounds. Whether both generations contain strange sounds or only one does while the other doesn't, the model receives a penalty. This encourages it to avoid generating incoherent random content.

Methods can refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py. I am currently testing which loss function is more stable. Compared to cross entropy, MSE can more accurately eliminate abnormal sounds, but I am not sure if it is theoretically correct.

This method can only be used during fine-tuning, and when using this method, make sure that your fine-tuning dataset includes enough short audio files.

insomnia777 commented 4 months ago

can you give me an explanation? and how to try it?

When the GPT-2 model generates shorter sentences, it sometimes fails to accurately produce the [STOP] token, resulting in the inclusion of peculiar sounds in the generated content. These sounds may be inconsistent as they are not explicitly guided, meaning that each generation might differ. To address this issue, during training, I compare the outputs of two generations produced under the same conditions to detect any peculiar sounds. Whether both generations contain strange sounds or only one does while the other doesn't, the model receives a penalty. This encourages it to avoid generating incoherent random content.

Methods can refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py. I am currently testing which loss function is more stable. Compared to cross entropy, MSE can more accurately eliminate abnormal sounds, but I am not sure if it is theoretically correct.

This method can only be used during fine-tuning, and when using this method, make sure that your fine-tuning dataset includes enough short audio files.

Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on median character-per-second data?

tuanh123789 commented 2 months ago

can you give me an explanation? and how to try it?

When the GPT-2 model generates shorter sentences, it sometimes fails to accurately produce the [STOP] token, resulting in the inclusion of peculiar sounds in the generated content. These sounds may be inconsistent as they are not explicitly guided, meaning that each generation might differ. To address this issue, during training, I compare the outputs of two generations produced under the same conditions to detect any peculiar sounds. Whether both generations contain strange sounds or only one does while the other doesn't, the model receives a penalty. This encourages it to avoid generating incoherent random content.

Methods can refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py. I am currently testing which loss function is more stable. Compared to cross entropy, MSE can more accurately eliminate abnormal sounds, but I am not sure if it is theoretically correct.

This method can only be used during fine-tuning, and when using this method, make sure that your fine-tuning dataset includes enough short audio files.

can you share some sample with DPO loss ?