cantabile-kwok / cantabile-kwok.github.io


code #1

Open nikita-petrashen opened 1 year ago

nikita-petrashen commented 1 year ago

Hey man, amazing work!

Do you have plans to release the code?

cantabile-kwok commented 1 year ago

Sorry for the late reply; I was at ICASSP last week. Thanks for your interest in our work!

We currently do not have plans to release the code, as it is intertwined with other internal codebases. However, if you are looking for EmoDiff, most of the implementation borrows from the official Grad-TTS code (https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS). The key part, the soft-label guidance, is attached here:

# as a method of model.Diffusion class
    def classifier_decode_mixture(self, z, mask, mu, n_timesteps, stoc=False, spk=None,
                                  classifier_func=None, guidance=1.0,
                                  control_emo1=None, control_emo2=None, emo1_weight=None):
        # control_emo1 / control_emo2 should be [B, ] tensors of emotion class indices;
        # emo1_weight is the interpolation weight for the first emotion.
        h = 1.0 / n_timesteps
        xt = z * mask
        for i in range(n_timesteps):
            t = (1.0 - (i + 0.5) * h) * torch.ones(z.shape[0], dtype=z.dtype,
                                                   device=z.device)
            time = t.unsqueeze(-1).unsqueeze(-1)
            noise_t = get_noise(time, self.beta_min, self.beta_max,
                                cumulative=False)
            # =========== classifier part ==============
            xt = xt.detach()
            xt.requires_grad_(True)
            logits = classifier_func(xt.transpose(1, 2), mu.transpose(1, 2), (mask == 1.0).squeeze(1), t=t)

            probs_every_place = torch.softmax(logits, dim=-1)  # [B, T', C]
            probs_mean = torch.mean(probs_every_place, dim=1)  # [B, C]
            probs = torch.log(probs_mean)  # log of frame-averaged class probabilities

            control_emo_probs1 = probs[torch.arange(len(control_emo1)).to(control_emo1.device), control_emo1]
            control_emo_probs2 = probs[torch.arange(len(control_emo2)).to(control_emo2.device), control_emo2]
            control_emo_probs = control_emo_probs1 * emo1_weight + control_emo_probs2 * (1-emo1_weight)  # interpolate

            control_emo_probs.sum().backward(retain_graph=True)
            # NOTE: summing over the batch gives every component the same weight.
            xt_grad = xt.grad
            # ==========================================

            if stoc:  # adds stochastic term
                dxt_det = 0.5 * (mu - xt) - self.estimator(xt, mask, mu, t, spk) - guidance * xt_grad
                dxt_det = dxt_det * noise_t * h
                dxt_stoc = torch.randn(z.shape, dtype=z.dtype, device=z.device,
                                       requires_grad=False)
                dxt_stoc = dxt_stoc * torch.sqrt(noise_t * h)
                dxt = dxt_det + dxt_stoc
            else:
                dxt = 0.5 * (mu - xt - self.estimator(xt, mask, mu, t, spk) - guidance * xt_grad)
                dxt = dxt * noise_t * h
            xt = (xt - dxt) * mask
        return xt

As you can see, it is a modification of the reverse_diffusion function in the official repo. The other parts of the code should be more straightforward.
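For reference, here is a minimal sketch of how this method might be invoked at inference time. It is a hypothetical usage example, not the official API: model, classifier, mu, and mask stand for a trained Grad-TTS-style model with the method above attached, the time-conditioned emotion classifier, the text-aligned prior, and the frame mask; the emotion indices and values are placeholders.

import torch

# Hypothetical usage sketch (names and values are assumptions, not the repo's API).
# mu: [B, n_feats, T] text-aligned prior; mask: [B, 1, T] frame mask.
z = mu + torch.randn_like(mu)  # terminal noise around the prior, as in Grad-TTS
emo1 = torch.zeros(mu.shape[0], dtype=torch.long, device=mu.device)        # e.g. 0 = neutral
emo2 = torch.full((mu.shape[0],), 2, dtype=torch.long, device=mu.device)   # e.g. 2 = happy

mel = model.decoder.classifier_decode_mixture(
    z, mask, mu, n_timesteps=100, stoc=False, spk=None,
    classifier_func=classifier,
    guidance=1.0,                 # guidance scale; larger values give stronger emotion but may hurt quality
    control_emo1=emo1, control_emo2=emo2,
    emo1_weight=0.7,              # 70% emotion 1, 30% emotion 2
)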

nikita-petrashen commented 1 year ago

Thanks for the reply! I'm struggling with training the classifier conditioned on t. Noised mel-spectrograms at a fixed t are classified just fine, but when I try to condition the classifier on t and sample t during training, it starts to fall apart.

Could you share some insights?

I have tried the downward part of the U-Net used in Grad-TTS (the diffusion step estimator) as the classifier, but it fails.

Thank you

cantabile-kwok commented 1 year ago

I don't think there is a requirement that the classifier perform equally well for all t on the diffusion trajectory. For extreme cases like t=0.999, the noise is so strong that intuitively no classifier should be able to classify the corrupted sample perfectly. However, this does not affect the theory of classifier guidance, so normally the classifier performs well for small t but not as well for large t (if I understand your question correctly).

nikita-petrashen commented 1 year ago

I should have formulated my question more clearly.

When I don't condition my classifier on t, the accuracy for t = 0 is fine and decays with t. But when I add conditioning by adding positional embeddings to the feature maps, the accuracy is at chance level for all t. I'm using essentially the downward part of the U-Net estimator from Grad-TTS with mean pooling and a classifier head at the end.

So my question is: what classifier architecture do you use, and how do you condition it on t?

Thanks!

cantabile-kwok commented 1 year ago

I was using a stack of CNN layers (Conv1d + ReLU + BatchNorm1d + dropout). The time information is fed in similarly to Grad-TTS: the scalar time t is passed through a sinusoidal position embedding (SinusoidalPosEmb in https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS/model/diffusion.py) and some linear layers, and then added to the classifier's input. Maybe you can simplify your network structure, or look for potential bugs related to the conditioning on t, because the conditioning itself should not decrease the modeling ability of your network.
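For concreteness, here is a minimal sketch of such a classifier. It is a hypothetical reconstruction from the description above, not the authors' code: the layer sizes, number of blocks, and the simplified interface (it takes only the noised mel and t, omitting the mu and mask arguments used by classifier_func in the earlier snippet) are all assumptions.

import torch
import torch.nn as nn
from model.diffusion import SinusoidalPosEmb  # from the official Grad-TTS repo (model/diffusion.py)

class EmotionClassifier(nn.Module):
    # Hypothetical sketch: Conv1d + ReLU + BatchNorm1d + Dropout blocks,
    # with the scalar diffusion time t embedded and added to the input.
    def __init__(self, n_feats=80, n_emotions=5, hidden=128, n_layers=4, p_dropout=0.1):
        super().__init__()
        self.time_emb = SinusoidalPosEmb(n_feats)
        self.time_mlp = nn.Sequential(nn.Linear(n_feats, n_feats), nn.ReLU(),
                                      nn.Linear(n_feats, n_feats))
        blocks, in_ch = [], n_feats
        for _ in range(n_layers):
            blocks += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                       nn.ReLU(), nn.BatchNorm1d(hidden), nn.Dropout(p_dropout)]
            in_ch = hidden
        self.convs = nn.Sequential(*blocks)
        self.head = nn.Conv1d(hidden, n_emotions, kernel_size=1)  # frame-wise logits

    def forward(self, x, t):
        # x: [B, n_feats, T] noised mel-spectrogram; t: [B] diffusion time in (0, 1)
        t_emb = self.time_mlp(self.time_emb(t)).unsqueeze(-1)     # [B, n_feats, 1]
        h = self.convs(x + t_emb)                                 # time embedding added to the input
        return self.head(h).transpose(1, 2)                       # [B, T, n_emotions] logits

Training would presumably sample t, corrupt the mel accordingly, and apply a cross-entropy loss on these frame-wise logits (or on their frame average); the guidance snippet above then softmaxes the frame-wise logits and averages the probabilities over frames.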

nikita-petrashen commented 1 year ago

Yeah, the behaviour of my model is quite weird. Also do you classify every frame separately or the spectrogram as a whole?

Thank you

cantabile-kwok commented 1 year ago

I classify the spectrogram as a whole, i.e. I average-pool the frame-wise predictions.

nikita-petrashen commented 1 year ago

Thank you for your answers! Kind regards

nikita-petrashen commented 1 year ago

Sorry, one more question. How do you blend mu into the classifier?

cantabile-kwok commented 1 year ago

It is blended in the same way as time t: after some linear layers, it is added to the classifier's input, together with the time embedding.
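In code, sticking with the hypothetical sketch above, the extra conditioning might look like this (mu_proj is an assumed linear layer, e.g. self.mu_proj = nn.Linear(n_feats, n_feats)):

# inside the classifier's forward, with x and mu both shaped [B, n_feats, T]:
t_emb = self.time_mlp(self.time_emb(t)).unsqueeze(-1)        # [B, n_feats, 1]
mu_emb = self.mu_proj(mu.transpose(1, 2)).transpose(1, 2)    # [B, n_feats, T]
h = self.convs(x + t_emb + mu_emb)                           # both conditions added to the input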

nikita-petrashen commented 1 year ago

Hey, it's me again!

I have made some progress towards replicating your results. I used the U-Net model from the Grad-TTS paper for my classifier and just added a linear layer on top of the model (here is the code). I have also used the code snippet for guidance that you provided. However, the quality of the emotional synthesis I got is not as good as in your demo.

Base Grad-TTS sounds good in my experiments, and the classifier is able to reach good classification accuracy (0.85 on the eval set).

I trained a speaker-conditioned Grad-TTS on the Emotional Speech Dataset (no emotion conditioning), and then the classifier on the same data.

Could you please elaborate on how your architecture differs from the one I used?

Thank you very much!

cantabile-kwok commented 1 year ago

What exactly is the quality issue? I mean, is the synthesized audio quality bad, is the text mispronounced, or does the synthesized emotion not correspond to the controlled one? This will help diagnose the problem.

nikita-petrashen commented 1 year ago

Sorry, I meant the synthesized emotion.

Let me elaborate: 1) The emotions only start to become expressive with a larger gamma; however, with a larger gamma the overall quality starts degrading. 2) Grad-TTS without guidance provides satisfactory quality.

So the problem is that I can't find the balance between good quality (lower gamma) and emotion expressiveness (higher gamma).

cantabile-kwok commented 1 year ago

Hmm, I guess it may be because the U-Net structure for the classifier is too complex. The classification task is not difficult, so a simple model can handle it, while a more complex one might make the gradient hard to pass through. So I would change the classifier. I am using a stack of CNN layers as previously mentioned, and the parameter count is 0.43M, so something on that scale should already be enough.
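As a quick sanity check on model size, the parameter count of any PyTorch classifier module can be compared against that figure:

# count trainable parameters of the classifier (any torch.nn.Module)
n_params = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
print(f"classifier parameters: {n_params / 1e6:.2f}M")  # compare against the ~0.43M mentioned above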

nikita-petrashen commented 1 year ago

What led me to the U-Net architecture is that we are trying to classify each frame separately, so the temporal dimension of the features has to stay the same; the task is essentially akin to semantic segmentation. I'll try a stack of convolutions that preserve the temporal dimension. Please verify my ideas (attached diagram: emo_classifier.drawio).

I figure C should be 16 or 32

Thank you very much!