StevenZYj opened this issue 6 years ago
Hello @StevenZYj, thanks for reaching out! Also sorry for being late to answer, I didn't find much free time to write a proper legendary comment.. :)
I know the feeling, MoL is a pain to understand haha.. no problem though, let's go through this step by step (with the assistance of Gaussian distribution, things tend to become easier).
Before we get into this, I want to apologize in advance for any typos or careless mistakes; most of what follows comes from my personal thinking and observations. While I tried to cover the most important stuff, I am always open to any improvements to my modest comment! Please sit back and enjoy :)
Let's start with easy stuff and move to more complicated levels as we proceed:
Sampling: In most high-resolution applications where the model is also autoregressive (calls itself recursively on inputs moving through time), you will usually find that sampling randomly from the output distribution works better than other approaches (random sampling with softmax probabilities instead of picking the argmax when using 'quantize-wavenet'). MoL or Gaussian sampling are not much different: the only core difference is that we don't do multinomial sampling over N classes using N probabilities; instead we sample either from a Gaussian or from a mixture of logistics.
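As a small illustration of random sampling versus argmax (my own NumPy sketch, not the repo's code; the logits are made up):

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Sample a class index from softmax probabilities instead of taking argmax."""
    # Stable softmax: subtract the max logit before exponentiating.
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
samples = [sample_from_logits(logits, rng) for _ in range(1000)]
# argmax would always return class 0; sampling returns 0 most often but not always.
```

This is why, at high sample rates, sampling avoids the "frozen" repetitive outputs that greedy argmax decoding tends to produce.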
Now, if I had to make a neural network that always samples from Gaussian distributions, the most natural way to think of it would be to make the NN predict the mean and scale then use those parameters to sample from a normal distribution with the predicted mean and scale. This is widely used in VAEs to create latent Gaussian spaces for future sampling.
So, at synthesis time (we will get to training later), the network predicts a mean and a scale; we then use those two predictions to pick a sample from the corresponding Gaussian distribution. For numerical stability, and because TF doesn't have an explicit efficient way to force our scale output to be positive, we instead predict log(s) and follow it with an exponential, which guarantees s > 0. We construct the Gaussian distribution, pick a random sample from it and clip the prediction to [-1, 1] because we suppose the audio is scaled to [-1, 1]. Code can be found in gaussian.py. Simple enough?
But, what does "mixture" stand for then? Here's where things become really interesting. A mixture of logistics (MoL) is a set of M Logistic distributions (M is 10 in our experiments), each with its own loc and scale. For that purpose we add a third parameter to every Logistic distribution, a mixture weight pi_i, which determines the probability of that distribution being picked. With that, the output of the model should be M * 3 channels
(30 in our experiments).
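The Gaussian synthesis step above can be sketched as follows (my own NumPy illustration, not the actual gaussian.py code; the mean/log-scale values are made up):

```python
import numpy as np

def sample_gaussian(mean, log_scale, rng):
    """Given the network's predicted mean and log-scale, draw one audio sample.

    The exponential guarantees a positive scale; the final clip keeps the
    sample inside the [-1, 1] range the audio was normalized to.
    """
    scale = np.exp(log_scale)              # s > 0 by construction
    x = rng.normal(loc=mean, scale=scale)  # draw from N(mean, scale)
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(42)
x = sample_gaussian(mean=0.1, log_scale=-3.0, rng=rng)
```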
So, at synthesis time, the network first picks which Logistic distribution (out of M) to use. This is done by sampling from a softmax over the distribution probabilities (logit_probs
in the code). I know these lines are probably confusing:
https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L90-L93
It turns out there is something called the Gumbel-max trick, which mimics softmax sampling: adding Gumbel noise -log(-log(u)) to the logits and taking the argmax is equivalent to sampling from the softmax over those logits. So basically, the idea here is that we apply a softmax-like sampling to select which Logistic distribution to use.
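The Gumbel-max selection can be sketched as follows (my own NumPy illustration, not the repo's code; the logits are made up):

```python
import numpy as np

def gumbel_max_sample(logit_probs, rng):
    """Pick a mixture component: argmax over logits perturbed by Gumbel noise.

    This is distributed exactly like sampling from softmax(logit_probs).
    """
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=logit_probs.shape)
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logit_probs + gumbel))

rng = np.random.default_rng(0)
logit_probs = np.array([0.0, 3.0, 0.0])   # component 1 heavily favoured
picks = [gumbel_max_sample(logit_probs, rng) for _ in range(1000)]
# Component 1 gets picked with probability softmax(logit_probs)[1] ~ 0.9.
```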
Now to the sampling itself. For MoL, we use something called inverse (transform) sampling. I always see people asking why we switch to the CDF (a sigmoid) when using the Logistic distribution; it actually makes training and sampling from the distribution very easy. Let's first consider the CDF of the Logistic distribution, equation (1):

CDF(x) = sigmoid((x - mu) / s) = 1 / (1 + exp(-(x - mu) / s))

The main idea of sampling is that we want to draw a random variable X distributed according to the Logistic distribution. The idea of inverse sampling is to do exactly that, but through the CDF: if y is uniform on (0, 1), then x = CDF^-1(y) follows the Logistic distribution. By inverting equation (1) we find the quantile function, equation (2):

x = mu + s * log(y / (1 - y))

In practice, we select a random uniform y in (1e-5, 1 - 1e-5) to avoid the saturation regions of the sigmoid, then determine x from it. If things got a bit too complicated at this point, use the Logistic distribution link above for plot assistance. Notice that if our randomly picked y = 0.5, x will be exactly the location (median) of the Logistic distribution, which makes sense. Luckily for us, by picking a y between 1e-5 and 1 - 1e-5 we get an x that is plausible for those mean and scale parameters according to the PDF.
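The quantile-function sampling above can be sketched as follows (my own NumPy illustration; the loc/scale values are made up):

```python
import numpy as np

def sample_logistic(loc, scale, rng):
    """Inverse-CDF sampling: draw u ~ Uniform(1e-5, 1 - 1e-5), then push it
    through the quantile function x = loc + scale * log(u / (1 - u))."""
    u = rng.uniform(1e-5, 1.0 - 1e-5)
    return loc + scale * (np.log(u) - np.log(1.0 - u))

# u = 0.5 lands exactly on loc, the median of the Logistic distribution.
x_mid = 0.2 + 0.1 * (np.log(0.5) - np.log(0.5))

# The empirical mean of many samples should sit near loc.
rng = np.random.default_rng(1)
xs = [sample_logistic(loc=0.0, scale=0.1, rng=rng) for _ in range(2000)]
```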
Perfect, we got how sampling is done at synthesis, now how do I train my model to make accurate predictions for these means and scales? and in a mixture case, how does the model learn to correctly select the Distribution to use? This brings us to the second part of this long comment :)
Training: To train sampling layers in neural networks (sometimes called parametrization layers), the objective is to output distribution parameters that maximize the likelihood of a target y being sampled from that distribution. i.e: we want the distribution parameters that maximize the probability that a real target y is drawn from this distribution (I repeat this a lot because it is all that the following revolves around :) ). It'll get clearer in a minute.
Because maximizing a function f is equivalent to minimizing its opposite -f, we minimize the negative log-likelihood -log p(y | mu, s). The log is applied so that (for a Gaussian) the objective has a parabolic shape with the maximum (or minimum, if we're talking about -f) as a peak.
So the procedure is very simple: we predict mu and log(s), apply the exponential to get s, build the normal (Gaussian) distribution out of that, and compute the log probability of a target y under the predicted Gaussian. After that we simply take the opposite of the log probability for all samples, take the average, and minimize that loss function. Because very accurate mean predictions can result in very small scale predictions, we limit log_scale with a minimal bound to prevent the loss from going to infinity (and beyond! :) ).
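The training procedure above can be sketched as follows (a minimal NumPy illustration with my own choice of lower bound, not the repo's actual loss code). Note that the loss can legitimately go negative, since a continuous density can exceed 1:

```python
import numpy as np

def gaussian_nll(y, mean, log_scale, min_log_scale=-7.0):
    """Negative log-likelihood of targets y under N(mean, exp(log_scale)).

    log_scale is clipped from below so the loss cannot diverge when the
    network predicts a near-zero scale.
    """
    log_scale = np.maximum(log_scale, min_log_scale)
    scale = np.exp(log_scale)
    log_prob = (-0.5 * ((y - mean) / scale) ** 2
                - log_scale - 0.5 * np.log(2.0 * np.pi))
    return -np.mean(log_prob)

y = np.array([0.0, 0.1])
# Perfect mean predictions give a lower (here, negative) loss than bad ones.
loss_good = gaussian_nll(y, mean=y, log_scale=np.array([-2.0, -2.0]))
loss_bad = gaussian_nll(y, mean=y + 1.0, log_scale=np.array([-2.0, -2.0]))
```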
Now, one might wonder: how on earth is this even related to the MoL loss? Well, there's something I didn't tell you: we can actually reformulate the MLE loss to use the CDF for training. This part is usually tough to explain with simple words or formulas, so let's get some help from plots, shall we? :)
Here's the procedure: take two points on either side of the target y, x_cdf- = y - delta and x_cdf+ = y + delta (with delta being half the width of one quantization bin, i.e. 1/(2**16 - 1) for 16-bit audio scaled to [-1, 1]), compute their CDF values cdf- and cdf+ under the predicted distribution, and maximize the difference cdf+ - cdf-.
One might ask, why?? well, this is the part that is easily explained with plots :) (Don't ask me how people got this idea though, I believe some kind of sorcery has been used..)
In the figure above, I have made two plots (PDF on top, CDF on the bottom) for 3 different normal distributions, with their respective means and variances in the legend. I also drew 4 dashed lines, 2 black and 2 blue. The 2 blue dashed lines are on top of each other so they appear as one; they refer to x_cdf+ and x_cdf- (explanation on the way). The blue lines follow the real plot scale; that's reality. For the sake of explanation, I break the plot scale rules by a factor of 1000x to create the black dashed lines, which are exaggerated (not real) versions of x_cdf+ and x_cdf-. The intersection of x_cdf+ with a distribution's CDF gives cdf+; the same applies on the cdf- side. cdf+ and cdf- are what I call the "envelope" borders.
Alright, let's stick with the CDF for a while. In the figure, I suppose that y = 0, which is represented by the blue dashed line. Now, if we consider x_cdf+ and x_cdf- to be the two black dashed lines, how can we maximize the difference between cdf+ and cdf- (i.e: maximize cdf+ - cdf-) while x_cdf+ and x_cdf- are kept constant (they only depend on the true value of the target y, and the model has no control over them)? Actually, by having a closer look at the CDF of the Gaussian distribution, we notice that the maximal value of cdf+ - cdf- is obtained with the red Gaussian, for two reasons:
- The mean of the Gaussian is exactly on top of the real y value. In contrast, the blue Gaussian's mean is way off, so cdf+ and cdf- for the blue distribution are about the same (saturation region of the CDF).
- The slope of the CDF gets bigger as s gets smaller. For s = exp(log_scale), this means the CDF slope grows as log_scale shrinks. Since log is an increasing function, the slope is also bigger when log_scale gets smaller. As usual, we prevent log_scale from diverging to negative infinity by adding a lower bound.
Now, if we come back to the real plot scale, maximizing the difference between cdf+ and cdf- is really maximizing the slope of the CDF at the point of x-coordinate y. It is well known that the maximal slope of a CDF is hit exactly at its inflection point (inflection point: f''(x) = 0, which means the derivative of f is at its maximum, thus maximal slope (or minimal if the function is decreasing)).
So, just like that, this function (the difference cdf+ - cdf-; does this function not have a name? am I not aware of it?) not only hits its maximum when mu = y but also when s is the smallest possible (the smaller the scale, the higher the value of the slope). Next, to make this a loss function we simply apply a log and a minus sign.
If we make a reference to the PDF, we indeed notice that the bigger the difference between cdf+ and cdf-, the higher the probability the distribution assigns to picking the real target y (the biggest value is hit for the red Gaussian). Thus this reformulation of the loss function is still, in principle, consistent with the MLE we discussed earlier (maximize the probability that a real sample y is drawn from the output distribution N(mu, s). ;) ).
EDIT: If you have some mathematical background, you can understand this as follows (thanks to @m-toman for pointing it out). Keeping in mind that the PDF is the derivative of the CDF, the trick is, instead of maximizing the PDF at some target y directly (finding the distribution that has maximum probability of drawing y), to go to the integral (the CDF) and maximize its derivative around y. Naturally we wouldn't explicitly maximize the derivative itself, as that would bring us back to the PDF; instead we approximate the derivative around y with a finite-difference slope (cdf+ - cdf- = cdf_delta, which is a good approximation of the derivative when cdf+ and cdf- are close to each other, like in our case). Thus, maximizing the slope (the approximate derivative) is equivalent to maximizing the PDF around y. (This is true as long as the CDF has an S shape like the sigmoid, i.e. the distribution is unimodal, like Gaussian, Student, Logistic..) Let's not forget we act on the mean and scale of the distribution, so technically WaveNet is going through the complex process of determining the best-suiting Gaussian (or MoL) distribution that draws a target y from previous samples x and conditioning c. Genius idea, right? :)
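This finite-difference view can be checked numerically: cdf_delta divided by the bin width converges to the PDF as the bin shrinks. A small self-contained check (my own illustration, with made-up loc/scale/bin values):

```python
import numpy as np

def logistic_cdf(x, loc, scale):
    # Equation (1): sigmoid((x - loc) / scale)
    return 1.0 / (1.0 + np.exp(-(x - loc) / scale))

def logistic_pdf(x, loc, scale):
    # The exact derivative of the logistic CDF.
    z = np.exp(-(x - loc) / scale)
    return z / (scale * (1.0 + z) ** 2)

# Made-up target and distribution parameters for illustration.
y, loc, scale = 0.0, 0.05, 0.2
delta = 1e-3                              # half of a (hypothetical) bin width

cdf_delta = logistic_cdf(y + delta, loc, scale) - logistic_cdf(y - delta, loc, scale)
approx_pdf = cdf_delta / (2.0 * delta)    # slope approximation of the CDF
exact_pdf = logistic_pdf(y, loc, scale)   # the true derivative at y
```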
That is the base of MoL loss as well. Once this has been assimilated without problems, MoL loss is really a variation with small details. In the next section, I will solely be focusing on those details. :)
Let's start with the distribution training as it's a continuation to what has been discussed in the previous section:
Distribution loss: There's really not much to say here that hasn't already been said earlier. The following lines simply compute the Logistic CDF using equation (1) we presented earlier: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L44-L49
Then come the corner and middle special cases treated in PixelCNN++, which are handled in the following code (where I shamelessly didn't even change the comments..). To understand what's going on there, please have a look at these softmax-sigmoid relations: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L51-L60
Distribution selection loss: (This is my perspective, to verify with other sources.) Finally, because the model uses a mixture of distributions instead of a single Logistic distribution, we also need a loss term that encourages the model not merely to select one of the distributions depending on the situation, but to actually pick the best distribution among the existing ones (for that situation). Let me explain:
After computing the probabilities cdf+ - cdf-, which I will call cdf_delta, we compute a second term, a simple softmax over the distribution logits (logit_probs), and combine the two into the mixture probability P = sum_i softmax(logit_probs)_i * cdf_delta_i. As you probably already know, the softmax returns a vector of floats in [0, 1] that always sum up to 1. So let's consider for a moment that we have 10 Logistic distributions, where 9 of them make bad predictions and only 1 makes an accurate prediction. If the model assigns about the same weight to every distribution, all "gains" are multiplied by a factor of 0.1 (the output of the softmax). When summing the results of this multiplication, such a "gain" is much worse than multiplying the good Logistic's cdf_delta by 1 and all the bad cdf_deltas by 0. Thus the model is encouraged to pick the most promising Logistic. This is the core idea. Let's look at the code.
What I previously discussed is in the lines that follow: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L69-L74
It is worth noting that we work in log space for numerical stability. It is also worth noting the property of the log: log(a * b) = log(a) + log(b), so log(pi_i * cdf_delta_i) = log(pi_i) + log(cdf_delta_i). Summing the mixture then requires computing log(sum_i exp(log_term_i)), which explains the usage of log_sum_exp instead of a simple log_sum.
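The log_sum_exp mentioned above can be sketched as follows (my own NumPy illustration, not the repo's code; the log-term values are made up to show the underflow problem):

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x))): factor out the max so the
    exponentials never overflow or all underflow to zero."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Per-component log-probabilities log(pi_i) + log(cdf_delta_i) can be very
# negative; a naive np.log(np.sum(np.exp(...))) would underflow to -inf here.
log_terms = np.array([-1000.0, -1001.0])
stable = log_sum_exp(log_terms)
```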
There you have it, I assume that's all there is to know on MoL (and Gaussian distribution sampling for WaveNet). I hope this modest comment helped you get the main intuition to actually be able to go through the code and feel like you know what's going on.
Everyone is invited to add anything I may have missed or discuss anything I explained wrong. ;)
In any case, here are some useful references I found when trying to understand MoL/Gaussian myself in case anyone wants to have some more reading:
@Rayhane-mamah That is a legendary comment for sure! I cannot say how much I appreciate it :D I've literally spent most of my day reading through, searching and understanding these ideas. It is indeed a pain haha, but at the same time I feel like I've learned a lot.
@StevenZYj thanks a lot! :)
I know that sometimes I talk about stuff as if it's evident.. (I saw most of the involved mathematics in college, so I tend to skip some details, supposing they're well known..) Please don't hate me for that x)
If you find anything not explained well enough, please feel free to ask any questions, I'll do my best :) Of course if you also happen to find any additional information, please feel free to share with us!
Wow, I'll also check out this amazing comment later on. Thanks for that.
EDIT1: Started reading the first paragraphs and I'd like to comment whenever I think it makes sense ;):
Hey @m-toman
@m-toman you seem to have great ways of simplifying things, I look forward to your future notes!
@Rayhane-mamah Thanks, I've read it through now and the rest sounded pretty straightforward. I think your explanation of the CDF thingy is better for someone with less background knowledge, as mine makes a few more assumptions. Regarding the wizardness, the first time I read the Tacotron paper I thought: how the hell do they come up with all those architectures? Of course, the use of the encoder/decoder attention is clear (using seq2seq as a replacement for more classic duration models), but how they came up with something like the CBHG eludes me. Deep learning often seems like a bit too much trial and error to me, coming from a more traditional speech synthesis background where things are a bit more... transparent and deterministic (although in reality a decision tree with thousands of nodes in the classic HMM-based systems isn't that human-readable either...). But I'm still excited that a single network replaces hundreds of thousands of lines of Festival/Festvox/whatever code and 120 individual steps in some HMM training scripts ;).
Back to the topic, a couple hints that hopefully help others:
@m-toman yeah, it sure feels like trial and error driven by intuition (or by strong mathematical research? e.g: parallel WaveNet doesn't seem that intuition-based..)
Hi @Rayhane-mamah, thank you for fixing the bugs in the WaveNet vocoder. To train both models sequentially I use: python train.py --model='Tacotron-2'. The WaveNet loss values are as follows:
[2018-08-16 11:45:10.851] Step 1 [14.957 sec/step, loss=1.17336, avg_loss=1.17336]
[2018-08-16 11:45:13.820] Step 2 [8.963 sec/step, loss=0.67372, avg_loss=0.92354]
[2018-08-16 11:45:16.799] Step 3 [6.968 sec/step, loss=0.25406, avg_loss=0.70038]
[2018-08-16 11:45:19.958] Step 4 [6.016 sec/step, loss=0.05745, avg_loss=0.53965]
[2018-08-16 11:45:23.253] Step 5 [5.472 sec/step, loss=-0.32939, avg_loss=0.36584]
[2018-08-16 11:45:26.504] Step 6 [5.101 sec/step, loss=-0.45899, avg_loss=0.22837]
[2018-08-16 11:45:29.797] Step 7 [4.843 sec/step, loss=-0.27051, avg_loss=0.15710]
[2018-08-16 11:45:33.066] Step 8 [4.646 sec/step, loss=-0.14045, avg_loss=0.11990]
[2018-08-16 11:45:36.283] Step 9 [4.487 sec/step, loss=-0.67957, avg_loss=0.03107]
[2018-08-16 11:45:39.526] Step 10 [4.363 sec/step, loss=-0.44299, avg_loss=-0.01633]
[2018-08-16 11:45:42.807] Step 11 [4.265 sec/step, loss=-0.49541, avg_loss=-0.05988]
[2018-08-16 11:45:46.157] Step 12 [4.188 sec/step, loss=-0.80715, avg_loss=-0.12216]
[2018-08-16 11:45:49.488] Step 13 [4.122 sec/step, loss=-0.55638, avg_loss=-0.15556]
[2018-08-16 11:45:52.746] Step 14 [4.061 sec/step, loss=-0.72860, avg_loss=-0.19649]
[2018-08-16 11:45:55.952] Step 15 [4.004 sec/step, loss=-0.46490, avg_loss=-0.21438]
[2018-08-16 11:45:59.138] Step 16 [3.952 sec/step, loss=-0.76029, avg_loss=-0.24850]
[2018-08-16 11:46:02.332] Step 17 [3.908 sec/step, loss=-0.92684, avg_loss=-0.28841]
[2018-08-16 11:46:05.524] Step 18 [3.868 sec/step, loss=-0.77093, avg_loss=-0.31521]
[2018-08-16 11:46:08.719] Step 19 [3.833 sec/step, loss=-0.07051, avg_loss=-0.30233]
[2018-08-16 11:46:11.915] Step 20 [3.801 sec/step, loss=-0.22756, avg_loss=-0.29859]
[2018-08-16 11:46:15.112] Step 21 [3.772 sec/step, loss=-0.92122, avg_loss=-0.32824]
[2018-08-16 11:46:18.307] Step 22 [3.746 sec/step, loss=-0.83789, avg_loss=-0.35141]
[2018-08-16 11:46:21.502] Step 23 [3.722 sec/step, loss=-0.73162, avg_loss=-0.36794]
[2018-08-16 11:46:21.611]
Generated 32 train batches of size 3 in 0.107 sec
[2018-08-16 11:46:24.699] Step 24 [3.700 sec/step, loss=-0.70173, avg_loss=-0.38185]
[2018-08-16 11:46:27.897] Step 25 [3.680 sec/step, loss=-0.64619, avg_loss=-0.39242]
[2018-08-16 11:46:31.090] Step 26 [3.661 sec/step, loss=-0.94044, avg_loss=-0.41350]
[2018-08-16 11:46:34.288] Step 27 [3.644 sec/step, loss=-0.73025, avg_loss=-0.42523]
[2018-08-16 11:46:37.480] Step 28 [3.628 sec/step, loss=-0.69240, avg_loss=-0.43477]
[2018-08-16 11:46:40.676] Step 29 [3.613 sec/step, loss=-0.62764, avg_loss=-0.44142]
[2018-08-16 11:46:43.868] Step 30 [3.599 sec/step, loss=-0.61241, avg_loss=-0.44712]
[2018-08-16 11:46:47.057] Step 31 [3.586 sec/step, loss=-0.73374, avg_loss=-0.45637]
[2018-08-16 11:46:50.250] Step 32 [3.573 sec/step, loss=-0.84518, avg_loss=-0.46852]
[2018-08-16 11:46:53.442] Step 33 [3.562 sec/step, loss=-0.74587, avg_loss=-0.47692]
[2018-08-16 11:46:56.641] Step 34 [3.551 sec/step, loss=-0.72382, avg_loss=-0.48418]
[2018-08-16 11:46:59.835] Step 35 [3.541 sec/step, loss=-1.03929, avg_loss=-0.50004]
[2018-08-16 11:47:03.029] Step 36 [3.531 sec/step, loss=-1.12073, avg_loss=-0.51729]
[2018-08-16 11:47:06.226] Step 37 [3.522 sec/step, loss=-0.23181, avg_loss=-0.50957]
[2018-08-16 11:47:09.421] Step 38 [3.514 sec/step, loss=-0.56894, avg_loss=-0.51113]
[2018-08-16 11:47:12.609] Step 39 [3.505 sec/step, loss=-0.46203, avg_loss=-0.50987]
[2018-08-16 11:47:15.801] Step 40 [3.497 sec/step, loss=-0.68721, avg_loss=-0.51431]
[2018-08-16 11:47:18.995] Step 41 [3.490 sec/step, loss=-0.74064, avg_loss=-0.51983]
[2018-08-16 11:47:22.184] Step 42 [3.483 sec/step, loss=-0.69302, avg_loss=-0.52395]
[2018-08-16 11:47:25.382] Step 43 [3.476 sec/step, loss=-1.06822, avg_loss=-0.53661]
[2018-08-16 11:47:28.576] Step 44 [3.470 sec/step, loss=-0.70099, avg_loss=-0.54034]
[2018-08-16 11:47:31.767] Step 45 [3.464 sec/step, loss=-0.85478, avg_loss=-0.54733]
[2018-08-16 11:47:34.960] Step 46 [3.458 sec/step, loss=-0.66613, avg_loss=-0.54991]
[2018-08-16 11:47:38.233] Step 47 [3.454 sec/step, loss=-0.89251, avg_loss=-0.55720]
[2018-08-16 11:47:41.457] Step 48 [3.449 sec/step, loss=-0.77377, avg_loss=-0.56172]
[2018-08-16 11:47:44.656] Step 49 [3.444 sec/step, loss=-0.78088, avg_loss=-0.56619]
[2018-08-16 11:47:47.876] Step 50 [3.439 sec/step, loss=-0.65642, avg_loss=-0.56799]
Are the negative values in Loss correct?
Hey @atreyas313, yes that is normal, assuming you are using "raw" with 2 output channels (which uses a single Gaussian distribution). As explained in my first comment, we minimize the negative log probability of y. With good predictions the probability density gets bigger, and since a continuous density can exceed 1, its log becomes positive and the loss goes below zero (bigger absolute value under 0). So yeah, that's normal :)
If you prefer to use MoL instead, change the output_channels parameter to M * 3, where M is your chosen number of Logistic distributions (usually 10).
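For reference, here is a sketch of how those M * 3 output channels are typically split into the three parameter groups. The channel order here follows the usual PixelCNN++-style convention (mixture logits first, then means, then log-scales); the exact layout in mixture.py should be checked against the code:

```python
import numpy as np

M = 10                             # number of logistic components
out = np.zeros((1, 100, 3 * M))    # dummy network output: (batch, time, 3*M)

# Split the last axis into the three per-component parameter groups.
logit_probs = out[..., :M]         # mixture weights (pre-softmax logits)
means       = out[..., M:2 * M]    # per-component locations
log_scales  = out[..., 2 * M:]     # per-component log-scales
```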
Hey @Rayhane-mamah
scale our y with a factor of 2/(2**16 - 1)
Sorry for my silly question: does this scaling take place in the librosa.load(...) function, or do we have to scale it manually? The only scaling written in the code is wav = wav / np.abs(wav).max() * hparams.rescaling_max
I do not think rescaling is a good preprocessing idea, since there might be exceptional peaks in some corpora; the outputs of rescaling might then be abnormal for training.
@begeekmyfriend Hi, what do you mean by "rescaling"? Could you point out where it happens in preprocess.py?
On this line, if the audio is scaled to [-2, 2] rather than [-1, 1], should I just clip the sampled prediction to [-2, 2]? Does anything need to be modified in the discretized MoL loss file mixture.py?
Hi @Rayhane-mamah, sorry to bother you, but the URL of the Gumbel-max trick post you linked above has changed:
https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/
This thread has been amazing.
Could we borrow some of the ideas behind MoL to model the output distribution with a mixture of Gaussians? I ask because I haven't had success with MoL for my problem.
Hey @Rayhane-mamah
scale our y with a factor of 2/(2**16 - 1)
Sorry for my silly question, does this scaling take place in the librosa.load(...) function or do we have to scale it manually? The only scaling written in the code is wav = wav / np.abs(wav).max() * hparams.rescaling_max
What does "a factor of 2/(2**16 - 1)" mean?
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in this code)
log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in)
I think log(Logistic_pdf) = -mid_in - log_scales - 2. * softplus(-mid_in); can anyone help me understand this?
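For what it's worth, the two expressions are algebraically identical, thanks to the softplus identity softplus(x) = x + softplus(-x). A quick numeric check (my own illustration, with made-up inputs):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + exp(x)) computed without overflow.
    return np.logaddexp(0.0, x)

mid_in = np.linspace(-5.0, 5.0, 11)   # made-up standardized inputs
log_scales = 0.3                      # made-up log-scale

repo_form  = mid_in - log_scales - 2.0 * softplus(mid_in)
other_form = -mid_in - log_scales - 2.0 * softplus(-mid_in)
# softplus(x) = x + softplus(-x), so the two forms agree everywhere.
```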
Hi @Rayhane-mamah, thanks for your legendary answer! While I have more or less grasped your ideas, I have another question that has bothered me for days: why use an approximation of the PDF in the first place during training? My guess for the MoL case is that it leads to a more straightforward formulation, as the CDF of the Logistic distribution is easier to calculate than its PDF. But what about the Gaussian case? Why not directly use the PDF to calculate the MLE loss?
Hello! Thank you for your amazing post, but I still have some questions to think about.
As far as I know, you sourced the discretized_mix_logistic_loss function from the official implementation of PixelCNN++, but you don't update the means when computing the log prob as in the original code. I'd like to know why you decided to take that out?
Hi @Rayhane-mamah , really nice work, and thanks for the explanations so far.
I have a further question about the training of the distributions and would be glad if you can help me. So with cdf_delta you basically decide which distribution to choose. But what does that actually mean during backpropagation, especially in terms of the layers predicting the mean, scale and logit_probs? After all, the distributions must be influenced differently, otherwise all distributions would converge to the same "optimum", wouldn't they?
Thank you for your time.
Hi all, first, thanks @Rayhane-mamah for fixing the bugs in the WaveNet vocoder and making it fully work now :) I've spent several days looking into its implementation and there's a part that really makes me struggle: the implementation of the discretized MoL in sampling and loss calculation. I know it is sourced from the official implementation of PixelCNN++, but that does not help much. It seems I'm lacking certain mathematical knowledge. Could anyone help me out? Thanks.