StevenZYj opened this issue 6 years ago
Hello @StevenZYj, thanks for reaching out! Also sorry for being late to answer, I didn't find much free time to write a proper legendary comment.. :)
I know the feeling, MoL is a pain to understand haha.. no problem though, let's go through this step by step (with the assistance of Gaussian distribution, things tend to become easier).
Before we get into this, I want to apologize in advance for any typos or careless mistakes; most of what follows comes from my personal thinking and observations. While I tried to cover the most important stuff, I am always open to any improvements to my modest comment! Please sit back and enjoy :)
Let's start with easy stuff and move to more complicated levels as we proceed:
Sampling: In most high-resolution applications where the model is also autoregressive (calls itself recursively on inputs moving through time), you will usually find that sampling randomly from the output distribution works better than other approaches (random sampling with softmax probabilities instead of picking the argmax when using 'quantize-wavenet'). MoL or Gaussian sampling are not much different: the only core difference is that we don't do multinomial sampling over N classes using N probabilities; instead we sample either from a Gaussian or from a mixture of logistics.
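As a small illustration of random sampling versus argmax (my own NumPy sketch, not the repo's code; the logits are made up):

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Sample a class index from softmax probabilities instead of taking argmax."""
    # Stable softmax: subtract the max logit before exponentiating.
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])
samples = [sample_from_logits(logits, rng) for _ in range(1000)]
# argmax would always return class 0; sampling returns 0 most often but not always.
```

This is why, at high sample rates, sampling avoids the "frozen" repetitive outputs that greedy argmax decoding tends to produce.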
Now, if I had to make a neural network that always samples from Gaussian distributions, the most natural way to think of it would be to make the NN predict the mean and scale then use those parameters to sample from a normal distribution with the predicted mean and scale. This is widely used in VAEs to create latent Gaussian spaces for future sampling.
So, at synthesis time (we will get to training later), the network predicts a mean and a scale; we then use those two predictions to pick a sample from the corresponding Gaussian distribution. For numerical stability, and because TF doesn't have an explicit efficient way to force our scale output to be positive, we instead predict log(s) and follow it with an exponential, which guarantees s > 0. We construct the Gaussian distribution, pick a random sample from it and clip the prediction to [-1, 1] because we suppose the audio is scaled to [-1, 1]. Code can be found in gaussian.py. Simple enough?
But, what does "mixture" stand for then? Here's where things become really interesting. A mixture of logistics (MoL) is a set of M Logistic distributions (M is 10 in our experiments), each with its own loc and scale. For that purpose we add a third parameter to every Logistic distribution, a mixture weight pi_i, which determines the probability of that distribution being picked. With that, the output of the model should be M * 3 channels
(30 in our experiments).
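The Gaussian synthesis step above can be sketched as follows (my own NumPy illustration, not the actual gaussian.py code; the mean/log-scale values are made up):

```python
import numpy as np

def sample_gaussian(mean, log_scale, rng):
    """Given the network's predicted mean and log-scale, draw one audio sample.

    The exponential guarantees a positive scale; the final clip keeps the
    sample inside the [-1, 1] range the audio was normalized to.
    """
    scale = np.exp(log_scale)              # s > 0 by construction
    x = rng.normal(loc=mean, scale=scale)  # draw from N(mean, scale)
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(42)
x = sample_gaussian(mean=0.1, log_scale=-3.0, rng=rng)
```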
So, at synthesis time, the network first picks which Logistic distribution (out of M) to use. This is done by sampling from a softmax over the distribution probabilities (logit_probs
in the code). I know these lines are probably confusing:
https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L90-L93
It turns out there is something called the Gumbel-max trick, which mimics softmax sampling: adding Gumbel noise -log(-log(u)) to the logits and taking the argmax is equivalent to sampling from the softmax over those logits. So basically, the idea here is that we apply a softmax-like sampling to select which Logistic distribution to use.
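The Gumbel-max selection can be sketched as follows (my own NumPy illustration, not the repo's code; the logits are made up):

```python
import numpy as np

def gumbel_max_sample(logit_probs, rng):
    """Pick a mixture component: argmax over logits perturbed by Gumbel noise.

    This is distributed exactly like sampling from softmax(logit_probs).
    """
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=logit_probs.shape)
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logit_probs + gumbel))

rng = np.random.default_rng(0)
logit_probs = np.array([0.0, 3.0, 0.0])   # component 1 heavily favoured
picks = [gumbel_max_sample(logit_probs, rng) for _ in range(1000)]
# Component 1 gets picked with probability softmax(logit_probs)[1] ~ 0.9.
```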
Now to the sampling itself. For MoL, we use something called inverse (transform) sampling. I always see people asking why we switch to the CDF (a sigmoid) when using the Logistic distribution; it actually makes training and sampling from the distribution very easy. Let's first consider the CDF of the Logistic distribution, equation (1):

CDF(x) = sigmoid((x - mu) / s) = 1 / (1 + exp(-(x - mu) / s))

The main idea of sampling is that we want to draw a random variable X distributed according to the Logistic distribution. The idea of inverse sampling is to do exactly that, but through the CDF: if y is uniform on (0, 1), then x = CDF^-1(y) follows the Logistic distribution. By inverting equation (1) we find the quantile function, equation (2):

x = mu + s * log(y / (1 - y))

In practice, we select a random uniform y in (1e-5, 1 - 1e-5) to avoid the saturation regions of the sigmoid, then determine x from it. If things got a bit too complicated at this point, use the Logistic distribution link above for plot assistance. Notice that if our randomly picked y = 0.5, x will be exactly the location (median) of the Logistic distribution, which makes sense. Luckily for us, by picking a y between 1e-5 and 1 - 1e-5 we get an x that is plausible for those mean and scale parameters according to the PDF.
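The quantile-function sampling above can be sketched as follows (my own NumPy illustration; the loc/scale values are made up):

```python
import numpy as np

def sample_logistic(loc, scale, rng):
    """Inverse-CDF sampling: draw u ~ Uniform(1e-5, 1 - 1e-5), then push it
    through the quantile function x = loc + scale * log(u / (1 - u))."""
    u = rng.uniform(1e-5, 1.0 - 1e-5)
    return loc + scale * (np.log(u) - np.log(1.0 - u))

# u = 0.5 lands exactly on loc, the median of the Logistic distribution.
x_mid = 0.2 + 0.1 * (np.log(0.5) - np.log(0.5))

# The empirical mean of many samples should sit near loc.
rng = np.random.default_rng(1)
xs = [sample_logistic(loc=0.0, scale=0.1, rng=rng) for _ in range(2000)]
```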
Perfect, we got how sampling is done at synthesis, now how do I train my model to make accurate predictions for these means and scales? and in a mixture case, how does the model learn to correctly select the Distribution to use? This brings us to the second part of this long comment :)
Training: To train sampling layers in neural networks (sometimes called parametrization layers), the objective is to output distribution parameters that maximize the likelihood of a target y being sampled from that distribution. i.e: we want the distribution parameters that maximize the probability that a real target y is drawn from this distribution (I repeat this a lot because it is all that the following revolves around :) ). It'll get clearer in a minute.
Because maximizing a function f is equivalent to minimizing its opposite -f, we minimize the negative log-likelihood -log p(y | mu, s). The log is applied so that (for a Gaussian) the objective has a parabolic shape with the maximum (or minimum, if we're talking about -f) as a peak.
So the procedure is very simple: we predict mu and log(s), apply the exponential to get s, build the normal (Gaussian) distribution out of that, and compute the log probability of a target y under the predicted Gaussian. After that we simply take the opposite of the log probability for all samples, take the average, and minimize that loss function. Because very accurate mean predictions can result in very small scale predictions, we limit log_scale with a minimal bound to prevent the loss from going to infinity (and beyond! :) ).
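The training procedure above can be sketched as follows (a minimal NumPy illustration with my own choice of lower bound, not the repo's actual loss code). Note that the loss can legitimately go negative, since a continuous density can exceed 1:

```python
import numpy as np

def gaussian_nll(y, mean, log_scale, min_log_scale=-7.0):
    """Negative log-likelihood of targets y under N(mean, exp(log_scale)).

    log_scale is clipped from below so the loss cannot diverge when the
    network predicts a near-zero scale.
    """
    log_scale = np.maximum(log_scale, min_log_scale)
    scale = np.exp(log_scale)
    log_prob = (-0.5 * ((y - mean) / scale) ** 2
                - log_scale - 0.5 * np.log(2.0 * np.pi))
    return -np.mean(log_prob)

y = np.array([0.0, 0.1])
# Perfect mean predictions give a lower (here, negative) loss than bad ones.
loss_good = gaussian_nll(y, mean=y, log_scale=np.array([-2.0, -2.0]))
loss_bad = gaussian_nll(y, mean=y + 1.0, log_scale=np.array([-2.0, -2.0]))
```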
Now, one might wonder: how on earth is this even related to the MoL loss? Well, there's something I didn't tell you: we can actually reformulate the MLE loss to use the CDF for training. This part is usually tough to explain with simple words or formulas, so let's get some help from plots, shall we? :)
Here's the procedure: take two points on either side of the target y, x_cdf- = y - delta and x_cdf+ = y + delta (with delta being half the width of one quantization bin, i.e. 1/(2**16 - 1) for 16-bit audio scaled to [-1, 1]), compute their CDF values cdf- and cdf+ under the predicted distribution, and maximize the difference cdf+ - cdf-.
One might ask, why?? well, this is the part that is easily explained with plots :) (Don't ask me how people got this idea though, I believe some kind of sorcery has been used..)
In the figure above, I have made two plots (PDF on top, CDF on the bottom) for 3 different normal distributions, with their respective means and variances in the legend. I also drew 4 dashed lines, 2 black and 2 blue. The 2 blue dashed lines are on top of each other so they appear as one; they refer to x_cdf+ and x_cdf- (explanation on the way). The blue lines follow the real plot scale; that's reality. For the sake of explanation, I break the plot scale rules by a factor of 1000x to create the black dashed lines, which are exaggerated (not real) versions of x_cdf+ and x_cdf-. The intersection of x_cdf+ with a distribution's CDF gives cdf+; the same applies on the cdf- side. cdf+ and cdf- are what I call the "envelope" borders.
Alright, let's stick with the CDF for a while. In the figure, I suppose that y = 0, which is represented by the blue dashed line. Now, if we consider x_cdf+ and x_cdf- to be the two black dashed lines, how can we maximize the difference between cdf+ and cdf- (i.e: maximize cdf+ - cdf-) while x_cdf+ and x_cdf- are kept constant (they only depend on the true value of the target y, and the model has no control over them)? Actually, by having a closer look at the CDF of the Gaussian distribution, we notice that the maximal value of cdf+ - cdf- is obtained with the red Gaussian, for two reasons:
- The mean of the Gaussian is exactly on top of the real y value. In contrast, the blue Gaussian's mean is way off, so cdf+ and cdf- for the blue distribution are about the same (saturation region of the CDF).
- The slope of the CDF gets bigger as s gets smaller. For s = exp(log_scale), this means the CDF slope grows as log_scale shrinks. Since log is an increasing function, the slope is also bigger when log_scale gets smaller. As usual, we prevent log_scale from diverging to negative infinity by adding a lower bound.
Now, if we come back to the real plot scale, maximizing the difference between cdf+ and cdf- is really maximizing the slope of the CDF at the point of x-coordinate y. It is well known that the maximal slope of a CDF is hit exactly at its inflection point (inflection point: f''(x) = 0, which means the derivative of f is at its maximum, thus maximal slope (or minimal if the function is decreasing)).
So, just like that, this function (the difference cdf+ - cdf-; does this function not have a name? am I not aware of it?) not only hits its maximum when mu = y but also when s is the smallest possible (the smaller the scale, the higher the value of the slope). Next, to make this a loss function we simply apply a log and a minus sign.
If we make a reference to the PDF, we indeed notice that the bigger the difference between cdf+ and cdf-, the higher the probability the distribution assigns to picking the real target y (the biggest value is hit for the red Gaussian). Thus this reformulation of the loss function is still, in principle, consistent with the MLE we discussed earlier (maximize the probability that a real sample y is drawn from the output distribution N(mu, s). ;) ).
EDIT: If you have some mathematical background, you can understand this as follows (thanks to @m-toman for pointing it out). Keeping in mind that the PDF is the derivative of the CDF, the trick is, instead of maximizing the PDF at some target y directly (finding the distribution that has maximum probability of drawing y), to go to the integral (the CDF) and maximize its derivative around y. Naturally we wouldn't explicitly maximize the derivative itself, as that would bring us back to the PDF; instead we approximate the derivative around y with a finite-difference slope (cdf+ - cdf- = cdf_delta, which is a good approximation of the derivative when cdf+ and cdf- are close to each other, like in our case). Thus, maximizing the slope (the approximate derivative) is equivalent to maximizing the PDF around y. (This is true as long as the CDF has an S shape like the sigmoid, i.e. the distribution is unimodal, like Gaussian, Student, Logistic..) Let's not forget we act on the mean and scale of the distribution, so technically WaveNet is going through the complex process of determining the best-suiting Gaussian (or MoL) distribution that draws a target y from previous samples x and conditioning c. Genius idea, right? :)
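This finite-difference view can be checked numerically: cdf_delta divided by the bin width converges to the PDF as the bin shrinks. A small self-contained check (my own illustration, with made-up loc/scale/bin values):

```python
import numpy as np

def logistic_cdf(x, loc, scale):
    # Equation (1): sigmoid((x - loc) / scale)
    return 1.0 / (1.0 + np.exp(-(x - loc) / scale))

def logistic_pdf(x, loc, scale):
    # The exact derivative of the logistic CDF.
    z = np.exp(-(x - loc) / scale)
    return z / (scale * (1.0 + z) ** 2)

# Made-up target and distribution parameters for illustration.
y, loc, scale = 0.0, 0.05, 0.2
delta = 1e-3                              # half of a (hypothetical) bin width

cdf_delta = logistic_cdf(y + delta, loc, scale) - logistic_cdf(y - delta, loc, scale)
approx_pdf = cdf_delta / (2.0 * delta)    # slope approximation of the CDF
exact_pdf = logistic_pdf(y, loc, scale)   # the true derivative at y
```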
That is the base of MoL loss as well. Once this has been assimilated without problems, MoL loss is really a variation with small details. In the next section, I will solely be focusing on those details. :)
Let's start with the distribution training as it's a continuation to what has been discussed in the previous section:
Distribution loss: There's really not much to say here that hasn't already been said earlier. The following lines simply compute the Logistic CDF using equation (1) we presented earlier: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L44-L49
Then come the corner and middle special cases treated in PixelCNN++, which are handled in the following code (where I shamelessly didn't even change the comments..). To understand what's going on there, please have a look at these softmax-sigmoid relations: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L51-L60
Distribution selection loss: (This is my perspective, to verify with other sources.) Finally, because the model uses a mixture of distributions instead of a single Logistic distribution, we also need a loss term that encourages the model not merely to select one of the distributions depending on the situation, but to actually pick the best distribution among the existing ones (for that situation). Let me explain:
After computing the probabilities cdf+ - cdf-, which I will call cdf_delta, we compute a second term, a simple softmax over the distribution logits (logit_probs), and combine the two into the mixture probability P = sum_i softmax(logit_probs)_i * cdf_delta_i. As you probably already know, the softmax returns a vector of floats in [0, 1] that always sum up to 1. So let's consider for a moment that we have 10 Logistic distributions, where 9 of them make bad predictions and only 1 makes an accurate prediction. If the model assigns about the same weight to every distribution, all "gains" are multiplied by a factor of 0.1 (the output of the softmax). When summing the results of this multiplication, such a "gain" is much worse than multiplying the good Logistic's cdf_delta by 1 and all the bad cdf_deltas by 0. Thus the model is encouraged to pick the most promising Logistic. This is the core idea. Let's look at the code.
What I previously discussed is in the lines that follow: https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/mixture.py#L69-L74
It is worth noting that we work in log space for numerical stability. It is also worth noting the property of the log: log(a * b) = log(a) + log(b), so log(pi_i * cdf_delta_i) = log(pi_i) + log(cdf_delta_i). Summing the mixture then requires computing log(sum_i exp(log_term_i)), which explains the usage of log_sum_exp instead of a simple log_sum.
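The log_sum_exp mentioned above can be sketched as follows (my own NumPy illustration, not the repo's code; the log-term values are made up to show the underflow problem):

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x))): factor out the max so the
    exponentials never overflow or all underflow to zero."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Per-component log-probabilities log(pi_i) + log(cdf_delta_i) can be very
# negative; a naive np.log(np.sum(np.exp(...))) would underflow to -inf here.
log_terms = np.array([-1000.0, -1001.0])
stable = log_sum_exp(log_terms)
```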
There you have it, I assume that's all there is to know on MoL (and Gaussian distribution sampling for WaveNet). I hope this modest comment helped you get the main intuition to actually be able to go through the code and feel like you know what's going on.
Everyone is invited to add anything I may have missed or discuss anything I explained wrong. ;)
In any case, here are some useful references I found when trying to understand MoL/Gaussian myself in case anyone wants to have some more reading:
@Rayhane-mamah That is a legendary comment for sure! I cannot say how much I appreciate it :D I've literally spent most of my day reading through, searching and understanding these ideas. It is indeed a pain haha, but at the same time I feel like I've learned a lot.
@StevenZYj thanks a lot! :)
I know that sometimes I talk about stuff as if it's evident.. (I saw most of the involved mathematics in college, so I tend to skip some details, supposing they're well known..) Please don't hate me for that x)
If you find anything not explained well enough, please feel free to ask any questions, I'll do my best :) Of course if you also happen to find any additional information, please feel free to share with us!
Wow, I'll also check out this amazing comment later on. Thanks for that.
EDIT1: Started reading the first paragraphs and I'd like to comment whenever I think it makes sense ;):
Hey @m-toman
@m-toman you seem to have great ways of simplifying things, I look forward to your future notes!
@Rayhane-mamah Thanks, I've read it through now and the rest sounded pretty straightforward. I think your explanation of the CDF thingy is better for someone with less background knowledge, as mine makes a few more assumptions. Regarding the wizardness, the first time I read the Tacotron paper I thought: how the hell do they come up with all those architectures? Of course, the use of the encoder/decoder attention is clear (using seq2seq as a replacement for more classic duration models), but how they came up with something like the CBHG eludes me. Deep learning often seems like a bit too much trial and error to me, coming from a more traditional speech synthesis background where things are a bit more... transparent and deterministic (although in reality a decision tree with thousands of nodes in the classic HMM-based systems isn't that human-readable either...). But I'm still excited that a single network replaces hundreds of thousands of lines of Festival/Festvox/whatever code and 120 individual steps in some HMM training scripts ;).
Back to the topic, a couple hints that hopefully help others:
@m-toman yeah, it sure feels like trial and error driven by intuition (or by strong mathematical research? e.g: parallel WaveNet doesn't seem that intuition-based..)
Hi @Rayhane-mamah, thank you for fixing the bugs in the WaveNet vocoder. To train both models sequentially I use: python train.py --model='Tacotron-2'. The WaveNet loss values are as follows:
[2018-08-16 11:45:10.851] Step 1 [14.957 sec/step, loss=1.17336, avg_loss=1.17336]
[2018-08-16 11:45:13.820] Step 2 [8.963 sec/step, loss=0.67372, avg_loss=0.92354]
[2018-08-16 11:45:16.799] Step 3 [6.968 sec/step, loss=0.25406, avg_loss=0.70038]
[2018-08-16 11:45:19.958] Step 4 [6.016 sec/step, loss=0.05745, avg_loss=0.53965]
[2018-08-16 11:45:23.253] Step 5 [5.472 sec/step, loss=-0.32939, avg_loss=0.36584]
[2018-08-16 11:45:26.504] Step 6 [5.101 sec/step, loss=-0.45899, avg_loss=0.22837]
[2018-08-16 11:45:29.797] Step 7 [4.843 sec/step, loss=-0.27051, avg_loss=0.15710]
[2018-08-16 11:45:33.066] Step 8 [4.646 sec/step, loss=-0.14045, avg_loss=0.11990]
[2018-08-16 11:45:36.283] Step 9 [4.487 sec/step, loss=-0.67957, avg_loss=0.03107]
[2018-08-16 11:45:39.526] Step 10 [4.363 sec/step, loss=-0.44299, avg_loss=-0.01633]
[2018-08-16 11:45:42.807] Step 11 [4.265 sec/step, loss=-0.49541, avg_loss=-0.05988]
[2018-08-16 11:45:46.157] Step 12 [4.188 sec/step, loss=-0.80715, avg_loss=-0.12216]
[2018-08-16 11:45:49.488] Step 13 [4.122 sec/step, loss=-0.55638, avg_loss=-0.15556]
[2018-08-16 11:45:52.746] Step 14 [4.061 sec/step, loss=-0.72860, avg_loss=-0.19649]
[2018-08-16 11:45:55.952] Step 15 [4.004 sec/step, loss=-0.46490, avg_loss=-0.21438]
[2018-08-16 11:45:59.138] Step 16 [3.952 sec/step, loss=-0.76029, avg_loss=-0.24850]
[2018-08-16 11:46:02.332] Step 17 [3.908 sec/step, loss=-0.92684, avg_loss=-0.28841]
[2018-08-16 11:46:05.524] Step 18 [3.868 sec/step, loss=-0.77093, avg_loss=-0.31521]
[2018-08-16 11:46:08.719] Step 19 [3.833 sec/step, loss=-0.07051, avg_loss=-0.30233]
[2018-08-16 11:46:11.915] Step 20 [3.801 sec/step, loss=-0.22756, avg_loss=-0.29859]
[2018-08-16 11:46:15.112] Step 21 [3.772 sec/step, loss=-0.92122, avg_loss=-0.32824]
[2018-08-16 11:46:18.307] Step 22 [3.746 sec/step, loss=-0.83789, avg_loss=-0.35141]
[2018-08-16 11:46:21.502] Step 23 [3.722 sec/step, loss=-0.73162, avg_loss=-0.36794]
[2018-08-16 11:46:21.611]
Generated 32 train batches of size 3 in 0.107 sec
[2018-08-16 11:46:24.699] Step 24 [3.700 sec/step, loss=-0.70173, avg_loss=-0.38185]
[2018-08-16 11:46:27.897] Step 25 [3.680 sec/step, loss=-0.64619, avg_loss=-0.39242]
[2018-08-16 11:46:31.090] Step 26 [3.661 sec/step, loss=-0.94044, avg_loss=-0.41350]
[2018-08-16 11:46:34.288] Step 27 [3.644 sec/step, loss=-0.73025, avg_loss=-0.42523]
[2018-08-16 11:46:37.480] Step 28 [3.628 sec/step, loss=-0.69240, avg_loss=-0.43477]
[2018-08-16 11:46:40.676] Step 29 [3.613 sec/step, loss=-0.62764, avg_loss=-0.44142]
[2018-08-16 11:46:43.868] Step 30 [3.599 sec/step, loss=-0.61241, avg_loss=-0.44712]
[2018-08-16 11:46:47.057] Step 31 [3.586 sec/step, loss=-0.73374, avg_loss=-0.45637]
[2018-08-16 11:46:50.250] Step 32 [3.573 sec/step, loss=-0.84518, avg_loss=-0.46852]
[2018-08-16 11:46:53.442] Step 33 [3.562 sec/step, loss=-0.74587, avg_loss=-0.47692]
[2018-08-16 11:46:56.641] Step 34 [3.551 sec/step, loss=-0.72382, avg_loss=-0.48418]
[2018-08-16 11:46:59.835] Step 35 [3.541 sec/step, loss=-1.03929, avg_loss=-0.50004]
[2018-08-16 11:47:03.029] Step 36 [3.531 sec/step, loss=-1.12073, avg_loss=-0.51729]
[2018-08-16 11:47:06.226] Step 37 [3.522 sec/step, loss=-0.23181, avg_loss=-0.50957]
[2018-08-16 11:47:09.421] Step 38 [3.514 sec/step, loss=-0.56894, avg_loss=-0.51113]
[2018-08-16 11:47:12.609] Step 39 [3.505 sec/step, loss=-0.46203, avg_loss=-0.50987]
[2018-08-16 11:47:15.801] Step 40 [3.497 sec/step, loss=-0.68721, avg_loss=-0.51431]
[2018-08-16 11:47:18.995] Step 41 [3.490 sec/step, loss=-0.74064, avg_loss=-0.51983]
[2018-08-16 11:47:22.184] Step 42 [3.483 sec/step, loss=-0.69302, avg_loss=-0.52395]
[2018-08-16 11:47:25.382] Step 43 [3.476 sec/step, loss=-1.06822, avg_loss=-0.53661]
[2018-08-16 11:47:28.576] Step 44 [3.470 sec/step, loss=-0.70099, avg_loss=-0.54034]
[2018-08-16 11:47:31.767] Step 45 [3.464 sec/step, loss=-0.85478, avg_loss=-0.54733]
[2018-08-16 11:47:34.960] Step 46 [3.458 sec/step, loss=-0.66613, avg_loss=-0.54991]
[2018-08-16 11:47:38.233] Step 47 [3.454 sec/step, loss=-0.89251, avg_loss=-0.55720]
[2018-08-16 11:47:41.457] Step 48 [3.449 sec/step, loss=-0.77377, avg_loss=-0.56172]
[2018-08-16 11:47:44.656] Step 49 [3.444 sec/step, loss=-0.78088, avg_loss=-0.56619]
[2018-08-16 11:47:47.876] Step 50 [3.439 sec/step, loss=-0.65642, avg_loss=-0.56799]
Are the negative values in Loss correct?
Hey @atreyas313, yes that is normal, assuming you are using "raw" with 2 output channels (which uses a single Gaussian distribution). As explained in my first comment, we minimize the negative log probability of y. With good predictions the probability density gets bigger, and since a continuous density can exceed 1, its log becomes positive and the loss goes below zero (bigger absolute value under 0). So yeah, that's normal :)
If you prefer to use MoL instead, change the output_channels parameter to M * 3, where M is your chosen number of Logistic distributions (usually 10).
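For reference, here is a sketch of how those M * 3 output channels are typically split into the three parameter groups. The channel order here follows the usual PixelCNN++-style convention (mixture logits first, then means, then log-scales); the exact layout in mixture.py should be checked against the code:

```python
import numpy as np

M = 10                             # number of logistic components
out = np.zeros((1, 100, 3 * M))    # dummy network output: (batch, time, 3*M)

# Split the last axis into the three per-component parameter groups.
logit_probs = out[..., :M]         # mixture weights (pre-softmax logits)
means       = out[..., M:2 * M]    # per-component locations
log_scales  = out[..., 2 * M:]     # per-component log-scales
```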
Hey @Rayhane-mamah
scale our y with a factor of 2/(2**16 - 1)
Sorry for my silly question: does this scaling take place in the librosa.load(...) function, or do we have to scale it manually? The only scaling written in the code is wav = wav / np.abs(wav).max() * hparams.rescaling_max
I do not think rescaling is a good preprocessing idea, since there might be exceptional peaks in some corpora; the outputs of rescaling might then be abnormal for training.
@begeekmyfriend Hi, what do you mean by "rescaling"? Could you point out where it happens in preprocess.py?
On this line, if the audio is scaled to [-2, 2] rather than [-1, 1], should I just clip the sampled prediction to [-2, 2]? Does anything need to be modified in the discretized MoL loss file mixture.py?
Hi @Rayhane-mamah, sorry to bother you, but the URL of the Gumbel-max trick post you linked above has changed:
https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/
This thread has been amazing.
Could we borrow some of the ideas behind MoL to model the output distribution with a mixture of Gaussians? I ask because I haven't had success with MoL for my problem.
Hey @Rayhane-mamah
scale our y with a factor of 2/(2**16 - 1)
Sorry for my silly question, does this scaling take place in the librosa.load(...) function or do we have to scale it manually? The only scaling written in the code is wav = wav / np.abs(wav).max() * hparams.rescaling_max
What does "a factor of 2/(2**16 - 1)" mean?
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in this code)
log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in)
I think log(Logistic_pdf) = -mid_in - log_scales - 2. * softplus(-mid_in); can anyone help me understand this?
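For what it's worth, the two expressions are algebraically identical, thanks to the softplus identity softplus(x) = x + softplus(-x). A quick numeric check (my own illustration, with made-up inputs):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + exp(x)) computed without overflow.
    return np.logaddexp(0.0, x)

mid_in = np.linspace(-5.0, 5.0, 11)   # made-up standardized inputs
log_scales = 0.3                      # made-up log-scale

repo_form  = mid_in - log_scales - 2.0 * softplus(mid_in)
other_form = -mid_in - log_scales - 2.0 * softplus(-mid_in)
# softplus(x) = x + softplus(-x), so the two forms agree everywhere.
```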
Hi @Rayhane-mamah, thanks for your legendary answer! While I have more or less grasped your ideas, I have another question that has bothered me for days: why use an approximation of the PDF in the first place during training? My guess for the MoL case is that it leads to a more straightforward formulation, as the CDF of the Logistic distribution is easier to calculate than its PDF. But what about the Gaussian case? Why not directly use the PDF to calculate the MLE loss?
Hello! Thank you for your amazing post, but I still have some questions to think about.
As far as I know, you sourced the discretized_mix_logistic_loss function from the official implementation of PixelCNN++, but you don't update the means when computing the log prob as in the original code. I'd like to know why you decided to take that out?
Hi @Rayhane-mamah , really nice work, and thanks for the explanations so far.
I have a further question about the training of the distributions and would be glad if you can help me. So with cdf_delta you basically decide which distribution to choose. But what does that actually mean during backpropagation, especially in terms of the layers predicting the mean, scale and logit_probs? After all, the distributions must be influenced differently, otherwise all distributions would converge to the same "optimum", wouldn't they?
Thank you for your time.
Hi all, first, thanks @Rayhane-mamah for fixing the bugs in the WaveNet vocoder and making it fully work now :) I've spent several days looking into its implementation and there's a part that really makes me struggle: the implementation of the discretized MoL in sampling and loss calculation. I know it is sourced from the official implementation of PixelCNN++, but that does not help much. It seems I'm lacking certain mathematical knowledge. Could anyone help me out? Thanks.