Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Doubt on use of discretized MoL in sampling and loss calculation #155

Open StevenZYj opened 6 years ago

StevenZYj commented 6 years ago

Hi all, first off, thanks @Rayhane-mamah for fixing the bugs in the wavenet vocoder and making it fully work now :) I've spent several days looking into its implementation and there's one part I've really struggled with—the implementation of discretized MoL in sampling and loss calculation. I know it is sourced from the official implementation of PixelCNN++, but that doesn't help much. It seems like I'm lacking some mathematical background. I wonder if anyone can help me out? Thanks.

Rayhane-mamah commented 6 years ago

Hello @StevenZYj, thanks for reaching out! Also, sorry for being late with this answer; I didn't find much free time to write a proper legendary comment.. :)

I know the feeling, MoL is a pain to understand haha.. no problem though, let's go through this step by step (with the assistance of Gaussian distribution, things tend to become easier).

Before we get into this, I want to apologize in advance for any typos or careless mistakes; most of what follows comes from my personal thinking and observations. While I tried to cover the most important points, I am always open to improvements to my modest comment! Please sit back and enjoy :)


Let's start with easy stuff and move to more complicated levels as we proceed:

      The main idea of sampling is that we want to pick a random variable X that is most probable under the distribution's PDF. The idea of inverse sampling is to do exactly that, but using the CDF formula: i.e. we want to pick a random variable X that is most probable under the Logistic PDF. By inverting the Logistic CDF y = sigmoid((x - mean) / scale), some simple calculation indeed gives the quantile function (2): x = mean + scale * log(y / (1 - y)).

      In practice, we select a random uniform y in (1e-5, 1 - 1e-5) to avoid the saturation regions of the sigmoid, then determine x from it (see the sketch below). If things got a bit too complicated at this point, use the Logistic distribution link above for plot assistance. Naturally, if our randomly picked y = 0.5, x will be exactly the mean of the Logistic distribution, which makes sense. Lucky for us, by picking a y between 1e-5 and 1 - 1e-5 we actually get an x that is highly probable under the PDF for those mean and scale parameters.
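
A minimal sketch of this inverse-sampling step, assuming a single Logistic distribution (plain numpy, illustrative names only; this is not the repository's exact `sample_from_discretized_mix_logistic` code):

```python
import numpy as np

def sample_logistic(mean, log_scale, rng=np.random):
    """Draw one sample from Logistic(mean, scale) by inverse sampling.

    A uniform y in (1e-5, 1 - 1e-5) is pushed through the quantile
    function x = mean + scale * log(y / (1 - y)); the clipping keeps
    us away from the saturated tails of the sigmoid.
    """
    y = rng.uniform(1e-5, 1.0 - 1e-5)
    scale = np.exp(log_scale)
    return mean + scale * (np.log(y) - np.log(1.0 - y))

# y = 0.5 would land exactly on the mean, as noted above.
print(sample_logistic(mean=0.0, log_scale=-3.0))
```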

Perfect, we got how sampling is done at synthesis, now how do I train my model to make accurate predictions for these means and scales? and in a mixture case, how does the model learn to correctly select the Distribution to use? This brings us to the second part of this long comment :)

[Figure pdf_and_cdf: PDF (top) and CDF (bottom) of three Gaussian distributions with different means and scales]

      In the figure above, I made two plots (PDF on top, CDF on the bottom) for 3 different normal distributions, with their respective means and variances in the legend. I also drew 4 dashed lines, 2 black and 2 blue. The 2 blue dashed lines are on top of each other, so they appear as one. They refer to x_cdf+ and x_cdf- (explanation on the way). The blue lines follow the real plot scale; that's reality. For the sake of explanation, I break the plot scale by a factor of 1000x to create the black dashed lines, which are the not-to-scale x_cdf+ and x_cdf-. The intersection of x_cdf+ with a distribution's CDF gives cdf+. The same applies on the cdf- side. cdf+ and cdf- are what I call the "envelope" borders.

      Alright, let's stick with the CDF for a while. In the figure, I suppose that y = 0, which is represented by the blue dashed line. Now, if we consider x_cdf+ and x_cdf- to be the two dashed black lines, how can we maximize the difference between cdf+ and cdf- (i.e. maximize cdf+ - cdf-) while x_cdf+ and x_cdf- are kept constant (they only depend on the true value of the target y, and the model has no control over them)? By taking a closer look at the CDF of the Gaussian distribution, we notice that the maximal value of cdf+ - cdf- is obtained with the red Gaussian, for two reasons:

       - The mean of the Gaussian is exactly on top of the real y value, in contrast to the blue Gaussian, which is way off; thus cdf+ and cdf- for the blue distribution are about the same (saturation region of the CDF).

       - The slope of the CDF gets bigger as the scale gets smaller. Since the model predicts log(scale) and log is an increasing function, the slope is also bigger as log(scale) gets smaller. As usual, we prevent the slope from blowing up (the scale going to 0) by adding a lower bound on log(scale).

      Now, if we come back to the real plot scale, maximizing the difference between cdf+ and cdf- is really maximizing the slope of the CDF at the point of x-coordinate y. It is well known that the maximal slope of a CDF f is hit exactly at its inflection point (inflection point: f''(x) = 0, which means the derivative f' is at its maximum, thus maximal slope (or minimal if the function is decreasing)).

      So, just like that, this function (the difference between cdf+ and cdf-; does this function not have a name? am I not aware of one?) not only hits its maximum when the mean sits exactly on the target y, but also when the scale is the smallest possible (the smaller the scale, the higher the value of the slope). Then, to turn this into a loss function, we simply apply a log and a minus sign.

      If we want to make a reference to the PDF, we indeed notice that the bigger the difference between cdf+ and cdf-, the higher the probability the distribution assigns to picking the real target y (the biggest value is hit for the red Gaussian). Thus this reformulation of the loss function is still, in principle, consistent with the MLE we discussed earlier (maximize the probability that a real sample y is drawn from the output distribution predicted by the model ;) ).

EDIT: If you have some mathematical background, you can understand this as follows (thanks to @m-toman for pointing it out): keeping in mind that the PDF is the derivative of the CDF, the trick is, instead of maximizing the PDF for some target y directly (getting the distribution that has maximum probability of drawing y), to go to the integral (CDF) and maximize its derivative around y. Naturally we don't explicitly maximize the derivative itself, as that would bring us back to the PDF; instead we approximate the derivative around y with a slope approximation (cdf+ - cdf- = cdf_delta, which is a good approximation of the derivative when cdf+ and cdf- are close to each other, like in our case). Thus, maximizing the slope (or the derivative) is exactly equivalent to finding the maximum of the PDF around y. (This is true as long as the CDF has an S shape like the sigmoid, i.e. the distribution is unimodal, like Gaussian, Student, Logistic..) Let's not forget we act on the mean and scale of the distribution, so technically wavenet is going through the complex process of determining the best suiting Gaussian (or MoL) distribution that draws a target y from previous samples x and conditioning c. Genius idea, right?
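
To make the cdf+/cdf- picture concrete, here is a stripped-down, single-logistic sketch of the discretized negative log-likelihood (a simplification, not the repository's full discretized_mix_logistic_loss: no mixture, no edge cases at ±1, no careful numerical-stability branches; the half-bin width assumes targets scaled to [-1, 1] with num_classes quantization levels):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discretized_logistic_nll(y, mean, log_scale, num_classes=2 ** 16):
    """-log P(y falls inside its quantization bin) under Logistic(mean, scale).

    cdf_plus / cdf_min are the CDF evaluated half a bin above / below y,
    so cdf_delta = cdf_plus - cdf_min is the probability mass the model
    puts on the bin containing the target y (the "envelope" above).
    """
    half_bin = 1.0 / (num_classes - 1)        # half of the bin width 2 / (num_classes - 1)
    inv_scale = np.exp(-log_scale)
    centered = y - mean
    cdf_plus = sigmoid(inv_scale * (centered + half_bin))
    cdf_min = sigmoid(inv_scale * (centered - half_bin))
    cdf_delta = cdf_plus - cdf_min
    return -np.log(np.maximum(cdf_delta, 1e-12))  # crude lower bound for stability

# Smallest loss when the mean sits right on y and the scale is small.
print(discretized_logistic_nll(y=0.1, mean=0.1, log_scale=-7.0))
```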

      That is the basis of the MoL loss as well. Once this has been assimilated, the MoL loss is really just a variation with a few extra details. In the next section, I will focus solely on those details. :)

Let's start with training the distribution, as it's a continuation of what was discussed in the previous section:
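
Roughly speaking, the mixture bookkeeping combines each component's discretized log-probability with its mixture weight through a log-sum-exp. A minimal sketch, assuming the usual PixelCNN++-style parameterization (logit_probs, means and log_scales for each of the M components; this is not the exact mixture.py code):

```python
import numpy as np

def mol_nll(y, logit_probs, means, log_scales, num_classes=2 ** 16):
    """Sketch: NLL of a scalar y under a mixture of M discretized logistics.

    Each component contributes log(mixture weight) + log(bin probability);
    the total is a log-sum-exp over components, so gradients reach both the
    components' (mean, scale) heads and the selection logits.
    """
    half_bin = 1.0 / (num_classes - 1)
    inv_scales = np.exp(-log_scales)                      # shape (M,)
    centered = y - means
    cdf_plus = 1.0 / (1.0 + np.exp(-inv_scales * (centered + half_bin)))
    cdf_min = 1.0 / (1.0 + np.exp(-inv_scales * (centered - half_bin)))
    log_bin_probs = np.log(np.maximum(cdf_plus - cdf_min, 1e-12))
    log_weights = logit_probs - np.log(np.sum(np.exp(logit_probs)))   # log-softmax
    joint = log_weights + log_bin_probs
    m = np.max(joint)
    return -(m + np.log(np.sum(np.exp(joint - m))))       # -logsumexp(joint)

# Toy call with M = 3 hypothetical components.
print(mol_nll(0.1, np.zeros(3), np.array([-0.5, 0.1, 0.4]), np.full(3, -5.0)))
```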

There you have it; I believe that's about all there is to know about MoL (and Gaussian distribution sampling) for WaveNet. I hope this modest comment gave you the main intuition needed to go through the code and feel like you know what's going on.

Everyone is invited to add anything I may have missed or discuss anything I explained wrong. ;)

In any case, here are some useful references I found when trying to understand MoL/Gaussian myself in case anyone wants to have some more reading:

StevenZYj commented 6 years ago

@Rayhane-mamah That is a legendary comment for sure! I cannot say how much I appreciate it :D I've literally spent most of my day reading through, searching and understanding these ideas. It is indeed a pain haha, but at the same time I feel like I've learned a lot.

Rayhane-mamah commented 6 years ago

@StevenZYj thanks a lot! :)

I know that sometimes I talk about stuff as if it were evident.. (I saw most of the involved mathematics in college, so I tend to skip some details, assuming they're well known..) Please don't hate me for that x)

If you find anything not explained well enough, please feel free to ask any questions, I'll do my best :) Of course if you also happen to find any additional information, please feel free to share with us!

m-toman commented 6 years ago

Wow, I'll also check out this amazing comment later on. Thanks for that.

EDIT1: Started reading the first paragraphs and I'd like to comment whenever I think it makes sense ;):

Rayhane-mamah commented 6 years ago

Hey @m-toman

@m-toman you seem to have great ways of simplifying things, I look forward to your future notes!

m-toman commented 6 years ago

@Rayhane-mamah Thanks, I've read it through now and the rest sounded pretty straightforward. I think your explanation of the CDF thingy is better for someone with less background knowledge, as mine makes a few more assumptions. Regarding the wizardness, the first time I read the Tacotron paper I thought: how the hell do they come up with all those architectures? Of course, the use of the encoder/decoder-attention is clear (using seq2seq as a replacement for more classic duration models), but how they came up with something like the CBHG eludes me. Deep Learning often seems like a bit too much trial & error to me, coming from a more traditional speech synthesis background where things are a bit more... transparent and deterministic (although in reality a decision tree with thousands of nodes in the classic HMM-based systems isn't that human-readable either...). But I'm still excited by how a single network replaces hundreds of thousands of lines of Festival/Festvox/whatever code and 120 individual steps in some HMM training scripts ;).

Back to the topic, a couple hints that hopefully help others:

Rayhane-mamah commented 6 years ago

@m-toman yeah, it sure feels like trial and error that comes from intuition (or from strong mathematical research? e.g. parallel wavenet doesn't seem to be that intuition-based..)

atreyas313 commented 6 years ago

Hi @Rayhane-mamah, thank you for fixing the bugs in the wavenet vocoder. To train both models sequentially, I used `python train.py --model='Tacotron-2'`. The wavenet loss values are as follows:

```
[2018-08-16 11:45:10.851] Step 1 [14.957 sec/step, loss=1.17336, avg_loss=1.17336]
[2018-08-16 11:45:13.820] Step 2 [8.963 sec/step, loss=0.67372, avg_loss=0.92354]
[2018-08-16 11:45:16.799] Step 3 [6.968 sec/step, loss=0.25406, avg_loss=0.70038]
[2018-08-16 11:45:19.958] Step 4 [6.016 sec/step, loss=0.05745, avg_loss=0.53965]
[2018-08-16 11:45:23.253] Step 5 [5.472 sec/step, loss=-0.32939, avg_loss=0.36584]
[2018-08-16 11:45:26.504] Step 6 [5.101 sec/step, loss=-0.45899, avg_loss=0.22837]
[2018-08-16 11:45:29.797] Step 7 [4.843 sec/step, loss=-0.27051, avg_loss=0.15710]
[2018-08-16 11:45:33.066] Step 8 [4.646 sec/step, loss=-0.14045, avg_loss=0.11990]
[2018-08-16 11:45:36.283] Step 9 [4.487 sec/step, loss=-0.67957, avg_loss=0.03107]
[2018-08-16 11:45:39.526] Step 10 [4.363 sec/step, loss=-0.44299, avg_loss=-0.01633]
[2018-08-16 11:45:42.807] Step 11 [4.265 sec/step, loss=-0.49541, avg_loss=-0.05988]
[2018-08-16 11:45:46.157] Step 12 [4.188 sec/step, loss=-0.80715, avg_loss=-0.12216]
[2018-08-16 11:45:49.488] Step 13 [4.122 sec/step, loss=-0.55638, avg_loss=-0.15556]
[2018-08-16 11:45:52.746] Step 14 [4.061 sec/step, loss=-0.72860, avg_loss=-0.19649]
[2018-08-16 11:45:55.952] Step 15 [4.004 sec/step, loss=-0.46490, avg_loss=-0.21438]
[2018-08-16 11:45:59.138] Step 16 [3.952 sec/step, loss=-0.76029, avg_loss=-0.24850]
[2018-08-16 11:46:02.332] Step 17 [3.908 sec/step, loss=-0.92684, avg_loss=-0.28841]
[2018-08-16 11:46:05.524] Step 18 [3.868 sec/step, loss=-0.77093, avg_loss=-0.31521]
[2018-08-16 11:46:08.719] Step 19 [3.833 sec/step, loss=-0.07051, avg_loss=-0.30233]
[2018-08-16 11:46:11.915] Step 20 [3.801 sec/step, loss=-0.22756, avg_loss=-0.29859]
[2018-08-16 11:46:15.112] Step 21 [3.772 sec/step, loss=-0.92122, avg_loss=-0.32824]
[2018-08-16 11:46:18.307] Step 22 [3.746 sec/step, loss=-0.83789, avg_loss=-0.35141]
[2018-08-16 11:46:21.502] Step 23 [3.722 sec/step, loss=-0.73162, avg_loss=-0.36794]
[2018-08-16 11:46:21.611] Generated 32 train batches of size 3 in 0.107 sec
[2018-08-16 11:46:24.699] Step 24 [3.700 sec/step, loss=-0.70173, avg_loss=-0.38185]
[2018-08-16 11:46:27.897] Step 25 [3.680 sec/step, loss=-0.64619, avg_loss=-0.39242]
[2018-08-16 11:46:31.090] Step 26 [3.661 sec/step, loss=-0.94044, avg_loss=-0.41350]
[2018-08-16 11:46:34.288] Step 27 [3.644 sec/step, loss=-0.73025, avg_loss=-0.42523]
[2018-08-16 11:46:37.480] Step 28 [3.628 sec/step, loss=-0.69240, avg_loss=-0.43477]
[2018-08-16 11:46:40.676] Step 29 [3.613 sec/step, loss=-0.62764, avg_loss=-0.44142]
[2018-08-16 11:46:43.868] Step 30 [3.599 sec/step, loss=-0.61241, avg_loss=-0.44712]
[2018-08-16 11:46:47.057] Step 31 [3.586 sec/step, loss=-0.73374, avg_loss=-0.45637]
[2018-08-16 11:46:50.250] Step 32 [3.573 sec/step, loss=-0.84518, avg_loss=-0.46852]
[2018-08-16 11:46:53.442] Step 33 [3.562 sec/step, loss=-0.74587, avg_loss=-0.47692]
[2018-08-16 11:46:56.641] Step 34 [3.551 sec/step, loss=-0.72382, avg_loss=-0.48418]
[2018-08-16 11:46:59.835] Step 35 [3.541 sec/step, loss=-1.03929, avg_loss=-0.50004]
[2018-08-16 11:47:03.029] Step 36 [3.531 sec/step, loss=-1.12073, avg_loss=-0.51729]
[2018-08-16 11:47:06.226] Step 37 [3.522 sec/step, loss=-0.23181, avg_loss=-0.50957]
[2018-08-16 11:47:09.421] Step 38 [3.514 sec/step, loss=-0.56894, avg_loss=-0.51113]
[2018-08-16 11:47:12.609] Step 39 [3.505 sec/step, loss=-0.46203, avg_loss=-0.50987]
[2018-08-16 11:47:15.801] Step 40 [3.497 sec/step, loss=-0.68721, avg_loss=-0.51431]
[2018-08-16 11:47:18.995] Step 41 [3.490 sec/step, loss=-0.74064, avg_loss=-0.51983]
[2018-08-16 11:47:22.184] Step 42 [3.483 sec/step, loss=-0.69302, avg_loss=-0.52395]
[2018-08-16 11:47:25.382] Step 43 [3.476 sec/step, loss=-1.06822, avg_loss=-0.53661]
[2018-08-16 11:47:28.576] Step 44 [3.470 sec/step, loss=-0.70099, avg_loss=-0.54034]
[2018-08-16 11:47:31.767] Step 45 [3.464 sec/step, loss=-0.85478, avg_loss=-0.54733]
[2018-08-16 11:47:34.960] Step 46 [3.458 sec/step, loss=-0.66613, avg_loss=-0.54991]
[2018-08-16 11:47:38.233] Step 47 [3.454 sec/step, loss=-0.89251, avg_loss=-0.55720]
[2018-08-16 11:47:41.457] Step 48 [3.449 sec/step, loss=-0.77377, avg_loss=-0.56172]
[2018-08-16 11:47:44.656] Step 49 [3.444 sec/step, loss=-0.78088, avg_loss=-0.56619]
[2018-08-16 11:47:47.876] Step 50 [3.439 sec/step, loss=-0.65642, avg_loss=-0.56799]
```

Are the negative values in Loss correct?

Rayhane-mamah commented 6 years ago

Hey @atreyas313, yes, that is normal, assuming you are using "raw" with 2 output channels (which uses a single Gaussian distribution). As explained in my first comment, we minimize the negative log probability of y. With good predictions this probability gets bigger and the loss gets smaller, eventually going below zero (a bigger absolute value under 0). So yeah, that's normal :)

If you prefer to use MoL instead, change the output_channels parameter to M * 3, where M is your chosen number of Logistic distributions (usually 10).
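
As a toy check of why the value can go below zero (continuous densities can exceed 1, so their negative log can be negative; this plain-Gaussian snippet is only an illustration, not the repository's loss code):

```python
import numpy as np

def gaussian_nll(y, mean, log_scale):
    """Negative log density of y under N(mean, scale**2)."""
    scale = np.exp(log_scale)
    return 0.5 * np.log(2.0 * np.pi) + log_scale + 0.5 * ((y - mean) / scale) ** 2

# A sharp, well-centered Gaussian has density > 1 at its mean,
# so the negative log-likelihood goes negative.
print(gaussian_nll(y=0.0, mean=0.0, log_scale=-5.0))   # about -4.08
```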

solmn commented 6 years ago

Hey @Rayhane-mamah

> scale our y with a factor of 2/(2**16 - 1)

Sorry for my silly question: does this scaling take place in the librosa.load(...) function, or do we have to scale it manually? The only scaling written in the code is `wav = wav / np.abs(wav).max() * hparams.rescaling_max`

begeekmyfriend commented 5 years ago

I do not think rescaling is a good idea for preprocessing, since there might be exceptional peaks in some corpora, and the outputs of rescaling might then be abnormal for training.

MorganCZY commented 5 years ago

@begeekmyfriend Hi, what do you mean by "rescaling"? Could you point out where it appears in preprocess.py?

begeekmyfriend commented 5 years ago

On this line

kobenaxie commented 5 years ago

If the audio is scaled to [-2, 2] rather than [-1, 1], should I just clip the sampled prediction to [-2, 2]? Is any modification needed in the discretized MoL loss file mixture.py?

mindmapper15 commented 4 years ago

Hi @Rayhane-mamah, sorry to bother you, but the URL of the Gumbel-Max Trick post you linked above has changed.

https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/

d-gil commented 4 years ago

This thread has been amazing.

Could we borrow some of the ideas behind MoL to model the output distribution with a mixture of Gaussians? I ask because I haven't had success with MoL for my problem.

zh794390558 commented 2 years ago

Hey @Rayhane-mamah

> scale our y with a factor of 2/(2**16 - 1)
>
> Sorry for my silly question: does this scaling take place in the librosa.load(...) function, or do we have to scale it manually? The only scaling written in the code is `wav = wav / np.abs(wav).max() * hparams.rescaling_max`

What is the meaning of the factor 2/(2**16 - 1)?

zh794390558 commented 2 years ago

```python
mid_in = inv_stdv * centered_y
# log probability in the center of the bin, to be used in extreme cases
# (not actually used in this code)
log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in)
```

I think log(Logistic_pdf) = -mid_in - log_scales - 2. * softplus(-mid_in); can anyone help me understand this?
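
For what it's worth, the two forms agree: writing z = inv_stdv * centered_y = mid_in and s for the scale, the logistic log-density can be rearranged using the softplus identity softplus(z) - softplus(-z) = z:

```latex
\log f(y) = -z - \log s - 2\log\bigl(1 + e^{-z}\bigr)
          = -z - \log s - 2\,\mathrm{softplus}(-z)
          = -z - \log s - 2\bigl(\mathrm{softplus}(z) - z\bigr)
          = z - \log s - 2\,\mathrm{softplus}(z)
```

which matches `mid_in - log_scales - 2. * tf.nn.softplus(mid_in)` as written in the quoted code.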

jzhang38 commented 1 year ago

Hi @Rayhane-mamah, thanks for your legendary answer! While I have more or less grasped your ideas, I have another question that has bothered me for days: why use an approximation of the PDF in the first place during training? My guess for the MoL case is that it leads to a more straightforward formulation, since the CDF of the logistic distribution is easier to calculate than the PDF. But what about the Gaussian case? Why not directly use the PDF to calculate the MLE loss?

univanxx commented 1 year ago

Hello! Thank you for your amazing post, but I still have some questions to think about.

As I understand it, you sourced the discretized_mix_logistic_loss function from the official implementation of PixelCNN++, but you don't update the means when computing the log prob as in the original code.

I'd like to know why you decided to take that out?

j-sheikh commented 1 year ago

Hi @Rayhane-mamah , really nice work, and thanks for the explanations so far.

I have a further question about the training of the distributions and would be glad if you could help me. So with the cdf_delta you basically decide which distribution to choose. But what does that actually mean during backpropagation, especially in terms of the layers predicting the mean, scale, and logit_probs? After all, each distribution must be influenced differently, otherwise all distributions would converge to the same "optimum", or not?

Thank you for your time.