depthfirstlearning / depthfirstlearning.com


InfoGAN: Add solutions to more questions #5

Open · avital opened this issue 6 years ago

maxisawesome commented 5 years ago

I'd like to write up a solution for question 3.1, but I don't quite understand the continuous latent variable loss. Short version: I don't understand this line of code: log_prob_cont = tf.reduce_sum(cont_fake.log_prob(z_cont)) / NB, or this part of the solution: "What is log Q(c|x) when c is a Gaussian centered at fθ(x)?" I'm not sure how the Gaussian acts under the log. Q(c|x) returns the mean and std dev of a normal distribution, so cont_fake is a distribution, but z_cont is a sample from a distribution, i.e. a single number per code. How are these two things compared under log_prob?
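To make sure I'm asking the right thing, here's my tentative mental model of log_prob on a scalar sample (assuming tfd is tensorflow_probability.distributions, as in the notebook):

import tensorflow_probability as tfp
tfd = tfp.distributions

dist = tfd.Normal(loc=0.0, scale=1.0)  # a distribution object, like cont_fake
dist.log_prob(0.5)  # log of the Gaussian p.d.f. evaluated at the point 0.5, ~= -1.044

Is that right - that log_prob takes a point and returns the log-density of the distribution at that point?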

Here's a longer version. If you can answer the short version, this might be redundant. Here is some code related to the continuous and categorical losses from the InfoGAN colab:

# (From the notebook; assumes tf, slim, and tfd = tensorflow_probability.distributions
# are already in scope.)
def discriminator(x, cat_dim, cont_dim):
      # ... (conv layers producing `encoder`, and the real/fake head q_real, elided)

      # Categorical head: cat_dim logits parameterize Q(c_cat | x).
      logits_cat = slim.fully_connected(encoder, cat_dim, activation_fn=None)
      q_cat = tfd.Categorical(logits=logits_cat)

      # Continuous head: two numbers per continuous code (mean and pre-softplus scale).
      cont_vars = slim.fully_connected(encoder, cont_dim * 2, activation_fn=None)
      cont_mu = cont_vars[:, :cont_dim]
      if fix_cont_std:  # notebook flag: optionally fix the std dev to 1
        cont_sigma = tf.ones_like(cont_mu)
      else:
        cont_sigma = tf.nn.softplus(cont_vars[:, cont_dim:])
      q_cont = tfd.Normal(loc=cont_mu, scale=cont_sigma)
      return q_real, q_cat, q_cont

# Sample the continuous codes from a Uniform(-1, 1) prior.
z_cont = tfd.Uniform(low=-tf.ones([NB, cont_dim]),
                     high=tf.ones([NB, cont_dim])).sample()

d_fake, cat_fake, cont_fake = discriminator(generated, cat_dim, cont_dim)

# z_cat (the sampled categorical codes) is defined elsewhere in the notebook.
log_prob_cat = tf.reduce_sum(cat_fake.log_prob(z_cat)) / NB
log_prob_cont = tf.reduce_sum(cont_fake.log_prob(z_cont)) / NB

So for the categorical case, I understand what's going on. We use a fully connected layer to output a length-cat_dim vector of logits, one per class. We can later call log_prob to get the loss: -log(p_i), where p_i is the softmax probability of the correct class.
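To check my understanding of the categorical case with a self-contained example (again assuming tfd is tensorflow_probability.distributions):

import tensorflow_probability as tfp
tfd = tfp.distributions

cat = tfd.Categorical(logits=[2.0, 0.5, -1.0])
cat.log_prob(0)  # log(softmax([2.0, 0.5, -1.0])[0]) ~= -0.241

So log_prob on a Categorical gives the log of the softmax probability assigned to the given class index.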

For the continuous codes, we predict two numbers for each continuous variable (we have 2 continuous variables, so we predict 4 numbers). They become the mean and std dev of a distribution in q_cont = tfd.Normal(loc=cont_mu, scale=cont_sigma). We also have the option of fixing the std dev to 1, and I don't fully understand why we're allowed to do that. But regardless, q_cont is a distribution, yet it's later evaluated against a sample from a different distribution in this line: log_prob_cont = tf.reduce_sum(cont_fake.log_prob(z_cont)) / NB. What does that compute? What are we measuring as a loss? If the normal distribution peaks exactly on the sampled point, that would be the least loss, correct?
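To test that intuition numerically (again with tensorflow_probability):

import tensorflow_probability as tfp
tfd = tfp.distributions

q = tfd.Normal(loc=0.3, scale=1.0)  # stand-in for cont_fake predicting mean 0.3
q.log_prob(0.3)  # ~= -0.919, the highest log-density a sigma=1 Gaussian can assign
q.log_prob(0.9)  # ~= -1.099, lower because the sample is 0.6 away from the mean

If I'm reading this right, -log_prob is smallest when the predicted mean lands exactly on the sampled code.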

Statistics is the weakest part of my understanding of deep learning, and it seems like I'm misunderstanding probability distributions here.

avital commented 5 years ago

Hi Max, thanks for the interest in DFL!

Looking back at our content, I think the "solution" (actually a hint) we have for question 3.1 is misleading. Reading it as is, I don't understand it myself :)

Here's what it should probably say:

If Q(c|x) is a factorized Gaussian centered at f_theta(x), what does log Q(c|x) evaluate to? The answer should be a function of the parameters of the Gaussian. (What are the parameters of a factorized Gaussian?) What about when Q(c|x) is a categorical distribution parameterized by the output of a softmax?

With regard to the larger portion of your question, it's actually preferable not to look at the code for this one. Code can be helpful, but ideally students can understand the math in a paper and then implement it themselves. That's a necessary skill if one is interested in extending techniques rather than using them in the one way someone happened to implement them. (For example, in this case, could you generalize InfoGAN to broader distribution classes for Q(c|x) that are neither categorical nor factorized Gaussians?)

In the codebase, Q(c|x) returns a mean vector and a vector for a diagonal covariance matrix. But that's not what Q(c|x) "is". Q(c|x) is a probability distribution. There's a convention in statistics where evaluating a probability distribution at a particular point is understood to mean evaluating the distribution's p.d.f. at that point. Can you now answer the question by referring back to the p.d.f. of a factorized Gaussian distribution (and not the code)? What does log Q(c|x) turn out to be?
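For reference (this is just the textbook definition, nothing InfoGAN-specific), a factorized Gaussian over d dimensions with mean vector \mu and per-dimension standard deviations \sigma has p.d.f.

$$Q(c \mid x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(c_i - \mu_i)^2}{2\sigma_i^2}\right)$$

with \mu and \sigma computed from x by the network. Taking the log of that product is essentially the whole exercise.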

Let me know if this helps and whether you're able to answer the question. If so, it would be awesome if you improved the content we have (either improve the hint, add the full answer, or both).

Thank you!!