affinelayer / pix2pix-tensorflow

Tensorflow port of Image-to-Image Translation with Conditional Adversarial Nets https://phillipi.github.io/pix2pix/
MIT License

Still not understanding expected generator_loss_GAN #119

Open safi-manar opened 6 years ago

safi-manar commented 6 years ago

After reading #99, #45, and #116, I'm still unclear on what is expected in the generator_loss_GAN graph.

It seems that @julien2512 has written that we expect generator_loss_GAN to decrease.

However, at the end of #99, you see the comment:

The paper states:

As suggested in the original GAN paper, rather than training G to minimize log(1 − D(x, G(x, z))), we instead train to maximize log D(x, G(x, z)).

This seems to suggest we do expect the generator_loss_GAN to increase. In this case, it is not exactly the negative discriminator loss.

In the code:

        gen_loss_GAN = tf.reduce_mean(-tf.log(predict_fake + EPS))
        gen_loss_L1 = tf.reduce_mean(tf.abs(targets - outputs))
        gen_loss = gen_loss_GAN * a.gan_weight + gen_loss_L1 * a.l1_weight

gen_loss is the sum of the two terms, so it seems like we are minimizing gen_loss_GAN. What am I missing here?

julien2512 commented 6 years ago

Nope, I said it depends on the dataset. I saw a dataset diverge because it was too hard to find a solution for that dataset. It is important to have reasonably random data, without biases.

Let me try to answer your question. There are several measures and two different gradients.

The measures are:

gen_loss_GAN: predict_fake
gen_loss_L1: targets vs outputs
discrim_loss: predict_real vs predict_fake

The gradients are applied to:

discrim_loss: predict_real vs predict_fake
gen_loss: the weighted sum of the predict_fake term and the targets-vs-outputs term

The first tries to make predict_fake go down (so gen_loss_GAN goes up):

    with tf.name_scope("discriminator_loss"):
        # minimizing -tf.log will try to get inputs to 1
        # predict_real => 1
        # predict_fake => 0
        discrim_loss = tf.reduce_mean(-(tf.log(predict_real + EPS) + tf.log(1 - predict_fake + EPS)))

This works because predict_real and predict_fake use the same discriminator weights, applied respectively to the actual targets and to the generator's outputs, thanks to with tf.variable_scope("discriminator", reuse=True).
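Side by side, the two calls follow this pattern (the fake half is quoted verbatim later in this thread; the real half is assumed to mirror it without reuse):

    with tf.name_scope("real_discriminator"):
        with tf.variable_scope("discriminator"):
            # 2x [batch, height, width, channels] => [batch, 30, 30, 1]
            predict_real = create_discriminator(inputs, targets)

    with tf.name_scope("fake_discriminator"):
        with tf.variable_scope("discriminator", reuse=True):
            # same weights as above, applied to the generator's outputs
            predict_fake = create_discriminator(inputs, outputs)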

The second tries to make predict_fake go up (so gen_loss_GAN goes down):

    with tf.name_scope("generator_loss"):
        # predict_fake => 1
        # abs(targets - outputs) => 0
        gen_loss_GAN = tf.reduce_mean(-tf.log(predict_fake + EPS))
        gen_loss_L1 = tf.reduce_mean(tf.abs(targets - outputs))
        gen_loss = gen_loss_GAN * a.gan_weight + gen_loss_L1 * a.l1_weight

So you can see the discriminator is trying to minimize predict_fake, and the generator is trying to maximize it.

But predict_fake is:

        with tf.variable_scope("discriminator", reuse=True):
            # 2x [batch, height, width, channels] => [batch, 30, 30, 1]
            predict_fake = create_discriminator(inputs, outputs)

with outputs:

    with tf.variable_scope("generator"):
        out_channels = int(targets.get_shape()[-1])
        outputs = create_generator(inputs, out_channels)

That means predict_fake is directly connected to the generator. So to speak: the discriminator's update works against the generator, and the generator's update works against the discriminator.
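Concretely, each loss only updates its own network's variables, so the two objectives can pull predict_fake in opposite directions without fighting over the same weights. A simplified sketch of the two training ops in the style of the repo (TF 1.x; details such as the loss moving averages are omitted, and names like a.lr and a.beta1 follow the repo's argument object):

    with tf.name_scope("discriminator_train"):
        # only the discriminator's variables receive this gradient
        discrim_tvars = [var for var in tf.trainable_variables()
                         if var.name.startswith("discriminator")]
        discrim_optim = tf.train.AdamOptimizer(a.lr, a.beta1)
        discrim_train = discrim_optim.minimize(discrim_loss, var_list=discrim_tvars)

    with tf.name_scope("generator_train"):
        with tf.control_dependencies([discrim_train]):
            # only the generator's variables receive this gradient,
            # and the generator step runs after the discriminator step
            gen_tvars = [var for var in tf.trainable_variables()
                         if var.name.startswith("generator")]
            gen_optim = tf.train.AdamOptimizer(a.lr, a.beta1)
            gen_train = gen_optim.minimize(gen_loss, var_list=gen_tvars)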

On the other hand, the paper says on page 3: G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G), "where G tries to minimize this objective against an adversarial D that tries to maximize it".
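Typeset, the objective from the paper is:

    \mathcal{L}_{cGAN}(G, D) =
        \mathbb{E}_{x,y}\big[\log D(x, y)\big]
      + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

    \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_1\big]

    G^* = \arg\min_G \max_D \ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)

And the code implements the "maximize log D" trick for the generator: minimizing -log(predict_fake) is the same thing as maximizing log(predict_fake), so gen_loss_GAN is still a quantity we minimize. That is the piece the opening question was missing.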

Aren't we good?

Lienes commented 6 years ago

Could you please explain the role of a.gan_weight and a.l1_weight in: gen_loss = gen_loss_GAN * a.gan_weight + gen_loss_L1 * a.l1_weight

julien2512 commented 6 years ago

@Lienes

When you take a gradient step on a + b, the gradients of a and b contribute to the step with equal weight.

When you take a gradient step on l·a + k·b, the gradient of a is scaled by l and the gradient of b is scaled by k.

In other words, a.gan_weight and a.l1_weight are used to adjust how strongly the gradient step follows gen_loss_GAN versus gen_loss_L1, respectively.
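A tiny runnable check of this, in the TF 1.x style of the repo (a_term and b_term are stand-in scalars; 1.0 and 100.0 match what I believe are the default gan_weight and l1_weight):

    import tensorflow as tf

    x = tf.Variable(0.5)
    a_term = tf.square(x)        # stand-in for gen_loss_GAN
    b_term = tf.abs(x - 1.0)     # stand-in for gen_loss_L1
    l, k = 1.0, 100.0            # gan_weight, l1_weight

    grad_a, = tf.gradients(l * a_term, x)  # contribution scaled by l
    grad_b, = tf.gradients(k * b_term, x)  # contribution scaled by k
    grad_total, = tf.gradients(l * a_term + k * b_term, x)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # prints [1.0, -100.0, -99.0]: the weighted L1 term dominates the step
        print(sess.run([grad_a, grad_b, grad_total]))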

Lienes commented 6 years ago

I am learning this by doing a little math. I don't have much experience with ML and TF, but here are my assumptions about how to interpret the TensorBoard trend of d_loss. If something is wrong, could you elaborate? I understand so far that F = -(log(r) + log(1 - f)) tends to 0 when r -> 1 and f -> 0, i.e. when r = 1 - f. So if predict_real r -> 0 and f -> 1, then F increases. So when d_loss increases, it means predict_fake dominates, so the generator is actually winning. Am I correct that the possible range of r and f in the code is [0, 1], meaning they can be interpreted as a probability distribution?

julien2512 commented 6 years ago

"I understand so far that F = -(log(r) + log(1 - f)) tends to 0 when r -> 1 and f -> 0. So if predict_real r -> 0 and f -> 1, then F increases. So when d_loss increases, it means predict_fake dominates, so the generator is actually winning." I agree.

"Am I correct that the possible range of r and f in the code is [0, 1], meaning they can be interpreted as a probability distribution?" I disagree. Your question should rather be: why are r and f in the range [0, 1]?

predict_fake is defined here:

    with tf.name_scope("fake_discriminator"):
        with tf.variable_scope("discriminator", reuse=True):
            # 2x [batch, height, width, channels] => [batch, 30, 30, 1]
            predict_fake = create_discriminator(inputs, outputs)

and the last operation in create_discriminator is a sigmoid:

        # layer_5: [batch, 31, 31, ndf * 8] => [batch, 30, 30, 1]
        with tf.variable_scope("layer_%d" % (len(layers) + 1)):
            convolved = discrim_conv(rectified, out_channels=1, stride=1)
            output = tf.sigmoid(convolved)
            layers.append(output)

which gives a result in the range [0, 1] (the same holds for predict_real).

At any time t, f will not be exactly 1 - r, and probably never will be. But the gradient algorithm tries to make it happen!

f and r are used to make the discriminator give 1 for real data and 0 for fake data (note the direction: predict_real => 1, predict_fake => 0, as in the code comments above). That is what discrim_loss = tf.reduce_mean(-(tf.log(predict_real + EPS) + tf.log(1 - predict_fake + EPS))) is intended to do. If you need an interpretation, r and f measure how well the discriminator is doing.
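To make that interpretation concrete, here is what the discriminator loss evaluates to at a few notable operating points (plain Python, ignoring EPS and the mean over the 30x30 patch of predictions):

    import math

    def d_loss(r, f):
        # mirrors -(tf.log(predict_real) + tf.log(1 - predict_fake))
        # for a single (r, f) pair
        return -(math.log(r) + math.log(1 - f))

    print(d_loss(0.99, 0.01))  # ~0.02 : discriminator is winning
    print(d_loss(0.50, 0.50))  # ~1.39 : discriminator at chance, 2*ln(2)
    print(d_loss(0.01, 0.99))  # ~9.21 : discriminator is being fooled

A d_loss hovering around 1.39 therefore means the discriminator is roughly at chance, which is useful to keep in mind when reading TensorBoard charts.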

Lienes commented 6 years ago

Ahh, yes, there is a sigmoid on the very last layer, so the output can't be anything other than 0 to 1. What is really confusing for me is that the two discriminator outputs r and f can each be a probability without r + f = 1 holding.

julien2512 commented 6 years ago

@Lienes The discriminator does not make r and f a probability distribution. You are training it with a gradient algorithm to minimize discrim_loss = tf.reduce_mean(-(tf.log(predict_real + EPS) + tf.log(1 - predict_fake + EPS)))

Your math tells you it is better to have predict_real = 1 - predict_fake at a solution. But the discriminator and generator are made of some 50,000,000 parameters that have to get there. It's more physics than math here!

Lienes commented 6 years ago

Yes, understood. The last thing I need to make sense of is those gradient charts. I have this triple: [three TensorBoard screenshots attached]

The other thing I misunderstand is d_loss. I see that it will try to maximize predict_real and minimize predict_fake. But then which value actually indicates whether the generator is good or not?

julien2512 commented 6 years ago

If you knew exactly how to measure how well the generator works, you probably wouldn't need deep learning at all.

Maybe this article will help you understand better: http://colah.github.io/posts/2014-10-Visualizing-MNIST/

To answer your question for pix2pix-tensorflow: generator_loss_L1 is a good tool for judging how good the generator is. It is a standard approach, as L1 simply means "distance between corresponding points".

But the value I think you are looking for is gen_loss = gen_loss_GAN * a.gan_weight + gen_loss_L1 * a.l1_weight, which is not added to TensorBoard by default.

You would have to change a few things in the code if you really want to see it (extend the model, add an extra summary scalar), for example as sketched below.
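Something along these lines (a sketch, not the repo's code as-is; it assumes you expose gen_loss on the Model tuple returned by create_model and add a scalar next to the existing loss summaries):

    # in create_model(), also return the combined loss:
    gen_loss = gen_loss_GAN * a.gan_weight + gen_loss_L1 * a.l1_weight

    # in main(), next to the existing scalar summaries:
    tf.summary.scalar("generator_loss", model.gen_loss)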

Lienes commented 6 years ago

Yes, conceptually I think I understand. But I can't figure out why gen_loss_L1 is actually so small: in my case it ranges between 0.120 and 0.04. I looked through all the code and I see that the targets are images. If gen_loss_L1 = tf.reduce_mean(tf.abs(targets - outputs)) is the difference between the generated and actual image, why is that reduce_mean so small? For 256x256 generation, is it calculated like 256/256 = 1 if all pixels fail and ~1/256 if all pixels fit?

julien2512 commented 6 years ago

raw_input = tf.image.convert_image_dtype(raw_input, dtype=tf.float32) https://www.tensorflow.org/api_docs/python/tf/image/convert_image_dtype

Pixel values are converted into the range [0, 1] (for each channel).
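So gen_loss_L1 is the mean absolute per-pixel difference, on the same scale as the pixel values themselves: 0 if every pixel matches, and nothing is divided by 256. A quick NumPy check of what tf.reduce_mean(tf.abs(targets - outputs)) computes (stand-in arrays, not the repo's pipeline):

    import numpy as np

    # stand-ins for a 256x256 RGB target and output with values in [0, 1]
    targets = np.random.rand(256, 256, 3)
    outputs = np.clip(targets + np.random.normal(scale=0.05, size=targets.shape),
                      0.0, 1.0)

    l1 = np.abs(targets - outputs).mean()
    print(l1)  # ~0.04: the average per-pixel error, matching the values you saw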