2017-fall-DL-training-program / VAE-GAN-and-VAE-GAN

An assignment to learn how to implement three different kinds of generative models

Detach in Lab3-2 & 3-3 #20

Open pandasfang opened 6 years ago

pandasfang commented 6 years ago

Dear TA:

In Lab3-2, why don't we need to detach the Discriminator when we backpropagate through the Generator?

############################
# (2) Update G network: maximize log(D(G(z)))
###########################
netG.zero_grad()
labelv = Variable(label.fill_(real_label))  # fake labels are real for generator cost
output = netD(fake)
errG = criterion(output, labelv)
errG.backward()
D_G_z2 = output.data.mean()
optimizerG.step()

a514514772 commented 6 years ago

Hi @pandasfang,

In optimizerG = optim.Adam(netG.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999)), we tell the optimizer to update only the generator's parameters. That is, although netD receives gradients during errG.backward(), it is never updated by optimizerG.step(), so we don't have to detach it.

Now, you may have another question: why do we call detach in the line output = netD(fake.detach()) when updating the discriminator? The answer is that detach is not strictly necessary there either.
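
For context, that line comes from the discriminator-update step. Here is a rough sketch of that step (based on the standard DCGAN training loop; names like noise, real_cpu, real_label and fake_label are assumptions, not necessarily the exact lab code):

############################
# (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
###########################
netD.zero_grad()

# train with real
inputv = Variable(real_cpu)
labelv = Variable(label.fill_(real_label))
output = netD(inputv)
errD_real = criterion(output, labelv)
errD_real.backward()

# train with fake
fake = netG(Variable(noise))
labelv = Variable(label.fill_(fake_label))
output = netD(fake.detach())   # detach here: no gradients flow back into netG
errD_fake = criterion(output, labelv)
errD_fake.backward()
optimizerD.step()              # only netD's parameters are updated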

Consider the following example, which is a very simple auto-encoder.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

# A toy two-layer auto-encoder: fc1 plays the role of one sub-network
# (e.g. the generator), fc2 the other (e.g. the discriminator).
fc1 = nn.Linear(1, 2)
fc2 = nn.Linear(2, 1)
opt1 = optim.Adam(fc1.parameters(), lr=1e-1)
opt2 = optim.Adam(fc2.parameters(), lr=1e-1)

x = Variable(torch.FloatTensor([5]))

# First pass: backward WITHOUT detach.
z = fc1(x)
x_p = fc2(z)
cost = (x_p - x) ** 2
'''
print(z)
print(x_p)
print(cost)
'''
opt1.zero_grad()
opt2.zero_grad()

cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)

for n, p in fc2.named_parameters():
    print(n, p.grad.data)

# Second pass: backward WITH z.detach(), so no gradient reaches fc1.
opt1.zero_grad()
opt2.zero_grad()

z = fc1(x)
x_p = fc2(z.detach())
cost = (x_p - x) ** 2

cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)

for n, p in fc2.named_parameters():
    print(n, p.grad.data)

The output would be:

weight 
 12.0559
 -8.3572
[torch.FloatTensor of size 2x1]

bias 
 2.4112
-1.6714
[torch.FloatTensor of size 2]

weight 
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]

bias 
-9.9940
[torch.FloatTensor of size 1]

================================================

weight 
 0
 0
[torch.FloatTensor of size 2x1]

bias 
 0
 0
[torch.FloatTensor of size 2]

weight 
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]

bias 
-9.9940
[torch.FloatTensor of size 1]

You can see that the gradients of fc2 are identical whether or not we detach the output of fc1; detaching only stops gradients from reaching fc1. Once we know the discriminator's gradients aren't affected, we can simply use optimizerD (which only updates the parameters of the discriminator) to update netD without worrying about the generator, even if we don't detach. However, not detaching incurs some extra computational cost, because the backward pass also propagates gradients through the parts you don't need.
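
As a quick check (a sketch reusing fc1, fc2 and opt2 from the example above; the comparison itself is mine, not part of the lab code), you can verify that opt2.step() never touches fc1 even when fc1 receives gradients:

opt1.zero_grad()
opt2.zero_grad()

w_before = fc1.weight.data.clone()

z = fc1(x)
x_p = fc2(z)                 # no detach: gradients also flow back into fc1
cost = (x_p - x) ** 2
cost.backward()

opt2.step()                  # opt2 only knows about fc2's parameters

print(torch.equal(w_before, fc1.weight.data))  # True: fc1 is unchanged
# The backward pass through fc1 is the extra computation you pay for not
# detaching; z.detach() simply skips it.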

Thanks

a514514772 commented 6 years ago

I think it's a good question, and you guys can verify whether what I said is right (maybe I'm wrong because I'm still learning, too :) ).

If possible, please keep this thread open; I think it would be helpful for people who want to learn more about detach.

You're also very welcome to discuss this with me.

Thanks

yyrkoon27 commented 6 years ago

Soumith's reply in this thread might also clarify things a little bit: https://github.com/pytorch/examples/issues/116

a514514772 commented 6 years ago

Hi @yyrkoon27 ,

In this case, that's right. In VAE-GAN, however, detach may be needed for correctness if you use, for example, opt1 = optim.RMSprop(G.parameters(), lr=1e-1) where G consists of an encoder and a decoder: a loss meant to train only one of them would otherwise update both, because the optimizer covers the parameters of both.
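
Here is a minimal sketch of that situation (Enc, Dec and the single RMSprop optimizer over both are my own stand-ins, not the lab code):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

Enc = nn.Linear(4, 2)                          # stands in for the encoder
Dec = nn.Linear(2, 4)                          # stands in for the decoder
G = nn.ModuleList([Enc, Dec])
opt1 = optim.RMSprop(G.parameters(), lr=1e-1)  # covers BOTH modules

x = Variable(torch.randn(1, 4))

# Suppose this loss is meant to update the decoder only.
opt1.zero_grad()
z = Enc(x)
loss = Dec(z.detach()).pow(2).mean()           # detach keeps gradients out of Enc
loss.backward()
opt1.step()
# Without the detach, Enc would also receive gradients, and because opt1
# covers its parameters, opt1.step() would update the encoder as well.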