Issue learning latent encoding for new faces

njordsir commented 4 years ago

I am trying to derive latent encodings for custom faces, as done in https://github.com/Puzer/stylegan-encoder.

Here are the details after porting the same to pytorch:

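import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from PIL import Image
from torchvision import models
from tqdm import tqdm

from models.stylegan_generator import StyleGANGenerator

#everything below runs on the GPU
device = torch.device("cuda")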

#load the pre-trained synthesis network
m_synth = StyleGANGenerator("stylegan_ffhq").model.synthesis.cuda().eval()

#process the output of the synthesis module
class PostProcAfterSynth(nn.Module):
    def __init__(self):
        super(PostProcAfterSynth, self).__init__()
    def forward(self, gen_img):
        #remap to [0,1]
        return (gen_img+1)/2

post_proc_layer = PostProcAfterSynth()

#preprocess the generated image before feeding into perceptual model    
class PreProcBeforePerception(nn.Module):
    def __init__(self, img_size):
        super(PreProcBeforePerception, self).__init__()
        self.img_size = img_size
        self.mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(-1, 1, 1)
        self.std = torch.tensor([0.229, 0.224, 0.225], device=device).view(-1, 1, 1)
    def forward(self, gen_img):
        #resize input image
        gen_img = F.adaptive_avg_pool2d(gen_img, self.img_size)
        #normalize
        gen_img = (gen_img - self.mean) / self.std
        return gen_img

pre_proc_layer = PreProcBeforePerception(img_size=256)

#use pre-trained vgg model for feature extraction
m_vgg = models.vgg16(pretrained=True).features[:16].to(device).eval()

#set up the model
model = nn.Sequential(m_synth)
model.add_module(str(1), post_proc_layer)
model.add_module(str(2), pre_proc_layer)
model.add_module(str(3), m_vgg)

for param in model.parameters():
    param.requires_grad_(False)

print(m_vgg)

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace)
)

As done by Puzer, I select the [conv->conv->pool->conv->conv->pool->conv->conv->conv] section of the vgg network for feature extraction.

Pre-computing the features for the reference image:

ref_img_path = "."
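#force 3 channels so the per-channel normalization below also works for RGBA/grayscale files
ref_img = np.array(Image.open(ref_img_path).convert("RGB"))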
ref_img = ref_img.astype(np.float32)/255.
ref_img = np.array([np.transpose(ref_img, (2,0,1))])
ref_img = torch.tensor(ref_img, device=device)
ref_img = pre_proc_layer(ref_img)
ref_img_features = m_vgg(ref_img).detach()

Optimization:

trainable_latent = torch.randn((1,18,512), device=device).requires_grad_(True)
loss_func = torch.nn.MSELoss()

optimizer = optim.SGD([trainable_latent], lr=0.5)

losses = []
for i in tqdm(range(1000)):
    optimizer.zero_grad()
    gen_img_features = model(trainable_latent)
    loss = loss_func(gen_img_features, ref_img_features)
    #store a plain Python float rather than a tensor
    loss_val = loss.item()
    losses.append(loss_val)
    loss.backward()
    optimizer.step()

The learnt latent encoding and the images generated from it are of poor quality. The results are nowhere near as crisp as Puzer's.

What I have tried:

Learning Z space latent instead of WP+
Various combinations of optimizer, learning rate, and number of iterations

What could be wrong:

There might be issues with my pipeline above (I am new to PyTorch).
There might be some difference between the pre-trained VGG networks for PyTorch and Keras that I have failed to take into account.
The perceptual model used is not complex enough (but it does work for Puzer).
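
On the second point, the two stock VGG16 pipelines do expect differently prepared inputs. torchvision's pretrained models take RGB images in [0, 1], normalised by a per-channel mean and standard deviation, while `keras.applications.vgg16.preprocess_input` in its default mode converts RGB to BGR and subtracts a per-channel mean on the 0-255 scale without dividing by a standard deviation. The sketch below reproduces both conventions in NumPy for comparison. The function names are hypothetical, and whether stylegan-encoder relies on the Keras default is an assumption worth checking against its source.

```python
import numpy as np

# torchvision convention: RGB in [0, 1], per-channel mean and std.
TORCH_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
TORCH_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Keras VGG16 default ("caffe" mode): BGR in [0, 255], mean subtraction only.
CAFFE_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess_torchvision(img_uint8):
    """HWC uint8 RGB image -> CHW float32 in the torchvision convention."""
    x = img_uint8.astype(np.float32) / 255.0
    x = (x - TORCH_MEAN) / TORCH_STD
    return np.transpose(x, (2, 0, 1))


def preprocess_keras_caffe(img_uint8):
    """HWC uint8 RGB image -> HWC float32 BGR in the Keras 'caffe' convention."""
    x = img_uint8.astype(np.float32)[..., ::-1]  # RGB -> BGR
    return x - CAFFE_MEAN_BGR


# Random stand-in image; replace with a real face crop when testing.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)

a = preprocess_torchvision(img)
b = preprocess_keras_caffe(img)
print("torchvision range:", float(a.min()), float(a.max()))
print("keras/caffe range:", float(b.min()), float(b.max()))
```

The value ranges differ by roughly two orders of magnitude, so features and loss values from the two pipelines are not directly comparable unless the same preprocessing is applied on both sides.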

Any help with the above would be much appreciated.

ShenYujun commented 4 years ago

You can try extracting VGG features from a fixed input image with both stylegan-encoder and your own PyTorch version to check whether the two give the same output.

Also, does the loss descend normally during the optimization procedure?
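
To answer the loss question with numbers rather than by eye, the recorded loss values can be summarised over fixed windows. The helper below is a minimal sketch in plain Python; `relative_improvement`, `has_plateaued`, the window size, and the tolerance are all hypothetical choices, and it assumes the losses are available as a list of scalar values.

```python
def relative_improvement(losses, window=50):
    """Fractional drop in mean loss between the last two windows of steps."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    if prev == 0:
        return 0.0
    return (prev - last) / abs(prev)


def has_plateaued(losses, window=50, tol=1e-3):
    """True if the mean loss improved by less than `tol` (relative) recently."""
    return relative_improvement(losses, window) < tol


# Synthetic curves standing in for a real optimization run.
still_descending = [1.0 / (1 + i) for i in range(200)]
flat = [0.5] * 200

print(has_plateaued(still_descending))
print(has_plateaued(flat))
```

Calling `has_plateaued(losses)` every few hundred iterations shows how early the optimization stalls, which helps separate a stalled optimizer from a loss that is simply on a different scale.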

njordsir commented 4 years ago

Original:

Learnt and generated with stylegan-encoder:

Learnt and generated with code above:

The loss does reduce but stabilizes early. The comparison above is with SGD optimizer and learning rate = 1. Other optimizers and lr give similar or worse results.

Maybe this has something to do with differences between the optimizer implementations in PyTorch and TensorFlow/Keras, and it is just a matter of finding the right hyperparameters, but I have had no luck so far.
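
Before attributing the gap to the optimizers, it may be worth ruling out a difference in loss scale, since SGD without momentum is the same update in both frameworks. The toy below assumes a purely linear feature extractor (a hypothetical matrix `W`, not VGG) and shows that feeding inputs on a 0-255 scale instead of 0-1 multiplies both the MSE and its gradient by 255 squared. A real VGG has biases and ReLUs, so the factor there is only approximate, and whether the TensorFlow pipeline actually works on the 0-255 scale is an assumption to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))   # hypothetical linear "feature extractor"
x = rng.normal(size=8)         # stand-in for the generated image
ref = rng.normal(size=8)       # stand-in for the reference image


def mse_and_grad(x, ref, scale):
    """MSE between features of scale*x and scale*ref, and its gradient w.r.t. x."""
    diff = W @ (scale * x) - W @ (scale * ref)
    loss = np.mean(diff ** 2)
    # d/dx of mean(diff**2), with diff = scale * W @ (x - ref)
    grad = (2.0 / diff.size) * scale * (W.T @ diff)
    return loss, grad


loss_1, grad_1 = mse_and_grad(x, ref, scale=1.0)
loss_255, grad_255 = mse_and_grad(x, ref, scale=255.0)

loss_ratio = loss_255 / loss_1
grad_ratio = np.linalg.norm(grad_255) / np.linalg.norm(grad_1)
print(loss_ratio, grad_ratio)
```

If the two pipelines differ in this way, a learning rate that works in one will be far too small or too large in the other, which would look like a loss that stabilizes early.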

ShenYujun commented 4 years ago

The loss values in the top and bottom figures are clearly different. Can you test whether the VGG models from the TensorFlow and PyTorch versions give the same response to the same image? I suggest taking this test as the first step of debugging.
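
A minimal sketch of such a test, assuming the features from each framework have already been exported as NumPy arrays (for example with `.cpu().numpy()` on the PyTorch side). The function name `compare_features` and the layout arguments are hypothetical; the transpose accounts for PyTorch using channels-first (NCHW) and TensorFlow/Keras typically using channels-last (NHWC).

```python
import numpy as np


def compare_features(feat_a, feat_b, layout_a="NCHW", layout_b="NHWC"):
    """Compare two 4-D feature maps that may use different axis layouts."""
    a = np.asarray(feat_a, dtype=np.float64)
    b = np.asarray(feat_b, dtype=np.float64)
    # Bring both arrays to NCHW before comparing element-wise.
    if layout_a == "NHWC":
        a = np.transpose(a, (0, 3, 1, 2))
    if layout_b == "NHWC":
        b = np.transpose(b, (0, 3, 1, 2))
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    diff = a - b
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    return {
        "max_abs_diff": float(np.abs(diff).max()),
        "rel_l2_error": float(np.linalg.norm(diff) / (norm_a + 1e-12)),
        "cosine_similarity": float(a.ravel() @ b.ravel() / (norm_a * norm_b + 1e-12)),
    }


# Synthetic stand-ins: identical features stored in the two layouts.
rng = np.random.default_rng(0)
torch_like = rng.normal(size=(1, 256, 4, 4))
keras_like = np.transpose(torch_like, (0, 2, 3, 1))

report = compare_features(torch_like, keras_like)
scaled_report = compare_features(torch_like, 2.0 * keras_like)
print(report)
print(scaled_report)
```

A cosine similarity near 1 with a large relative error points to a pure scale difference, whereas a low cosine similarity suggests different weights, channel order, or preprocessing.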

ShenYujun commented 4 years ago

We will support the inversion function in a future version soon. Closing this issue for now.

Voyz commented 4 years ago

Hi @ShenYujun - is there any indication as to when the inversion function will be made public? We await it with anticipation!

ShenYujun commented 4 years ago

@Voyz Yes, the code will be public for sure. For now, we still have some work in submission, but a more powerful GAN-related toolkit is coming soon!!

Voyz commented 4 years ago

@ShenYujun That's absolutely wonderful news, thanks! Out of interest, would you be able to give an approximate release date?

ShenYujun commented 4 years ago

@Voyz We may release the code in March. Thanks for your interest and patience.

Voyz commented 4 years ago

@ShenYujun Thank you, appreciate the reply. We truly admire your work, massive kudos for what you've achieved so far! Looking forward to seeing more!

genforce / interfacegan

Issue learning latent encoding for new faces #21