Puzer / stylegan-encoder

StyleGAN Encoder - converts real images to latent space

Interpolation between 2 faces in dlatent space not as meaningful as it is in qlatent space #1

Open stas-sl opened 5 years ago

stas-sl commented 5 years ago

Hi!

First, thanks for your work!

I tried to interpolate between 2 faces in the dlatent space (18, 512) and the result seems to be not as meaningful as when interpolating between 2 vectors in the qlatent space (512). It kind of works, but some transient images contain strange artifacts or do not look like a very valid face. Did you notice this effect? It seems like not all points along the linear path in the dlatent space correspond to real faces, though in the qlatent space they do.

Just wondering if it is possible to somehow get latent representations in the original qlatent space to compare interpolation quality.

Puzer commented 5 years ago

> Hi!
>
> First, thanks for your work!
>
> I tried to interpolate between 2 faces in the dlatent space (18, 512) and the result seems to be not as meaningful as when interpolating between 2 vectors in the qlatent space (512). It kind of works, but some transient images contain strange artifacts or do not look like a very valid face. Did you notice this effect? It seems like not all points along the linear path in the dlatent space correspond to real faces, though in the qlatent space they do.

Hi @stas-sl ! Actually I was able to interpolate.

person_a = ...  # (18, 512) dlatent matrix of the first face
person_b = ...  # (18, 512) dlatent matrix of the second face
for c in np.linspace(0, 1, 50):
    # linear blend in dlatent space, from person_b (c=0) to person_a (c=1)
    generate_image(c * person_a + (1 - c) * person_b)
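
Here generate_image is just a placeholder; a minimal sketch of what it could look like, assuming the pre-trained Gs network is already loaded and using the same synthesis.run call that appears later in this thread:

import numpy as np
import PIL.Image
import dnnlib.tflib as tflib

def generate_image(dlatent):
    # 'Gs' is the pre-trained StyleGAN network, assumed to be loaded as in the repo's examples.
    images = Gs.components.synthesis.run(
        np.asarray(dlatent)[None],                  # add a batch dimension: (1, 18, 512)
        randomize_noise=False,
        output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
    return PIL.Image.fromarray(images[0], 'RGB')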

Result: https://giphy.com/gifs/trump-hillary-stylegan-oNPDt7n8KkBlct1SA0

> Just wondering if it is possible to somehow get latent representations in the original qlatent space to compare interpolation quality.

Yep, that's possible and it works but a lot of details are lost in this case.

For now I'm working on a better approach for learning more meaningful latent vectors using some regularization tricks, which are somewhat related to the truncation trick. I'm going to commit it this weekend.

stas-sl commented 5 years ago

I did a couple of experiments to compare interpolation in different spaces.

First, I used random qlatent vectors and the corresponding dlatent vectors obtained via the mapping network.

qlatent1 = np.random.randn(512)[None, :]               # two random points in qlatent (Z) space
qlatent2 = np.random.randn(512)[None, :]
dlatent1 = Gs.components.mapping.run(qlatent1, None)   # corresponding (1, 18, 512) dlatents (W)
dlatent2 = Gs.components.mapping.run(qlatent2, None)

# 50 interpolation steps in each space
qlatents = np.vstack([(1 - i) * qlatent1 + i * qlatent2 for i in np.linspace(0, 1, 50)])
dlatents = np.vstack([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
dqlatents = Gs.components.mapping.run(qlatents, None)  # map the qlatent interpolation through the mapping network

dimages = Gs.components.synthesis.run(dlatents)        # interpolation done directly in dlatent space
dqimages = Gs.components.synthesis.run(dqlatents)      # interpolation done in qlatent space, then mapped
qimages = Gs.run(qlatents, None)                       # single run of the whole network on the qlatent interpolation

1. The first (left) image is dimages, obtained via interpolation in dlatent space (18, 512).
2. The second (middle) image is dqimages, obtained via interpolation in qlatent space (512), then computing the corresponding dlatent matrix for each vector via the mapping network and passing it to the synthesis network.
3. The third (right) image is qimages, obtained via a single run of the whole network while interpolating in qlatent space.

Example 1

Example 2

Obviously there is a difference, especially between image 1 and images 2/3. In the first image (interpolating in dlatent space) the transition seems more straightforward, whereas in images 2/3 you can sometimes get a different person in the middle of the interpolation. I tried different random vectors, and it looks like both ways (interpolating in qlatent or dlatent space) produce quite meaningful faces along the way, though the path may differ.

Another experiment I did was interpolating between dlatents obtained from images via optimization:

dlatent1 = ...  # (18, 512) matrix obtained via optimization from an image
dlatent2 = ...  # (18, 512) matrix obtained via optimization from another image

dlatents = np.array([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
images = Gs.components.synthesis.run(dlatents)

The results:

Example 3

Example 4

Example 5

Of course this is rather subjective and depends on the specific source and target images, and it often produces quite reasonable interpolations, but the examples above seem to me a bit artificial in the middle of the interpolation. It is hard to say whether the reason is interpolating in dlatent space rather than qlatent space, or the way those dlatents were obtained, or maybe I'm just nitpicking :)

JunaidAsghar commented 5 years ago

Hi stas-sl, would you like to share the code for obtaining the matrix via optimization from an image?

Thanks

stas-sl commented 5 years ago

@JunaidAsghar, I actually used the encode_images.py script, as described in the readme.

JunaidAsghar commented 5 years ago

@stas-sl thanks for the quick response. Do you have an idea how to train the perceptual model once, rather than every time for each image?

stas-sl commented 5 years ago

Only what is written here https://www.reddit.com/r/MachineLearning/comments/anzi1t/d_stylegan_but_in_reverse_is_it_possible/

Some say you might try to train an encoder, while others say that it will not work very well.

gradient-dissenter commented 5 years ago

@stas-sl

Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may also work on the mid and low scale features as well, but I haven't tested it yet. It doesn't yield the same detail as @Puzer's awesome input optimization trick, but the model outputs vectors that land safely in the dense parts of the latent space, making interpolations more stable. It performs very well for me in transferring face position from a video in real-time. The detection and alignment bit is actually the performance bottleneck that I'm working on now. Here's a video: https://twitter.com/calamardh/status/1102441840752713729

Maybe this approach could be used alongside input optimization for faster results.

Puzer commented 5 years ago

@gradient-dissenter @stas-sl @tals @sam598 Thanks for your meaningful comments!

My current status:

1. I'm actually playing with training an actual encoder which can predict the dlatent (without the optimization trick) - I have two models for now, ResNet50 and MobileNetV2, which perform fairly similarly.
2. Further improvement of the dlatent optimization - first of all, we can initialize the dlatent using the prediction from the model in 1). Moreover, we can do a more clever trick and use L2 regularization to keep the optimized dlatent vector close to the dlatent predicted in 1). It acts like the truncation trick, but gives more meaningful results (see the sketch after this list).
3. The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).
4. Useful comment from @tals: the dlatents from the mapping network are actually the same for all layers. Now I'm trying to train the encoder from 1) using mixed dlatents - I suppose it can give even better results.
5. I also fixed the memory leak which @sam598 pointed out, thanks!
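
A minimal sketch of the L2-regularized optimization from point 2, with placeholder tensors standing in for the encoder prediction and the perceptual loss (these are assumptions for illustration, not the repo's code):

import numpy as np
import tensorflow as tf

# Placeholders for illustration: 'predicted_dlatent' stands in for the feed-forward
# encoder's output, 'perceptual_loss' for the loss already built in encode_images.py.
predicted_dlatent = np.zeros((1, 18, 512), dtype=np.float32)
perceptual_loss = tf.constant(0.0)

dlatent_init = tf.constant(predicted_dlatent)   # fixed anchor from the encoder
dlatent_var = tf.Variable(predicted_dlatent)    # variable being optimized, initialized at the prediction

reg_weight = 1.0                                # assumed hyperparameter
l2_reg = reg_weight * tf.reduce_mean(tf.square(dlatent_var - dlatent_init))
total_loss = perceptual_loss + l2_reg           # keeps the result close to the prediction, like the truncation trick

train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(total_loss, var_list=[dlatent_var])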

Unfortunately I don't have much time for now, but I expect to polish everything up and publish everything this week.

What could really help, but I don't have the capacity for it right now:

1. Somehow obtaining generated images from lower-resolution lods (256/512) - I expect that it could significantly reduce optimization time.
2. Disentangled latent directions, based on the great TL-GAN research.
3. More meaningful interpolations, based on the "Latent space oddity: on the curvature of deep generative models" research.

tals commented 5 years ago
> 1. I'm actually playing with training an actual encoder which can predict the dlatent (without the optimization trick) - I have two models for now, ResNet50 and MobileNetV2, which perform fairly similarly.

Does this work similarly to the feed-forward style transfer nets? I've been thinking of trying this out, since the optimization-based approach worked well and the problems are similar.

> Disentangled latent directions, based on the great TL-GAN research.

Are you looking at this through the prism of finding the latent of a given picture, or finding "interesting" latent directions (facial hair, gender etc)?

Their general approach is so similar to yours! The disentanglement technique would help with the first use case, but not sure how it would help with the latter.

jcpeterson commented 5 years ago

@Puzer any chance that push is still coming?

kohatkk commented 5 years ago

@stas-sl

> Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may also work on the mid and low scale features as well.

Can you share the modified resnet50 model? I am not able to generate the image-dlatent pairs with a Gaussian distribution.

SimJeg commented 5 years ago

Hi @Puzer, thank you for this great repo! Do you plan to publish the work you mentioned in this thread soon?

@kohatkk here is some code to finetune a ResNet:

import os
import numpy as np
import pickle
import cv2

import dnnlib
import config
import dnnlib.tflib as tflib

from keras.applications.resnet50 import ResNet50
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Dense
from keras.models import Sequential, load_model

def load_Gs():
    tflib.init_tf()
    with dnnlib.util.open_url(config.url_ffhq, cache_dir=config.cache_dir) as f:
        _, _, Gs = pickle.load(f)
    return Gs

def finetune_resnet(save_path, image_size=224, batch_size=10000, test_size=1000, n_epochs=10, max_patience=5, seed=0):
    """
    Finetunes a ResNet to predict W from X.
    Generates batches (X, W) of size 'batch_size', iterates for 'n_epochs' on each, and repeats until 'max_patience'
    is exceeded on the test set. The model is saved every time a new best test loss is reached.
    :param save_path: str, path to save the model. If it already exists, the model will be further finetuned.
    :param image_size: int
    :param batch_size: int
    :param test_size: int
    :param n_epochs: int
    :param max_patience: int
    :param seed: int
    :return: None
    """
    assert image_size >= 224

    # Create a test set
    print('Creating test set')
    np.random.seed(seed)
    W_test, X_test = generate_dataset(n=test_size, image_size=image_size)
    X_test = preprocess_input(X_test.astype('float'))

    # Build model
    if os.path.exists(save_path):
        print('Loading existing model')
        model = load_model(save_path)
    else:
        print('Building model')
        resnet = ResNet50(include_top=False, pooling='avg', input_shape=(image_size, image_size, 3))
        model = Sequential()
        model.add(resnet)
        model.add(Dense(512))
        model.compile(loss='mse', metrics=[], optimizer='adam')

    # Iterate on batches of size batch_size
    print('Training model')
    patience = 0
    best_loss = np.inf
    while (patience <= max_patience):
        W_train, X_train = generate_dataset(n=batch_size, image_size=image_size)  # not optimal, as we reload Gs every time
        X_train = preprocess_input(X_train.astype('float'))
        model.fit(X_train, W_train, epochs=n_epochs, verbose=True)
        loss = model.evaluate(X_test, W_test)
        if loss < best_loss:
            print('New best test loss : {:.5f}'.format(loss))
            model.save(save_path)
            patience = 0
            best_loss = loss
        else:
            patience += 1

if __name__ == '__main__':
    # Finetune the resnet
    finetune_resnet('data/finetuned_resnet.h5', batch_size=10000, test_size=1000, max_patience=3, n_epochs=10)

pbaylies commented 5 years ago

@SimJeg This looks really interesting; could you also post the code for your generate_dataset() function?

SimJeg commented 5 years ago

Here it is!

It's quite quick and dirty, as I reload Gs every time I generate a new batch. But time does not really matter here, as it converges after a few batches (= a few hours). While it works perfectly for generated images, it does not really work for real-world images: the recovered faces are only somewhat similar, but they make a good starting point for optimization.


def generate_dataset(n=10000, save_path=None, seed=None, image_size=224, minibatch_size=8):
    """
    Generates a dataset of 'n' images of shape ('image_size', 'image_size', 3) with random seed 'seed',
    along with their dlatent vectors W of shape ('n', 512).

    These datasets can serve to train an inverse mapping from X to W, as well as to explore the latent space.

    :param n: int
    :param image_size: int
    :param seed: int
    :param save_path: str
    :return: numpy arrays of shape (n, 512) and shape (n, image_size, image_size, 3)
    """

    Gs = load_Gs()

    if seed is not None:
        Z = np.random.RandomState(seed).randn(n, Gs.input_shape[1])
    else:
        Z = np.random.randn(n, Gs.input_shape[1])
    W = Gs.components.mapping.run(Z, None, minibatch_size=minibatch_size)
    X = Gs.components.synthesis.run(W, randomize_noise=False, minibatch_size=minibatch_size, print_progress=True,
                                    output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
    X = np.array([cv2.resize(x, (image_size, image_size)) for x in X])

    if save_path is not None:
        prefix = '_{}_{}'.format(seed, n)
        np.save(os.path.join(save_path, 'W' + prefix), W[:, 0])
        np.save(os.path.join(save_path, 'X' + prefix), X)

    return W[:, 0], X
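
The "reload Gs every time" issue flagged in the comment inside finetune_resnet could be avoided with a small cache around the load_Gs defined above; a minimal sketch:

_Gs_cache = None

def get_Gs():
    # Load the network once and reuse it on subsequent calls.
    global _Gs_cache
    if _Gs_cache is None:
        _Gs_cache = load_Gs()
    return _Gs_cache

generate_dataset would then call get_Gs() instead of load_Gs().
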
pbaylies commented 5 years ago

@SimJeg Thank you very much! Doesn't that take up a lot of memory, generating that many images at once?

jcpeterson commented 5 years ago

@SimJeg

> it does not really work for real-world images

Why do you think that is? Perhaps some random translations of the image by 5-10 pixels before cropping and resizing would help here?

Also, how did you use it as a starting point for optimization? Did you just run the generator.set_dlatents(d_latent) line before optimizing in the encode_images.py script? Can you post the change?

I'm starting to think we should start a fork or new repo at this point so we can all work on improvements at a faster pace. This repo is 3 months old.

@pbaylies I can only fit about 1,250 images into memory at once. A way around this is to load one meta-batch at a time of say 1000 images or so for training, using model.fit(X_train, W_train, epochs=1) in a loop, and evaluating every 10 meta-batches or so.
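
A minimal sketch of that meta-batch loop, reusing generate_dataset, model, X_test and W_test from SimJeg's code above (the 1,000-image meta-batch and the evaluate-every-10 cadence are just the numbers mentioned here, and the total step count is arbitrary):

for step in range(100):                                     # arbitrary number of meta-batches
    W_train, X_train = generate_dataset(n=1000, image_size=224)
    X_train = preprocess_input(X_train.astype('float'))
    model.fit(X_train, W_train, epochs=1, verbose=True)
    if (step + 1) % 10 == 0:                                # evaluate every 10 meta-batches
        print('Test loss: {:.5f}'.format(model.evaluate(X_test, W_test, verbose=0)))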

pbaylies commented 5 years ago

Ok @SimJeg et al., playing with this over in Google Colab, here's what I've come up with so far -- https://drive.google.com/open?id=1bVk6AKchrNr3u9tv3SxsgttXNCspvF01

jcpeterson commented 5 years ago

Update: To answer my questions above, setting generator.set_dlatents(d_latent) indeed works and pixel shifting isn't needed as the approximate encodes work fine with out-of-sample images. Using this method and Adam I can get a decent encode in about 12 seconds.
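
For reference, a minimal sketch of that initialization step, assuming the finetuned ResNet from above and the generator object from encode_images.py; aligned_img (an already-aligned RGB face array) is a hypothetical input:

import cv2
import numpy as np
from keras.models import load_model
from keras.applications.imagenet_utils import preprocess_input

resnet = load_model('data/finetuned_resnet.h5')                  # trained with finetune_resnet above

x = cv2.resize(aligned_img, (224, 224))[None].astype('float')    # same preprocessing as during training
w_pred = resnet.predict(preprocess_input(x))[0]                  # predicted (512,) dlatent vector

dlatent_init = np.tile(w_pred, (18, 1))[None]                    # broadcast to shape (1, 18, 512)
generator.set_dlatents(dlatent_init)                             # then run the usual optimization loop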

pbaylies commented 5 years ago

Ok, I'm happy with the performance of the encoding that I'm getting; it quickly converges to get the basics right, and then incrementally improves after that. Code follows.

ResNet StyleGAN Encoder

Much love to @Puzer and @SimJeg on GitHub for all their hard work on this; see:

https://github.com/Puzer/stylegan-encoder/

https://github.com/Puzer/stylegan-encoder/issues/1#issuecomment-490489772

EDIT: there were a few mistakes in this code, better just to go to my repo at this point, now that I have one: https://github.com/pbaylies/stylegan-encoder

Vinno97 commented 5 years ago

@SimJeg Have you considered using the perceptual loss function of the encoder for your feed-forward network instead of MSE? I expect it to be much slower to train, but it might result in significantly higher image quality.

I'd love to try it myself, but I don't see myself having the time to experiment with it in the near future. That's why I thought I'd share my idea here in case someone else might want to give it a shot.

Edit: "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (https://arxiv.org/abs/1603.08155) explains how this method can be used to create a feed-forward version of Gatys et al.'s famous Neural Style Transfer, which is also basically an optimization problem trying to minimize perceptual loss.

pbaylies commented 5 years ago

I've been playing with improving the encoder by updating the loss function, as well as using a pre-trained ResNet to provide a starting point for the dlatents; I'll see about forking / making a repo soon with my findings. Contributions welcome! One thing I noticed: adding an L1 loss to the dlatents themselves helps a lot, to keep them in roughly the same range as normal faces in the rest of the model.

SimJeg commented 5 years ago

I don't have much time to work on this project, but it's great to know you've made some progress!

To answer a previous question: I noticed that faces recovered using gradient descent have dlatents W of size (18, 512) where the 18 vectors are not that correlated. It makes sense because, as shown in the paper, you can mix these 18 vectors to mix styles.

It would make sense to train a ResNet to predict not just one vector of size 512 but all 18. I made a first try without success...
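
A rough sketch of what that change could look like on top of the finetuning code above: only the head changes, and the training targets become the full W of shape (n, 18, 512) instead of W[:, 0]:

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Reshape
from keras.models import Sequential

# Same backbone as before, but predicting all 18 dlatent vectors at once.
resnet = ResNet50(include_top=False, pooling='avg', input_shape=(224, 224, 3))
model = Sequential([resnet, Dense(18 * 512), Reshape((18, 512))])
model.compile(loss='mse', optimizer='adam')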

Changing the loss from mse(w_true, w_pred) to perceptual_loss(stylegan(w_true), stylegan(w_pred)) seems heavy, but it could be interesting, as perceptual loss has proved to be quite effective!

Good point on the L1 loss too! I don't know if you've had a look at the dlatents distribution, but it looks like density(x) = distribution1 if x < 0 else distribution2, so we could indeed add some prior to match such distributions.


pbaylies commented 5 years ago

Hi @SimJeg -- I've started a fork, see here for the resnet training code! Currently I'm mixing up latent values and also using negative truncation for more balance and variation. Thanks for getting me started down this path!

EDIT: The repo is ready to go now, and I've added a link to a pre-trained resnet model as well: https://github.com/pbaylies/stylegan-encoder

shartoo commented 5 years ago

Hi @pbaylies, have you ever tried to generate higher-resolution images such as 512x512 or 1024x1024? Can I adjust the image size in train_resnet.py from 256 to 512? I tried but failed; this may be caused by restoring the checkpoint from your shared pre-trained model.

I want to edit specific human faces at higher resolution, but the face generated by StyleGAN is mostly not the same as the original face, so I suspect this is caused by the image encoder.

pbaylies commented 5 years ago

Hi @shartoo feel free to raise issues on my repo as well; on the pretrained FFHQ model, images are always generated at 1024 anyhow; you can try training a ResNet from scratch with a different input dimension, that should be fine. In my experience, you can get both quicker and better results by sticking to 256 in the encoder; to do better, you might need a smarter loss function, or you might be running up against the limits of that model in StyleGAN.

jiesonshan commented 4 years ago

@Puzer

> @gradient-dissenter @stas-sl @tals @sam598 Thanks for your meaningful comments!
>
> My current status:
>
> 1. I'm actually playing with training an actual encoder which can predict the dlatent (without the optimization trick) - I have two models for now, ResNet50 and MobileNetV2, which perform fairly similarly.
> 2. Further improvement of the dlatent optimization - first of all, we can initialize the dlatent using the prediction from the model in 1). Moreover, we can do a more clever trick and use L2 regularization to keep the optimized dlatent vector close to the dlatent predicted in 1). It acts like the truncation trick, but gives more meaningful results.
> 3. The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).
> 4. Useful comment from @tals: the dlatents from the mapping network are actually the same for all layers. Now I'm trying to train the encoder from 1) using mixed dlatents - I suppose it can give even better results.
> 5. I also fixed the memory leak which @sam598 pointed out, thanks!
>
> Unfortunately I don't have much time for now, but I expect to polish everything up and publish everything this week.
>
> What could really help, but I don't have the capacity for it right now:
>
> 1. Somehow obtaining generated images from lower-resolution lods (256/512) - I expect that it could significantly reduce optimization time.
> 2. Disentangled latent directions, based on the great TL-GAN research.
> 3. More meaningful interpolations, based on the "Latent space oddity: on the curvature of deep generative models" research.

How do you use LR schedules? Thanks.
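
Not from the repo, but for illustration, one common way to pair Adam with an LR schedule in TF1 is exponential decay over the optimization steps; a sketch with arbitrary decay values (total_loss and dlatent_var stand for the loss and variable of the dlatent optimization):

import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    0.1, global_step, decay_steps=50, decay_rate=0.75, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# train_op = optimizer.minimize(total_loss, var_list=[dlatent_var], global_step=global_step)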