stas-sl opened this issue 5 years ago
Hi!
First, thanks for your work!
I tried to interpolate between 2 faces in the dlatent space (18, 512), and the result seems to be less meaningful than interpolating between 2 vectors in the qlatent space (512). It kind of works, but some intermediate images contain strange artifacts or do not look like valid faces. Did you notice this effect? It seems that not all points along the linear path in the dlatent space correspond to real faces, whereas in the qlatent space they do.
Hi @stas-sl! Actually I was able to interpolate:
import numpy as np

person_a = ...  # dlatent matrix of shape (18, 512)
person_b = ...  # dlatent matrix of shape (18, 512)

for c in np.linspace(0, 1, 50):
    generate_image(c * person_a + (1 - c) * person_b)
Result: https://giphy.com/gifs/trump-hillary-stylegan-oNPDt7n8KkBlct1SA0
Just wondering if it is possible to somehow get latent representations in the original qlatent space, to compare interpolation quality.
Yep, that's possible and it works, but a lot of detail is lost in that case.
For now I'm working on a better approach for learning more meaningful latent vectors by using some regularization tricks, which are related to the truncation trick. I'm going to commit it this weekend.
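For reference, the truncation trick mentioned above pulls a dlatent toward the average dlatent learned during training, which keeps it in a denser, more face-like region of W space. A minimal sketch, assuming Gs is the loaded FFHQ generator used in the snippets below:
import numpy as np

# Assumes Gs is the loaded StyleGAN generator; 'dlatent_avg' is the average dlatent it tracks during training.
dlatent_avg = Gs.get_var('dlatent_avg')  # shape (512,)

def truncate(dlatent, psi=0.7):
    # dlatent: (18, 512); psi=1 leaves it unchanged, psi=0 collapses to the mean face
    return dlatent_avg + psi * (dlatent - dlatent_avg)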
I did a couple of experiments to compare interpolation in different spaces.
First, I used random qlatent vectors and the corresponding dlatent vectors obtained via the mapping network.
qlatent1 = np.random.randn(512)[None, :]
qlatent2 = np.random.randn(512)[None, :]
dlatent1 = Gs.components.mapping.run(qlatent1, None)
dlatent2 = Gs.components.mapping.run(qlatent2, None)
qlatents = np.vstack([(1 - i) * qlatent1 + i * qlatent2 for i in np.linspace(0, 1, 50)])
dlatents = np.vstack([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
dqlatents = Gs.components.mapping.run(qlatents, None)
dimages = Gs.components.synthesis.run(dlatents)
dqimages = Gs.components.synthesis.run(dqlatents)
qimages = Gs.run(qlatents, None)
1) The first (left) image is dimages, obtained via interpolation in the dlatent space (18, 512).
2) The second (middle) image is dqimages, obtained via interpolation in the qlatent space (512), then computing the corresponding dlatent matrix for each vector via the mapping network and passing it to the synthesis network.
3) The third (right) image is qimages, obtained via a single run of the whole network, interpolating in the qlatent space.
Example 1
Example 2
Obviously there is a difference, especially between image 1 and images 2/3. In the first image (interpolating in the dlatent space) the transition seems more direct, while in images 2/3 you can sometimes get some other person in the middle of the interpolation. I tried different random vectors, and it looks like both ways (interpolating in the qlatent or dlatent space) produce quite meaningful faces along the way, though the path may differ.
Another experiment I did was interpolating between dlatents obtained from images via optimization:
dlatent1 = ...  # (18, 512) matrix obtained via optimization from an image
dlatent2 = ...  # (18, 512) matrix obtained via optimization from another image
dlatents = np.array([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
images = Gs.components.synthesis.run(dlatents)
The results:
Example 3
Example 4
Example 5
Of course this is rather subjective and depends on the specific source and target images, and it often produces quite reasonable interpolations, but the examples above seem to me a bit artificial in the middle of the interpolation. Actually it is hard to say whether the reason is interpolating in the dlatent space rather than the qlatent space, or the way those dlatents were obtained, or maybe I'm just nitpicking :)
Hi stas-sl, would you like to share the code for obtaining the matrix via optimization from an image?
Thanks
@JunaidAsghar, I actually used the encode_images.py script, as described in the readme.
@stas-sl thanks for the quick response. Do you have an idea on how to train the perceptual model once, rather than every time for each image?
Only what is written here https://www.reddit.com/r/MachineLearning/comments/anzi1t/d_stylegan_but_in_reverse_is_it_possible/
Some say you might try to train an encoder, while others say that it will not work very well.
@stas-sl
Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may also work on the mid and low scale features as well, but I haven't tested it yet. It doesn't yield the same detail as @Puzer's awesome input optimization trick, but the model outputs vectors that land safely in the dense parts of the latent space, making interpolations more stable. It performs very well for me in transferring face position from a video in real-time. The detection and alignment bit is actually the performance bottleneck that I'm working on now. Here's a video: https://twitter.com/calamardh/status/1102441840752713729
Maybe this approach could be used alongside input optimization for faster results.
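A hedged sketch of what such a feed-forward pipeline could look like, using only building blocks that appear elsewhere in this thread; the file names, the 224 input size, and replacing only the first 4 dlatent rows are assumptions for illustration, not details of the actual model:
import numpy as np
import cv2
import dnnlib.tflib as tflib
from keras.models import load_model
from keras.applications.imagenet_utils import preprocess_input

# Assumes Gs is the loaded StyleGAN generator, as in the other snippets in this thread.
encoder = load_model('face_to_dlatent_resnet.h5')      # hypothetical trained encoder (512-d output)
base_dlatent = np.load('target_person_dlatent.npy')    # hypothetical (18, 512) identity to animate

frame = cv2.imread('aligned_frame.png')[:, :, ::-1]    # aligned face crop, BGR -> RGB
x = preprocess_input(cv2.resize(frame, (224, 224)).astype('float')[None])
w = encoder.predict(x)[0]                              # (512,) predicted dlatent vector

mixed = base_dlatent.copy()
mixed[:4] = w                                          # replace only the coarse ("high-scale") rows
images = Gs.components.synthesis.run(mixed[None], randomize_noise=False,
                                     output_transform=dict(func=tflib.convert_images_to_uint8,
                                                           nchw_to_nhwc=True))
# images[0] is the uint8 RGB frame to display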
@gradient-dissenter @stas-sl @tals @sam598 Thanks for your meaningful comments!
My current status:
1) I'm playing with training an actual encoder which can predict the dlatent directly (without the optimization trick). I have two models for now, ResNet50 and MobileNetV2, which perform relatively similarly.
2) Further improvement of the dlatent optimization: first of all, we can initialize the dlatent with the prediction from the model in 1). Moreover, we can do a more clever trick and use L2 regularization to keep the optimized dlatent vector close to the dlatent predicted in 1). It acts like the truncation trick, but gives more meaningful results (sketched below).
3) The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).
4) Useful comment from @tals: the dlatents produced by the mapping network are actually the same for all layers. Now I'm trying to train the encoder from 1) using mixed dlatents instead - I suppose it can give even better results.
5) I also fixed the memory leak issue which @sam598 pointed out, thanks!
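A rough sketch of what points 2) and 3) amount to; the stand-in tensors below take the place of the pieces that live in the encoder code (the VGG perceptual loss and the dlatent variable), and the weight and learning rate are placeholder values, not the actual settings:
import numpy as np
import tensorflow as tf

dlatent_pred = np.zeros((1, 18, 512), dtype=np.float32)  # encoder prediction for this image (stand-in)
dlatent_var = tf.Variable(dlatent_pred)                  # initialize the optimized dlatent at the prediction
perceptual_loss = tf.constant(0.0)                       # stand-in for the VGG-based image loss

l2_weight = 1.0                                          # placeholder value
l2_reg = tf.reduce_sum(tf.square(dlatent_var - dlatent_pred))  # keep the solution near the prediction
total_loss = perceptual_loss + l2_weight * l2_reg

train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(total_loss, var_list=[dlatent_var])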
Unfortunately I don't have much time for now, but I expect to polish everything up and publish everything this week.
What could really help, but I don't have the capacity for right now:
1) Somehow obtaining generated images from lower-resolution lods (256/512) - I expect this could significantly reduce optimization time.
2) Disentangled latent directions, based on the great TL-GAN research.
3) More meaningful interpolations, based on the "Latent space oddity: on the curvature of deep generative models" research.
- I'm playing with training an actual encoder which can predict the dlatent directly (without the optimization trick). I have two models for now, ResNet50 and MobileNetV2, which perform relatively similarly.
Does this work similarly to the feed-forward style transfer nets? I've been thinking of trying this out, since the optimization-based approach worked well and the problems are similar.
Disentangled latent directions, based on the great TL-GAN research
Are you looking at this through the prism of finding the latent of a given picture, or finding "interesting" latent directions (facial hair, gender etc)?
Their general approach is so similar to yours! The disentanglement technique would help with the first use case, but not sure how it would help with the latter.
@Puzer any chance that push is still coming?
@stas-sl
Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may also work on the mid and low scale features as well.
Can you share the model of the modified ResNet50? I am not able to generate the image-dlatent pairs with a Gaussian distribution.
Hi @Puzer, thank you for this great repo! Do you plan to publish the work you mentioned in this thread soon?
@kohatkk here is some code to fine-tune a ResNet:
import os
import numpy as np
import pickle
import cv2
import dnnlib
import config
import dnnlib.tflib as tflib
from keras.applications.resnet50 import ResNet50
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Dense
from keras.models import Sequential, load_model


def load_Gs():
    tflib.init_tf()
    with dnnlib.util.open_url(config.url_ffhq, cache_dir=config.cache_dir) as f:
        _, _, Gs = pickle.load(f)
    return Gs


def finetune_resnet(save_path, image_size=224, batch_size=10000, test_size=1000, n_epochs=10, max_patience=5, seed=0):
    """
    Finetunes a ResNet to predict W from X.
    Generates batches (X, W) of size 'batch_size', iterates 'n_epochs', and repeats until 'max_patience' is reached
    on the test set. The model is saved every time a new best test loss is reached.
    :param save_path: str, path to save the model. If it already exists, the model will be finetuned.
    :param image_size: int
    :param batch_size: int
    :param test_size: int
    :param n_epochs: int
    :param max_patience: int
    :param seed: int
    :return: None
    """
    assert image_size >= 224

    # Create a test set
    print('Creating test set')
    np.random.seed(seed)
    W_test, X_test = generate_dataset(n=test_size, image_size=image_size)
    X_test = preprocess_input(X_test.astype('float'))

    # Build model
    if os.path.exists(save_path):
        print('Loading existing model')
        model = load_model(save_path)
    else:
        print('Building model')
        resnet = ResNet50(include_top=False, pooling='avg', input_shape=(image_size, image_size, 3))
        model = Sequential()
        model.add(resnet)
        model.add(Dense(512))
        model.compile(loss='mse', metrics=[], optimizer='adam')

    # Iterate on batches of size batch_size
    print('Training model')
    patience = 0
    best_loss = np.inf

    while patience <= max_patience:
        W_train, X_train = generate_dataset(batch_size, image_size=image_size)  # Not optimal as we reload Gs every time
        X_train = preprocess_input(X_train.astype('float'))
        model.fit(X_train, W_train, epochs=n_epochs, verbose=True)
        loss = model.evaluate(X_test, W_test)
        if loss < best_loss:
            print('New best test loss : {:.5f}'.format(loss))
            model.save(save_path)
            patience = 0
            best_loss = loss
        else:
            patience += 1


if __name__ == '__main__':
    # Finetune the resnet
    finetune_resnet('data/finetuned_resnet.h5', batch_size=10000, test_size=1000, max_patience=3, n_epochs=10)
@SimJeg This looks really interesting; could you also post the code for your generate_dataset() function?
Here it is!
It's quite quick and dirty, as I reload Gs every time I generate a new batch, but time doesn't really matter here since it converges after a few batches (= a few hours). While it works perfectly for generated images, it doesn't really work for real-world images: the faces it produces are only somewhat similar, but they are a good starting point for optimization.
def generate_dataset(n=10000, save_path=None, seed=None, image_size=224, minibatch_size=8):
    """
    Generates a dataset of 'n' images of shape ('image_size', 'image_size', 3) with random seed 'seed',
    along with their dlatent vectors W of shape ('n', 512).
    These datasets can serve to train an inverse mapping from X to W as well as to explore the latent space.
    :param n: int
    :param save_path: str
    :param seed: int
    :param image_size: int
    :param minibatch_size: int
    :return: numpy arrays of shape (n, 512) and shape (n, image_size, image_size, 3)
    """
    Gs = load_Gs()

    if seed is not None:
        Z = np.random.RandomState(seed).randn(n, Gs.input_shape[1])
    else:
        Z = np.random.randn(n, Gs.input_shape[1])

    W = Gs.components.mapping.run(Z, None, minibatch_size=minibatch_size)
    X = Gs.components.synthesis.run(W, randomize_noise=False, minibatch_size=minibatch_size, print_progress=True,
                                    output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
    X = np.array([cv2.resize(x, (image_size, image_size)) for x in X])

    if save_path is not None:
        prefix = '_{}_{}'.format(seed, n)
        np.save(os.path.join(save_path, 'W' + prefix), W[:, 0])  # the 18 rows from the mapping network are identical, keep one
        np.save(os.path.join(save_path, 'X' + prefix), X)

    return W[:, 0], X
@SimJeg Thank you very much! Doesn't that take up a lot of memory, generating that many images at once?
@SimJeg
it does not really work for real world images
Why do you think that is? Perhaps some random translations of the image by 5-10 pixels before cropping and resizing would help here?
Also, how did you use it as a starting point for optimization? Did you just run the generator.set_dlatents(d_latent) line before optimizing in the encode_images.py script? Can you post the change?
I'm starting to think we should start a fork or new repo at this point so we can all work on improvements at a faster pace. This repo is 3 months old.
@pbaylies I can only fit about 1,250 images into memory at once. A way around this is to load one meta-batch at a time of say 1000 images or so for training, using model.fit(X_train, W_train, epochs=1) in a loop, and evaluating every 10 meta-batches or so.
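A small sketch of that meta-batch loop, reusing generate_dataset and the Keras model built in the fine-tuning code above (the 100 meta-batches, the test-set seed, and the 224 input size are arbitrary illustration choices):
from keras.applications.imagenet_utils import preprocess_input

# Assumes 'model' and 'generate_dataset' from the fine-tuning code earlier in this thread.
W_test, X_test = generate_dataset(n=1000, image_size=224, seed=0)
X_test = preprocess_input(X_test.astype('float'))

for step in range(100):                                  # arbitrary number of meta-batches
    W_train, X_train = generate_dataset(n=1000, image_size=224)
    model.fit(preprocess_input(X_train.astype('float')), W_train, epochs=1)
    if (step + 1) % 10 == 0:                             # evaluate every 10 meta-batches or so
        print('test mse:', model.evaluate(X_test, W_test))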
Ok @SimJeg et al., playing with this over in Google Colab, here's what I've come up with so far -- https://drive.google.com/open?id=1bVk6AKchrNr3u9tv3SxsgttXNCspvF01
Update: To answer my questions above, setting generator.set_dlatents(d_latent) indeed works, and pixel shifting isn't needed, as the approximate encodes work fine with out-of-sample images. Using this method and Adam I can get a decent encode in about 12 seconds.
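A minimal sketch of using the fine-tuned ResNet's prediction as the starting dlatent; the tiling to 18 rows and the file names are assumptions, and generator.set_dlatents is the call discussed above rather than code shown here:
import numpy as np
import cv2
from keras.models import load_model
from keras.applications.imagenet_utils import preprocess_input

resnet = load_model('data/finetuned_resnet.h5')           # from the fine-tuning code above

img = cv2.imread('aligned_face.png')[:, :, ::-1]          # aligned face crop, BGR -> RGB
x = preprocess_input(cv2.resize(img, (224, 224)).astype('float')[None])
w = resnet.predict(x)                                     # (1, 512)
d_latent = np.tile(w[:, np.newaxis, :], (1, 18, 1))       # broadcast to (1, 18, 512)

# ...then, before the optimization loop in encode_images.py:
# generator.set_dlatents(d_latent)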
Ok, I'm happy with the performance of the encoding that I'm getting; it quickly converges to get the basics right, and then incrementally improves after that. Code follows.
ResNet StyleGAN Encoder
Much love to @Puzer and @SimJeg on GitHub for all their hard work on this; see:
https://github.com/Puzer/stylegan-encoder/
https://github.com/Puzer/stylegan-encoder/issues/1#issuecomment-490489772
EDIT: there were a few mistakes in this code, better just to go to my repo at this point, now that I have one: https://github.com/pbaylies/stylegan-encoder
@SimJeg Have you considered using the perceptual loss function of the encoder for your feed-forward network instead of MSE? I expect it to be much slower to train, but it might result in significantly higher image quality.
I'd love to try it myself, but I don't see myself having the time to experiment with it in the near future. That's why I thought I'd share my idea here in case someone else might want to give it a shot.
Edit: "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (https://arxiv.org/abs/1603.08155) explains how this method can be used to create a feed-forward version of Gatys et al.'s famous Neural Style Transfer, which is also basically an optimization problem trying to minimize perceptual loss.
I've been playing with improving the encoder by updating the loss function, as well as using a pre-trained ResNet to provide a starting point for the dlatents; I'll see about forking / making a repo soon with my findings. Contributions welcome! One thing I noticed: adding an L1 loss on the dlatents themselves helps a lot, keeping them in roughly the same range as normal faces in the rest of the model.
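One plausible reading of that L1 term, sketched in the same stand-in style as the regularized loss earlier in the thread; pulling toward the model's average dlatent is an assumption about what "the same range as normal faces" means, not necessarily the exact implementation:
import numpy as np
import tensorflow as tf

dlatent_var = tf.Variable(np.zeros((1, 18, 512), dtype=np.float32))  # dlatents being optimized
dlatent_avg = np.zeros((512,), dtype=np.float32)                     # e.g. Gs.get_var('dlatent_avg') (assumption)
perceptual_loss = tf.constant(0.0)                                   # stand-in for the image loss

l1_weight = 0.1                                                      # placeholder value
l1_penalty = tf.reduce_mean(tf.abs(dlatent_var - dlatent_avg))       # keep dlatents in a typical range
total_loss = perceptual_loss + l1_weight * l1_penalty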
I don't have much time to work on this project, but it's great to know you've made some progress!
To answer a previous question: I noticed that faces recovered using gradient descent have dlatents W of size (18, 512) where the 18 vectors are not that strongly correlated. It makes sense, because as shown in the paper you can mix these 18 vectors to mix styles.
It would make sense to train a ResNet to predict not just one vector of size 512 but all 18. I made a first attempt without success...
Changing the loss from mse(w_true, w_pred) to perceptual_loss(stylegan(w_true), stylegan(w_pred)) seems heavy but could be interesting, as perceptual loss has proved to be quite efficient!
Good point on the L1 loss too! I don't know if you have looked at the dlatents distribution, but it looks like density(x) = distribution1 if x < 0 else distribution2, so we could indeed add some prior to match such distributions.
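For reference, a minimal sketch of the "predict all 18 vectors" variant on top of the fine-tuning code above; the extra Dense/Reshape head is a guess at one way to do it, not the attempt described here, and the training targets would then be the full W instead of W[:, 0] in generate_dataset:
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Reshape
from keras.models import Sequential

resnet = ResNet50(include_top=False, pooling='avg', input_shape=(224, 224, 3))
model = Sequential()
model.add(resnet)
model.add(Dense(18 * 512))
model.add(Reshape((18, 512)))                 # predict the full dlatent matrix per image
model.compile(loss='mse', optimizer='adam')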
Hi @SimJeg -- I've started a fork, see here for the resnet training code! Currently I'm mixing up latent values and also using negative truncation for more balance and variation. Thanks for getting me started down this path!
EDIT: The repo is ready to go now, and I've added a link to a pre-trained resnet model as well: https://github.com/pbaylies/stylegan-encoder
Hi @pbaylies, have you ever tried to generate higher-resolution images such as 512x512 or 1024x1024? Can I adjust the image size in train_resnet.py from 256 to 512? I tried but failed; this may be caused by restoring the checkpoint from your shared pre-trained model.
I want to edit specific human faces at higher resolution, but the face generated by StyleGAN is mostly not the same as the original one, so I suspect this is caused by the image encoder.
Hi @shartoo feel free to raise issues on my repo as well; on the pretrained FFHQ model, images are always generated at 1024 anyhow; you can try training a ResNet from scratch with a different input dimension, that should be fine. In my experience, you can get both quicker and better results by sticking to 256 in the encoder; to do better, you might need a smarter loss function, or you might be running up against the limits of that model in StyleGAN.
@Puzer
- The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).
How do you use the LR schedules? Thanks!