eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License
3.2k stars 570 forks source link
cvpr2021 generative-adversarial-network image-translation pixel2style2pixel psp-framework psp-model stylegan stylegan-encoder

Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation


We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. Next, we propose utilizing our encoder to directly solve image-to-image translation tasks, defining them as encoding problems from some input domain into the latent domain. By deviating from the standard "invert first, edit later" methodology used with previous StyleGAN encoders, our approach can handle a variety of tasks even when the input image is not represented in the StyleGAN domain. We show that solving translation tasks through StyleGAN significantly simplifies the training process, as no adversary is required, has better support for solving tasks without pixel-to-pixel correspondence, and inherently supports multi-modal synthesis via the resampling of styles. Finally, we demonstrate the potential of our framework on a variety of facial image-to-image translation tasks, even when compared to state-of-the-art solutions designed specifically for a single task, and further show that it can be extended beyond the human facial domain.


The proposed pixel2style2pixel framework can be used to solve a wide variety of image-to-image translation tasks. Here we show results of pSp on StyleGAN inversion, multi-modal conditional image synthesis, facial frontalization, inpainting and super-resolution.

Description

Official Implementation of our pSp paper for both training and evaluation. The pSp method extends the StyleGAN model to allow solving different image-to-image translation problems using its encoder.

Table of Contents

Recent Updates

2020.10.04: Initial code release
2020.10.06: Add pSp toonify model (Thanks to the great work from Doron Adler and Justin Pinkney)!
2021.04.23: Added several new features:

2021.07.06: Added support for training with Weights & Biases. See below for details.

Applications

StyleGAN Encoding

Here, we use pSp to find the latent code of real images in the latent domain of a pretrained StyleGAN generator.

Face Frontalization

In this application we want to generate a front-facing face from a given input image.

Conditional Image Synthesis

Here we wish to generate photo-realistic face images from ambiguous sketch images or segmentation maps. Using style-mixing, we inherently support multi-modal synthesis for a single input.

Super Resolution

Given a low-resolution input image, we generate a corresponding high-resolution image. As this too is an ambiguous task, we can use style-mixing to produce several plausible results.

Getting Started

Prerequisites

Installation

Inference Notebook

To help visualize the pSp framework on multiple tasks and to help you get started, we provide a Jupyter notebook found in notebooks/inference_playground.ipynb that allows one to visualize the various applications of pSp.
The notebook will download the necessary pretrained models and run inference on the images found in notebooks/images.
For the tasks of conditional image synthesis and super resolution, the notebook also demonstrates pSp's ability to perform multi-modal synthesis using style-mixing.

Pretrained Models

Please download the pre-trained models from the following links. Each pSp model contains the entire pSp architecture, including the encoder and decoder weights. Path Description
StyleGAN Inversion pSp trained with the FFHQ dataset for StyleGAN inversion.
Face Frontalization pSp trained with the FFHQ dataset for face frontalization.
Sketch to Image pSp trained with the CelebA-HQ dataset for image synthesis from sketches.
Segmentation to Image pSp trained with the CelebAMask-HQ dataset for image synthesis from segmentation maps.
Super Resolution pSp trained with the CelebA-HQ dataset for super resolution (up to x32 down-sampling).
Toonify pSp trained with the FFHQ dataset for toonification using StyleGAN generator from Doron Adler and Justin Pinkney.

If you wish to use one of the pretrained models for training or inference, you may do so using the flag --checkpoint_path.

In addition, we provide various auxiliary models needed for training your own pSp model from scratch as well as pretrained models needed for computing the ID metrics reported in the paper. Path Description
FFHQ StyleGAN StyleGAN model pretrained on FFHQ taken from rosinality with 1024x1024 output resolution.
IR-SE50 Model Pretrained IR-SE50 model taken from TreB1eN for use in our ID loss during pSp training.
MoCo ResNet-50 Pretrained ResNet-50 model trained using MOCOv2 for computing MoCo-based similarity loss on non-facial domains. The model is taken from the official implementation.
CurricularFace Backbone Pretrained CurricularFace model taken from HuangYG123 for use in ID similarity metric computation.
MTCNN Weights for MTCNN model taken from TreB1eN for use in ID similarity metric computation. (Unpack the tar.gz to extract the 3 model weights.)

By default, we assume that all auxiliary models are downloaded and saved to the directory pretrained_models. However, you may use your own paths by changing the necessary values in configs/path_configs.py.

Training

Preparing your Data

As an example, assume we wish to run encoding using ffhq (dataset_type=ffhq_encode). We first go to configs/paths_config.py and define:

dataset_paths = {
    'ffhq': '/path/to/ffhq/images256x256'
    'celeba_test': '/path/to/CelebAMask-HQ/test_img',
}

The transforms for the experiment are defined in the class EncodeTransforms in configs/transforms_config.py.
Finally, in configs/data_configs.py, we define:

DATASETS = {
   'ffhq_encode': {
        'transforms': transforms_config.EncodeTransforms,
        'train_source_root': dataset_paths['ffhq'],
        'train_target_root': dataset_paths['ffhq'],
        'test_source_root': dataset_paths['celeba_test'],
        'test_target_root': dataset_paths['celeba_test'],
    },
}

When defining our datasets, we will take the values in the above dictionary.

Training pSp

The main training script can be found in scripts/train.py.
Intermediate training results are saved to opts.exp_dir. This includes checkpoints, train outputs, and test outputs.
Additionally, if you have tensorboard installed, you can visualize tensorboard logs in opts.exp_dir/logs.

Training the pSp Encoder

python scripts/train.py \
--dataset_type=ffhq_encode \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=0.1

Frontalization

python scripts/train.py \
--dataset_type=ffhq_frontalize \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.08 \
--l2_lambda=0.001 \
--lpips_lambda_crop=0.8 \
--l2_lambda_crop=0.01 \
--id_lambda=1 \
--w_norm_lambda=0.005

Sketch to Face

python scripts/train.py \
--dataset_type=celebs_sketch_to_face \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=0 \
--w_norm_lambda=0.005 \
--label_nc=1 \
--input_nc=1

Segmentation Map to Face

python scripts/train.py \
--dataset_type=celebs_seg_to_face \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=0 \
--w_norm_lambda=0.005 \
--label_nc=19 \
--input_nc=19

Notice with conditional image synthesis no identity loss is utilized (i.e. --id_lambda=0)

Super Resolution

python scripts/train.py \
--dataset_type=celebs_super_resolution \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=0.1 \
--w_norm_lambda=0.005 \
--resize_factors=1,2,4,8,16,32

Additional Notes

Identity/Similarity Losses
In pSp, we introduce a facial identity loss using a pre-trained ArcFace network for facial recognition. When operating on the human facial domain, we highly recommend employing this loss objective by using the flag --id_lambda.
In a more recent paper, encoder4editing, the authors generalize this identity loss to other domains by using a MoCo-based ResNet to extract features instead of an ArcFace network. Applying this MoCo-based similarity loss can be done by using the flag --moco_lambda. We recommend setting --moco_lambda=0.5 in your experiments.
Please note, you cannot set both id_lambda and moco_lambda to be active simultaneously (e.g., to use the MoCo-based loss, you should specify, --moco_lambda=0.5 --id_lambda=0).

Weights & Biases Integration

To help track your experiments, we've integrated Weights & Biases into our training process. To enable Weights & Biases (wandb), first make an account on the platform's webpage and install wandb using pip install wandb. Then, to train pSp using wandb, simply add the flag --use_wandb.

Note that when running for the first time, you will be asked to provide your access key which can be accessed via the Weights & Biases platform.

Using Weights & Biases will allow you to visualize the training and testing loss curves as well as intermediate training results.

Testing

Inference

Having trained your model, you can use scripts/inference.py to apply the model on a set of images.
For example,

python scripts/inference.py \
--exp_dir=/path/to/experiment \
--checkpoint_path=experiment/checkpoints/best_model.pt \
--data_path=/path/to/test_data \
--test_batch_size=4 \
--test_workers=4 \
--couple_outputs

Additional notes to consider:

Multi-Modal Synthesis with Style-Mixing

Given a trained model for conditional image synthesis or super-resolution, we can easily generate multiple outputs for a given input image. This can be done using the script scripts/style_mixing.py.
For example, running the following command will perform style-mixing for a segmentation-to-image experiment:

python scripts/style_mixing.py \
--exp_dir=/path/to/experiment \
--checkpoint_path=/path/to/experiment/checkpoints/best_model.pt \
--data_path=/path/to/test_data/ \
--test_batch_size=4 \
--test_workers=4 \
--n_images=25 \
--n_outputs_to_generate=5 \
--latent_mask=8,9,10,11,12,13,14,15,16,17

Here, we inject 5 randomly drawn vectors and perform style-mixing on the latents [8,9,10,11,12,13,14,15,16,17].

Additional notes to consider:

Computing Metrics

Similarly, given a trained model and generated outputs, we can compute the loss metrics on a given dataset.
These scripts receive the inference output directory and ground truth directory.

Additional Applications

To better show the flexibility of our pSp framework we present additional applications below.

As with our main applications, you may download the pretrained models here: Path Description
Toonify pSp trained with the FFHQ dataset for toonification using StyleGAN generator from Doron Adler and Justin Pinkney.

Toonify

Using the toonify StyleGAN built by Doron Adler and Justin Pinkney, we take a real face image and generate a toonified version of the given image. We train the pSp encoder to directly reconstruct real face images inside the toons latent space resulting in a projection of each image to the closest toon. We do so without requiring any labeled pairs or distillation!

This is trained exactly like the StyleGAN inversion task with several changes:

We obtain the best results after around 6000 iterations of training (can be set using --max_steps)

Repository structure

Path Description
pixel2style2pixel Repository root folder
├  configs Folder containing configs defining model/data paths and data transforms
├  criteria Folder containing various loss criterias for training
├  datasets Folder with various dataset objects and augmentations
├  environment Folder containing Anaconda environment used in our experiments
├ models Folder containting all the models and training objects
│  ├  encoders Folder containing our pSp encoder architecture implementation and ArcFace encoder implementation from TreB1eN
│  ├  mtcnn MTCNN implementation from TreB1eN
│  ├  stylegan2 StyleGAN2 model from rosinality
│  └  psp.py Implementation of our pSp framework
├  notebook Folder with jupyter notebook containing pSp inference playground
├  options Folder with training and test command-line options
├  scripts Folder with running scripts for training and inference
├  training Folder with main training logic and Ranger implementation from lessw2020
├  utils Folder with various utility functions

TODOs

Credits

StyleGAN2 implementation:
https://github.com/rosinality/stylegan2-pytorch
Copyright (c) 2019 Kim Seonghyeon
License (MIT) https://github.com/rosinality/stylegan2-pytorch/blob/master/LICENSE

MTCNN, IR-SE50, and ArcFace models and implementations:
https://github.com/TreB1eN/InsightFace_Pytorch
Copyright (c) 2018 TreB1eN
License (MIT) https://github.com/TreB1eN/InsightFace_Pytorch/blob/master/LICENSE

CurricularFace model and implementation:
https://github.com/HuangYG123/CurricularFace
Copyright (c) 2020 HuangYG123
License (MIT) https://github.com/HuangYG123/CurricularFace/blob/master/LICENSE

Ranger optimizer implementation:
https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
License (Apache License 2.0) https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer/blob/master/LICENSE

LPIPS implementation:
https://github.com/S-aiueo32/lpips-pytorch
Copyright (c) 2020, Sou Uchida
License (BSD 2-Clause) https://github.com/S-aiueo32/lpips-pytorch/blob/master/LICENSE

Please Note: The CUDA files under the StyleGAN2 ops directory are made available under the Nvidia Source Code License-NC

Inspired by pSp

Below are several works inspired by pSp that we found particularly interesting:

Reverse Toonification
Using our pSp encoder, artist Nathan Shipley transformed animated figures and paintings into real life. Check out his amazing work on his twitter page and website.

Deploying pSp with StyleSpace for Editing
Awesome work from Justin Pinkney who deployed our pSp model on Runway and provided support for editing the resulting inversions using the StyleSpace Analysis paper. Check out his repository here.

Encoder4Editing (e4e)
Building on the work of pSp, Tov et al. design an encoder to enable high quality edits on real images. Check out their paper and code.

Style-based Age Manipulation (SAM)
Leveraging pSp and the rich semantics of StyleGAN, SAM learns non-linear latent space paths for modeling the age transformation of real face images. Check out the project page here.

ReStyle
ReStyle builds on recent encoders such as pSp and e4e by introducing an iterative refinment mechanism to gradually improve the inversion of real images. Check out the project page here.

pSp in the Media

Citation

If you use this code for your research, please cite our paper Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation:

@InProceedings{richardson2021encoding,
      author = {Richardson, Elad and Alaluf, Yuval and Patashnik, Or and Nitzan, Yotam and Azar, Yaniv and Shapiro, Stav and Cohen-Or, Daniel},
      title = {Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation},
      booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2021}
}