IIGROUP / TediGAN

[CVPR 2021] Pytorch implementation for TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
https://arxiv.org/abs/2012.03308
MIT License

How to generate images from the given description? #5

Closed ozamanan closed 3 years ago

ozamanan commented 3 years ago

Hello, I have a question: how do we generate images directly from text descriptions? I ran the invert_v2.py code, and it seems to manipulate an input image instead.

weihaox commented 3 years ago

The given image is first mapped into the latent space of a pretrained StyleGAN model to obtain its latent code. If the latent code is not obtained from a given image but instead sampled randomly (from a normal distribution), you can generate images directly from text descriptions. Specifically, this line: init_z = self.get_init_code(image) should be replaced with a randomly sampled vector of the same size.
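For example, a minimal sketch of that replacement (the (1, 14, 512) W+ shape comes from the snippet later in this thread; match it to whatever get_init_code returns for your model):

import numpy as np

# hedged sketch: draw a random code from a normal distribution instead of
# inverting a given image; the shape must match get_init_code's output
init_z = np.random.randn(1, 14, 512).astype(np.float32)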

I have been occupied these days and will add the corresponding codes soon. I would appreciate it if you could help with this.

ozamanan commented 3 years ago

I followed the step you mentioned, but I am getting the following image as a result. I ran the model for 1000 iterations with the description "This man is smiling. He is young." and the same learning rate. [result image: 142_inv]

weihaox commented 3 years ago

Hi @ozamanan, thank you for sharing the result.

I ran the model with the description "This man is smiling. He is young" and the default setting (200 iterations). The result is shown below. [image] The result after 1000 iterations is: [image] Both of these results are quite creepy. I suspected the problem might be the CLIP loss, so I changed loss_weight_clip from 2.0 to 1.0 and got the following result. [image] I hadn't carefully chosen the value of loss_weight_clip before; setting it to 2.0 seems a bit large. Adjusting loss_weight_clip is likely more helpful for generating meaningful results than increasing the number of iterations.
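For reference, a schematic sketch of how this weight enters the optimization objective (loss_pix, loss_feat, and loss_weight_feat are placeholder names; the exact terms in invert_v2.py may differ):

# schematic objective: lowering loss_weight_clip weakens the text-matching
# term relative to the image-reconstruction terms
loss = loss_pix + loss_weight_feat * loss_feat + loss_weight_clip * loss_clip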

Besides, CLIP's training data has not been released, so we do not know whether it covers all the desired face attributes. The results are not always satisfactory when generating or editing images from certain descriptions.

ozamanan commented 3 years ago

Yes, I got those results as well, but they take the given example image and manipulate it. I was asking about generating images directly from text descriptions, which is where I got the result above. Did I miss something?

weihaox commented 3 years ago

Sorry for my misunderstanding... Could you share your code?

Update: I can't tell where the problem is without seeing the code. You can refer to this function to see how to randomly sample latent codes for StyleGAN.

Update: for quick generation, replace these lines:

# original: invert the given image to obtain its initial latent code
x = image[np.newaxis]
x = self.G.to_tensor(x.astype(np.float32))
x.requires_grad = False
init_z = self.get_init_code(image)
z = torch.Tensor(init_z).to(self.run_device)
z.requires_grad = True

with the following code:

# sample a random latent code in W+ space instead of inverting an image
init_z = self.G.sample(1, latent_space_type='wp', z_space_dim=512, num_layers=14)
init_z = self.G.preprocess(init_z, latent_space_type='wp')
z = torch.Tensor(init_z).to(self.run_device)
z.requires_grad = True

# synthesize the image corresponding to the sampled code
x = self.G._synthesize(init_z, latent_space_type='wp')['image']
x = torch.Tensor(x).to(self.run_device)

Below is the result I obtained for "This man is smiling. He is young". [image]

The key idea is to use randomly sampled latent codes and their corresponding images in place of the inverted codes and given images. These randomly sampled codes may contain attributes quite different from the given description, which can lead to unsatisfactory results after optimization. I will think about strategies other than random sampling to make the generation process more stable and will update the repository when I am available.
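For instance, a hypothetical sketch of one such strategy: sample several candidate codes and start optimization from the one whose image already best matches the description according to CLIP. Here best_of_n_init and num_candidates are made-up names, the generator interface follows the snippet above, and the synthesized image is assumed to come back as a uint8 HWC array:

import torch
import clip
from PIL import Image

def best_of_n_init(G, text, num_candidates=8, device='cuda'):
    # hypothetical: score several random W+ codes with CLIP, keep the best
    model, preprocess = clip.load('ViT-B/32', device=device)
    tokens = clip.tokenize([text]).to(device)
    best_z, best_score = None, float('-inf')
    for _ in range(num_candidates):
        z = G.sample(1, latent_space_type='wp', z_space_dim=512, num_layers=14)
        z = G.preprocess(z, latent_space_type='wp')
        img = G._synthesize(z, latent_space_type='wp')['image']
        image_input = preprocess(Image.fromarray(img[0])).unsqueeze(0).to(device)
        with torch.no_grad():
            logits_per_image, _ = model(image_input, tokens)
        if logits_per_image.item() > best_score:
            best_z, best_score = z, logits_per_image.item()
    return best_z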

ozamanan commented 3 years ago

Thank you very much. This worked perfectly. Now I am wondering how you would generate 1024x1024 images. The readme lists a pretrained model stylegan_ffhq_1024.pth but does not have a styleganinv_ffhq1024_encoder.pth or styleganinv_ffhq1024_generator.pth.

weihaox commented 3 years ago

The basic idea is to invert a given image into the latent space of a StyleGAN model (there are three directions for this; you can refer to our GAN inversion survey and a curated list of inversion papers).

For 1024x1024 resolution, you can try other models pretrained on 1024 FFHQ, such as StyleGAN2-ADA. Its official repo provides a projector.py that directly obtains the inverted latent code of an image, so you don't need an encoder in that case.
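For example, projection with their script looks roughly like this (flag names follow the stylegan2-ada-pytorch README; the target image and network checkpoint paths are placeholders):

python projector.py --outdir=out --target=mytarget.png --network=ffhq.pkl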

Since the question of how to generate images from a given description has been solved, please close the issue. You can open another issue for a detailed discussion of 1024x1024 image generation if you want.