eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

are these tensor shape correct ? #218

Closed vicentowang closed 2 years ago

vicentowang commented 2 years ago

[screenshot] The latent code shape is a bit different from the paper. Why does y_hat have a batch size of 4?

yuval-alaluf commented 2 years ago

If you are using a generator with an output resolution of 512, then the latent code should have 16 entries, so that looks fine. In our paper we used an output resolution of 1024, which uses 18 entries.
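
A quick sketch of that layer count (standard StyleGAN arithmetic, not a quote of the repo's code): an output resolution R uses 2 * log2(R) - 2 style entries.

```python
import math

# Number of style vectors (W+ entries) for a given StyleGAN output resolution:
# 2 * log2(R) - 2, so 512 -> 16 and 1024 -> 18.
def n_styles(output_size: int) -> int:
    return int(math.log2(output_size)) * 2 - 2

print(n_styles(512), n_styles(1024))   # 16 18
```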

vicentowang commented 2 years ago

@yuval-alaluf y_hat has a batch size of 4? Does that mean one input image generates 4 output images?

yuval-alaluf commented 2 years ago

A batch size of 4 means there are 4 input images. The number of output images equals the number of inputs; you get one output for each input.

vicentowang commented 2 years ago

@yuval-alaluf The x tensor has a batch size of 1, but after the forward pass y_hat has a batch size of 4. Is something wrong in my config?

MKFMIKU commented 2 years ago

Facing the same issue.

vicentowang commented 2 years ago

@MKFMIKU @yuval-alaluf I added a conv layer (the final kernel_size=4 conv below), which solved the problem:

```python
class GradualStyleBlock(Module):
    def __init__(self, in_c, out_c, spatial):
        super(GradualStyleBlock, self).__init__()
        self.out_c = out_c
        self.spatial = spatial
        num_pools = int(np.log2(spatial))
        modules = []
        modules += [Conv2d(in_c, out_c, kernel_size=3, stride=2, padding=1),
                    nn.LeakyReLU()]
        for i in range(num_pools - 1):
            modules += [
                Conv2d(out_c, out_c, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU()
            ]
        modules += [Conv2d(out_c, out_c, kernel_size=4, stride=1, padding=1),
                    nn.LeakyReLU()]
        self.convs = nn.Sequential(*modules)
        self.linear = EqualLinear(out_c, out_c, lr_mul=1)

    def forward(self, x):
        x = self.convs(x)
        x = x.view(-1, self.out_c)
        x = self.linear(x)
        return x
```

yuval-alaluf commented 2 years ago

You shouldn't need to change any of the architecture, but glad to see this solved your issue.

vicentowang commented 2 years ago

@yuval-alaluf The problem is caused by the convolutions leaving a differently sized feature map, combined with the view() tensor operation you use to flatten it.

vicentowang commented 2 years ago

@yuval-alaluf Can you fix the shape problem mentioned above? I am not the only one facing this issue, and it does exist. I changed the code but still cannot get correct generated images.

yuval-alaluf commented 2 years ago

> The problem is caused by the convolutions leaving a differently sized feature map, combined with the view() tensor operation you use to flatten it.

Not sure I follow. What do you mean by "different size of feature map"? If you have not made any changes to the inner workings of the code, there should be no issue. Can you please provide more details about your data resolution, StyleGAN output size, etc.?

vicentowang commented 2 years ago

@yuval-alaluf The StyleGAN output size is 512x512, the input x has shape [1, 3, 512, 512], the ground truth y has shape [1, 3, 512, 512], and the output y_hat should be [1, 3, 512, 512], but I got [4, 3, 512, 512]. The problem is that loss(y, y_hat) cannot be calculated, since the shapes differ. Debugging this, I found that the latent code has a batch size of 4, which is why y_hat also has a batch size of 4; the latent shape is the problem.
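
A minimal, shape-only sketch (made-up sizes, not the repo's code) of how view(-1, out_c) folds leftover spatial positions into the batch dimension:

```python
import torch

# If the conv stack leaves a 2x2 map instead of 1x1, view(-1, out_c) folds the
# extra spatial positions into the batch dimension.
out_c = 512
feat = torch.randn(1, out_c, 2, 2)   # 1 image, but a 2x2 spatial map remains
flat = feat.view(-1, out_c)          # -> torch.Size([4, 512]): batch 1 became 4
print(flat.shape)
```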

yuval-alaluf commented 2 years ago

Ok. This seems to be caused because you are using a batch size of 1. Does this occur if you run with a batch size of 2?

vicentowang commented 2 years ago

@yuval-alaluf I haven't tried that yet. My GPU memory isn't enough, so I use a batch size of 1.

yuval-alaluf commented 2 years ago

I believe this is what is causing the issue. Working with a batch size of 1 results in unwanted changes to the tensor dimensions. Some changes will probably be needed to support a batch size of 1.

vicentowang commented 2 years ago

@yuval-alaluf Thanks for your concern.

vicentowang commented 2 years ago

@yuval-alaluf Not really. With the batch size set to 2, y_hat has a batch size of 8. [screenshot]

Also, what is this for? The output size is forced to 256 by the default settings. [screenshot]

yuval-alaluf commented 2 years ago

@vicentowang, I have just run the code and there is no problem with it. I believe you have either changed something in the code or changed the transforms.

Take a look at the original transforms: https://github.com/eladrich/pixel2style2pixel/blob/7a511c687bf2a8a64ba6a47150b37c7108329a6a/configs/transforms_config.py#L21-L37

You can see here that the images are resized to 256. Your images are of size 512, which tells me you have changed something. Please check that your code, transforms, and data are all correct.
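
For reference, a rough sketch of that kind of resize transform (the exact Compose lives in the linked configs/transforms_config.py and may differ in detail):

```python
from torchvision import transforms

# Resize inputs to 256x256 before the encoder, then normalize to [-1, 1].
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])
```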

As explained in the paper, regarding the face pool: since our inputs are of size 256, we resize the outputs to 256 so that we can compute the loss. This lets us use lower-resolution inputs during training to speed things up. At inference, however, you can still get the full 1024x1024 outputs.
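
A minimal sketch of that pooling step (pSp uses an adaptive average pool for this; the variable names here are illustrative):

```python
import torch

# Pool the full-resolution output down to 256x256 for the loss computation;
# the generator itself still produces the full-size image.
face_pool = torch.nn.AdaptiveAvgPool2d((256, 256))
y_full = torch.randn(2, 3, 1024, 1024)   # full-resolution generator output
y_for_loss = face_pool(y_full)           # [2, 3, 256, 256]
```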

vicentowang commented 2 years ago

I tried a batch size of 2; it doesn't work. Here is the problem: codes = self.encoder(x) generates 4 times as many latent codes as it should, when the batch size should match x. No network changes, only the resolution --output_size 512 and resize=False: y_hat, latent = self.net.forward(x, return_latents=True, resize=False) [screenshot]


yuval-alaluf commented 2 years ago

Did you change the transforms? Based on the size of x (which is 512x512), you did change something.

vicentowang commented 2 years ago

[screenshot] There is no scale change, since my input images are all 512x512.

yuval-alaluf commented 2 years ago

So you did change the code. Put back the rescaling to 256 and see if it fixes your problem. If you change the code and something doesn't work, it most likely means the change you made was the source of the problem.

vicentowang commented 2 years ago

I am confused that changes to EncodeTransforms affect the network's output shape. But I cannot add transforms.Resize((256, 256)), because my input images are not PNG or JPG, etc.; there is no resize API for EXR images, which come as numpy arrays. [screenshot]

yuval-alaluf commented 2 years ago

If you change the input size, it will change the output size, since we're working with convolutions. If you can't find a way to resize the input image, you will probably need to add another downsampling layer in the network.
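
As an aside, a float numpy array (e.g. a decoded EXR) can usually be resized without PIL by going through a tensor; a small sketch under that assumption, not part of the repo's transforms:

```python
import numpy as np
import torch
import torch.nn.functional as F

# Resize an H x W x C float array to 256x256 via bilinear interpolation.
img = np.random.rand(512, 512, 3).astype(np.float32)       # stand-in for EXR data
t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)    # -> 1 x C x H x W
t_256 = F.interpolate(t, size=(256, 256), mode='bilinear', align_corners=False)
```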

vicentowang commented 2 years ago

Here, x is the network input, which I ensure has shape 512x512, no matter how I change the image preprocessing before the forward pass, as long as the preprocessing doesn't touch the network structure. [screenshot]

vicentowang commented 2 years ago

I set the image size to 256 and now it's working: with codes = self.encoder(x), the batch sizes of codes, x, and y_hat all match, as expected. [screenshot]


If I want to train on images at 512 resolution, how should I change the code? Much appreciated.

yuval-alaluf commented 2 years ago

I don't see a reason to train on an input resolution of 512. We showed great results even when using inputs of 256. However, if you insist on training with 512, you need to add another downsampling layer in the GradualStyleBlock.
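
A minimal sketch of one way that extra downsampling could look (my assumption, not an official change): with 512x512 inputs, each feature map reaching GradualStyleBlock is twice as large as the hard-coded spatial value, so one more stride-2 conv is needed to still end at a 1x1 map before view(-1, out_c).

```python
import numpy as np
from torch import nn
from torch.nn import Conv2d

# Hypothetical helper: build the GradualStyleBlock conv stack with an optional
# extra stride-2 conv so 2x-larger feature maps still reduce to 1x1.
def build_downsampling_stack(in_c, out_c, spatial, extra_pool=False):
    num_pools = int(np.log2(spatial)) + (1 if extra_pool else 0)
    modules = [Conv2d(in_c, out_c, kernel_size=3, stride=2, padding=1), nn.LeakyReLU()]
    for _ in range(num_pools - 1):
        modules += [Conv2d(out_c, out_c, kernel_size=3, stride=2, padding=1), nn.LeakyReLU()]
    return nn.Sequential(*modules)
```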

vicentowang commented 2 years ago

Got it. Another problem: I trained a StyleGAN2-ADA model (a .pkl model file) whose input is W space. How can I make it accept W+ space input?

yuval-alaluf commented 2 years ago

pSp already handles working with W+. There is no change needed.

yuval-alaluf commented 2 years ago

All the details are in the readme. It is incredibly detailed.

vicentowang commented 2 years ago

The .pkl fails to load into your code's StyleGAN structure. My pretrained StyleGAN2-ADA model may be somewhat different from yours (not sure), so I have to build a StyleGAN2-ADA structure that takes W+ input.
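
For reference, a hedged sketch assuming NVIDIA's stylegan2-ada-pytorch checkpoint format (not part of this repo): its synthesis network already accepts a per-layer W+ tensor of shape [batch, num_ws, w_dim], so feeding W+ codes may not require structural changes.

```python
import pickle
import torch

# Assumes an NVIDIA stylegan2-ada-pytorch pickle with its repo code on the path;
# the path and shapes below are illustrative.
with open('network.pkl', 'rb') as f:
    G = pickle.load(f)['G_ema'].cuda()

# A W+ code has one 512-dim w per synthesis layer, e.g. [1, 16, 512] at 512px.
# (In practice these come from an encoder or from G.mapping, not from randn.)
w_plus = torch.randn([1, G.num_ws, G.w_dim]).cuda()
img = G.synthesis(w_plus, noise_mode='const')   # -> [1, 3, 512, 512]
```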