facebookresearch / FashionPlus

Fashion++: Minimal Edits for Outfit Improvement

Texture Encoder Network structure #7

Open YQX1996 opened 4 years ago

YQX1996 commented 4 years ago

Hi, I looked at the network structure of the texture encoder. Why is it necessary to upsample after downsampling? Won't that lose a lot of detail? Is it possible to use the encoding result directly, without upsampling? Also, how is the fashion classifier trained? Looking forward to your reply. Thanks.

wlhsiao commented 4 years ago

In our experiments, we did not observe any evidence of lost details due to upsampling. It is possible to use the encoding without upsampling, but the generator would need to be retrained, since it is currently trained to condition on the encoding after upsampling. Also, using tanh as the activation function for the last encoding layer seems to work better than ReLU, so the activation function for the last downsampling layer would also need to be changed. Finally, the segmentation map would need to be downsampled accordingly.
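For illustration, here is a minimal PyTorch sketch of the down-then-up texture encoder described above, with tanh on the last encoding layer. The layer counts, channel widths, and kernel sizes are hypothetical, not the values used in this repository:

```python
import torch.nn as nn

# Illustrative sketch of the downsample-then-upsample texture encoder.
# All dimensions here are made up, not the repo's actual values.
class TextureEncoder(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, code_ch=8, n_down=4):
        super().__init__()
        layers = [nn.Conv2d(in_ch, feat_ch, 7, padding=3), nn.ReLU(True)]
        ch = feat_ch
        for _ in range(n_down):  # halve spatial resolution each step
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(True)]
            ch *= 2
        # Last encoding layer: tanh rather than ReLU, per the reply above.
        layers += [nn.Conv2d(ch, code_ch, 3, padding=1), nn.Tanh()]
        self.down = nn.Sequential(*layers)
        # Upsample the code back to input resolution, since the generator is
        # trained to condition on the encoding after upsampling; dropping this
        # step would require retraining the generator.
        self.up = nn.Upsample(scale_factor=2 ** n_down, mode='nearest')

    def forward(self, x):
        return self.up(self.down(x))
```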

The classifier's network architecture is defined in our supplementary material. We trained the classifier with the Adam optimizer, initial learning rate 0.001, and weight decay 0.0001. We decay the learning rate by a multiplicative factor of 0.1 at epochs 60, 80, and 100, and train the model until epoch 120.
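That schedule maps onto PyTorch roughly as follows; the `classifier` below is only a placeholder for the architecture defined in the supplementary material:

```python
import torch
import torch.nn as nn

# Placeholder network; the real architecture is in the paper's supplementary.
classifier = nn.Linear(128, 1)

# Hyperparameters quoted above: Adam, lr 0.001, weight decay 0.0001.
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001, weight_decay=0.0001)
# Multiply the learning rate by 0.1 at epochs 60, 80, and 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 80, 100], gamma=0.1)

for epoch in range(120):  # train until epoch 120
    # ... one epoch of training steps on (encoding, label) batches ...
    scheduler.step()
```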

YQX1996 commented 4 years ago

Hello, thank you for your answer! I noticed that your colors are all in Lab space. Does the classifier also operate in Lab space? Looking forward to your reply. Thanks.

wlhsiao commented 4 years ago

Right, our cGAN is trained in the Lab color space. The classifier, however, takes the encodings of clothing pieces as input, not images, so its inputs are not in any color space.
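As a rough sketch of the kind of color-space preprocessing this implies (assuming `skimage`; the repository's actual normalization may differ):

```python
import numpy as np
from skimage import color

def rgb_to_lab(img_rgb):
    """Convert an HxWx3 RGB image to a normalized Lab array for cGAN training.

    Generic conversion sketch only; FashionPlus may scale channels differently.
    """
    lab = color.rgb2lab(img_rgb)            # L in [0, 100], a/b roughly [-128, 127]
    lab[..., 0] = lab[..., 0] / 50.0 - 1.0  # scale L to [-1, 1]
    lab[..., 1:] = lab[..., 1:] / 110.0     # scale a/b to roughly [-1, 1]
    return lab.astype(np.float32)
```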

YQX1996 commented 4 years ago

Hi, the classifier takes the image encodings as input, and in the paper the fashionability of clothing is judged by the classifier. How does the classifier distinguish fashionable from unfashionable? Are the input images labeled? Looking forward to your reply. Thanks.

wlhsiao commented 4 years ago

The input images are not labeled. Since all the images we used were collected from celebrities' publicly shared photos or from online fashion social platforms, most of them are fashionable, and we treat all of them as positives. For more details on training the fashionability classifier, please see Sec. 3.2 in our paper.
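As an illustration of the setup described here and in the earlier reply (per-piece encodings in, a fashionability score out), a hypothetical classifier could look like the following; the dimensions and depth are made up, not those in the supplementary material, and negatives are constructed as described in Sec. 3.2 of the paper:

```python
import torch.nn as nn

# Hypothetical fashionability classifier over concatenated per-piece
# encodings; sizes are illustrative only.
class FashionabilityClassifier(nn.Module):
    def __init__(self, code_dim=128, n_pieces=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim * n_pieces, hidden), nn.ReLU(True),
            nn.Linear(hidden, 1))  # logit: higher = more fashionable

    def forward(self, piece_codes):  # piece_codes: (B, n_pieces, code_dim)
        return self.net(piece_codes.flatten(start_dim=1))

# Binary objective: outfits from the fashion-oriented sources are positives;
# see Sec. 3.2 of the paper for how negatives are obtained.
loss_fn = nn.BCEWithLogitsLoss()
```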

YQX1996 commented 4 years ago

Hi, have you done experiments in RGB space? Why do you use Lab space? I trained the model in RGB, but I couldn't get the experimental results. Thanks.

wlhsiao commented 4 years ago

Yes, we've also trained in RGB space, and the original pix2pixHD was also trained in RGB. Could you elaborate on what results you couldn't get?

YQX1996 commented 4 years ago

Hi, thank you for your patience. According to the paper, the whole framework includes texture and shape. Structurally, the shape branch is similar to BicycleGAN. Should I also train BicycleGAN? In addition, I would like to know how big the input images are when you train pix2pixHD.

Thanks.


wlhsiao commented 4 years ago

Our shape branch architecture is not that similar to BicycleGAN's, since BicycleGAN is a cGAN, not a VAE. You could print out the shape architecture to check. If you want to operate on images with a different segmentation taxonomy, you would need to re-train the VAE (i.e., the shape structure). If you just want to change the texture encoding, you won't need to re-train the VAE. The image size we train our pix2pixHD with is 256 x 256.
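To make the cGAN-vs-VAE contrast concrete, a minimal, entirely hypothetical VAE over one-hot segmentation maps might look like this; nothing below reflects the repository's actual shape encoder:

```python
import torch
import torch.nn as nn

# Minimal VAE sketch over one-hot segmentation maps. Unlike a cGAN, it
# encodes the map into a latent code and reconstructs it, trained with a
# reconstruction loss plus a KL term. All sizes are illustrative.
class ShapeVAE(nn.Module):
    def __init__(self, n_labels=20, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(n_labels, 32, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Flatten())
        feat = 64 * 64 * 64  # for 256 x 256 input maps
        self.mu, self.logvar = nn.Linear(feat, z_dim), nn.Linear(feat, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, feat), nn.Unflatten(1, (64, 64, 64)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(32, n_labels, 4, stride=2, padding=1))

    def forward(self, seg_onehot):  # seg_onehot: (B, n_labels, 256, 256)
        h = self.enc(seg_onehot)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar
```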