bcmi / DCI-VTON-Virtual-Try-On

[ACM Multimedia 2023] Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow.
https://arxiv.org/abs/2308.06101
MIT License

Questions about your work #2

Closed Polaralia closed 1 year ago

Polaralia commented 1 year ago

Hi, thanks for your contribution. I have some questions about the work, as I'm confused by the training details in the paper.

  1. For L_simple: is this a normal L2 reconstruction loss as in Paint-by-Example, where you only modify the UNet input layer to add more channels? Does this mean you need to train the UNet input layer too? I wonder if you tried the Palette style of denoising instead.
  2. For the L_vgg used to train the diffusion model, do you decode the latents to compute the VGG loss? How many steps did you take to decode? I thought this would be very slow at each step.
  3. Are both training losses trained together, with no two-stage process? What lambda weights did you give to each training loss?
  4. Is I_gt something that is already available, meaning you don't need any paired data to train the model? So the training is self-supervised?
  5. Do you train the whole model or just some layers? How long does training take?

Warping model:

  1. Do you use a pretrained warping model, or is this something you train from scratch? If trained from scratch, how long did it take and how many data points did you use?
  2. What is the main difference between this warping model and the models from older papers?

Thanks again for your work.

Limbor commented 1 year ago

Hi @Polaralia, thank you for your interest in our work! To answer your questions:

  1. For the model architecture, i.e., the structure of the UNet, what we use is consistent with Paint-by-Example. Also, training is performed on the entire UNet.
  2. For the VGG loss, we do indeed compute it after decoding the latents. In our experiments, we only sampled once (see the sketch after this list).
  3. During training, the warping model and the diffusion model are trained separately.
  4. For the virtual try-on task, we have an image of a person and an image of the clothes that person is wearing, which together constitute a training pair. I_gt denotes the person image.
  5. Yes. It takes about 2 days on 2~4 A100 GPUs.
  6. For the warping module, we simply retrained the model based on the previous method. For more details, please refer to the complete training code we will release later.
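
For readers following along, a minimal sketch of how these two pieces could fit together (an inpainting-style UNet whose input layer takes extra concatenated channels, L_simple as an L2 noise-prediction loss, and L_vgg computed on a single-step decoded estimate of the clean latent) might look like the following. All module and variable names here are placeholders, not the repository's actual API; a Stable-Diffusion-style VAE and a torchvision VGG are assumed.

```python
# Hypothetical sketch only: `unet`, `vae`, and the tensor layout are placeholders.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG feature extractor for the perceptual loss (inputs assumed normalized).
vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def training_losses(unet, vae, z0, cond_latents, noise, t, alphas_cumprod, x_gt, lambda_vgg=1.0):
    """z0: clean image latent; cond_latents: extra conditioning channels; x_gt: person image."""
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * noise            # forward diffusion
    eps_pred = unet(torch.cat([z_t, cond_latents], dim=1), t)   # widened UNet input layer
    l_simple = F.mse_loss(eps_pred, noise)

    # Single-step ("sampled once") estimate of the clean latent, then decode
    # and compare against the ground-truth image in VGG feature space.
    z0_pred = (z_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    x_pred = vae.decode(z0_pred)
    l_vgg = F.l1_loss(vgg_features(x_pred), vgg_features(x_gt))

    return l_simple + lambda_vgg * l_vgg
```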

Polaralia commented 1 year ago

Hi @Limbor, thank you for the explanation! Now I understand the paper better. For the warping method, which one do you use? I am not familiar with warping but am curious about this. Is there a fixed resolution for the output flow pixels, and how long do you need to train it?

Limbor commented 1 year ago

@Polaralia In our experiments, we used PF-AFN as the backbone of the warping module and trained only at 256*192 resolution; it took about 10 hours. I hope this clarifies your doubts.
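
For readers unfamiliar with appearance flow, a rough sketch of how a predicted flow field can be used to warp the in-shop garment at 256x192 might look like the following; the PF-AFN flow predictor itself is omitted, and `flow` is simply assumed to be its output in pixel offsets.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(cloth, flow):
    """cloth: [B, 3, H, W] garment image; flow: [B, 2, H, W] pixel offsets (dx, dy)."""
    B, _, H, W = cloth.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=cloth.device),
        torch.linspace(-1, 1, W, device=cloth.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and add them to the base grid.
    offset = torch.stack(
        (flow[:, 0] / ((W - 1) / 2), flow[:, 1] / ((H - 1) / 2)), dim=-1
    )
    return F.grid_sample(cloth, base + offset, padding_mode="border", align_corners=True)

# Zero flow reproduces the input; a learned flow pulls garment pixels onto the body shape.
warped = warp_with_flow(torch.rand(1, 3, 256, 192), torch.zeros(1, 2, 256, 192))
```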

Polaralia commented 1 year ago

Hi @Limbor, thank you. Do you know if your method works for garments other than tops, for example bottoms? It seems that the warping model only deals with tops; do you know if the method works for bottoms, dresses, or other kinds of clothes?

Limbor commented 1 year ago

Hi @Polaralia. Actually, in the supplementary material, we conducted experiments on DressCode, which contains three categories of clothes (i.e., tops, bottoms, and dresses). According to the experimental results, the warping module is capable of handling different kinds of clothes.

Polaralia commented 1 year ago

@Limbor Thank you for the clarification. This is a very interesting contribution. If I am correct, the warping module from PF-AFN only deals with the top category, but you have trained it for all categories? I am curious: for out-of-distribution data, does the prediction work accurately, and what are some bad edge cases you have seen? Will you be releasing the checkpoints/code for this too?

I am interested in building on your work and learning how the warping model works with diffusion. I am not publishing in the try-on space but work more generally in the diffusion space, and I will cite your work where relevant if I manage to publish. Thank you for your work!

Limbor commented 1 year ago

@Polaralia Yes, we trained on the other clothes categories using the PF-AFN network without specific modifications. In terms of results, predicting pants is relatively simple, but predicting the length of skirts (especially dresses) is a more difficult issue. As for the training code for DressCode, we will also release it later.

Polaralia commented 1 year ago

Hi @Limbor, thanks for sharing! I am certainly interested in checking out the checkpoint and prediction example code for the warping models once available, as I am curious about the effects of warping. Thank you!

Polaralia commented 1 year ago

Thanks for releasing the code @Limbor. Is the warping model available as well?

Limbor commented 1 year ago

Yes, we have released its model weights, named warp_viton.pth, and we will also release the inference and training code for the warping module along with usage instructions.
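
For anyone who just wants to confirm what the released checkpoint contains, a quick inspection with plain PyTorch might look like this; how the weights plug into the warping network is defined by the repository's warping code, so treat this only as an inspection aid.

```python
import torch

# Load the released warping checkpoint and list a few of its entries.
ckpt = torch.load("warp_viton.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} entries")
for name, value in list(state.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```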

RishiGitH commented 1 year ago

@Limbor when do you think you will be able to provide the inference code?

Limbor commented 1 year ago

@RishiGitH We have released the inference and training code for the warping module.

sudip550 commented 1 year ago

@Limbor I have 5 images of a girl and 2 images of clothes... can we generate images from these?

Limbor commented 1 year ago

@sudip550 If you want to test this model on your own data, you should first do some preprocessing on your data, referring to https://github.com/sangyun884/HR-VITON/issues/45#issue-1515217009, and then do warping and blending with our model.