Zheng-Chong / CatVTON

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a Lightweight Network (899.06M parameters in total), 2) Parameter-Efficient Training (49.57M trainable parameters), and 3) Simplified Inference (< 8GB VRAM at 1024×768 resolution).

Details about training setting #14

Open ymchen7 opened 3 months ago

ymchen7 commented 3 months ago

Good work on the design of such a simple VTON pipeline.

I have tried to train CatVTON on the VITON-HD dataset, but the results are a little blurry, as shown below (38k iterations, batch size 8×32, 512×384 resolution input, only the attention parameters trained).

[image: blurry try-on results]
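For context on the setup above: training "only the attention parameters" amounts to freezing the base UNet and unfreezing just the self-attention weights. A minimal sketch using diffusers, assuming the usual `attn1` naming for self-attention blocks; the checkpoint name is illustrative, not necessarily the one CatVTON starts from:

```python
import torch
from diffusers import UNet2DConditionModel

# Load a Stable-Diffusion-style inpainting UNet (illustrative checkpoint).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet"
)

# Freeze everything, then unfreeze only the self-attention modules.
# In diffusers UNets, self-attention blocks are named "attn1".
unet.requires_grad_(False)
trainable = []
for name, module in unet.named_modules():
    if name.endswith("attn1"):
        for param in module.parameters():
            param.requires_grad = True
            trainable.append(param)

print(f"trainable params: {sum(p.numel() for p in trainable) / 1e6:.2f}M")
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```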

I'm wondering whether there is any specific setting or trick in the loss part, for example how the loss is computed (i.e., on the latents of the person image or on the concatenated latents).

I also noticed that the training loss is relatively small at the beginning of training; is this normal?

Epoch 0, step 0, step_loss: 0.06322, data_time: 2.104, time: 4.421
Epoch 0, step 1, step_loss: 0.04681, data_time: 0.058, time: 2.126
Epoch 0, step 2, step_loss: 0.06814, data_time: 0.058, time: 2.124
Epoch 0, step 3, step_loss: 0.03120, data_time: 0.064, time: 2.139
Epoch 0, step 4, step_loss: 0.02966, data_time: 0.059, time: 2.132
Epoch 0, step 5, step_loss: 0.03977, data_time: 0.059, time: 2.132
Epoch 0, step 6, step_loss: 0.05645, data_time: 0.059, time: 2.133

Zheng-Chong commented 3 months ago

It looks like there is nothing wrong with your visualization results or your loss. The loss is calculated on the concatenated latents. The training details are explained in the paper; there are no other special tricks in the loss part.
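For readers asking how that loss might look in code, here is a minimal sketch of an epsilon-prediction training step with the MSE taken over the concatenated latents. The `denoiser` wrapper, the variable names, and the height-axis concatenation are assumptions for illustration; the real CatVTON UNet also receives the inpainting mask and masked-person latents, which are omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def training_loss(person_image, garment_image, vae, scheduler, denoiser):
    """One training step's loss on the concatenated latents.

    `denoiser` is a hypothetical wrapper around the UNet; the real model
    also takes the inpainting mask and masked-person latents as inputs.
    """
    # Encode both images into the VAE latent space (0.18215 is the usual
    # Stable Diffusion latent scaling factor).
    person_latent = vae.encode(person_image).latent_dist.sample() * 0.18215
    garment_latent = vae.encode(garment_image).latent_dist.sample() * 0.18215

    # Concatenate person and garment along a spatial axis (height, dim=2).
    latents = torch.cat([person_latent, garment_latent], dim=2)

    # Standard epsilon-prediction diffusion objective.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    noise_pred = denoiser(noisy_latents, timesteps)

    # MSE over the full concatenated latent: person half plus garment half.
    return F.mse_loss(noise_pred, noise)
```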

BarryyyDrew commented 3 months ago

Is it trainable? Could you please tell me how to train this model?

franciszzj commented 3 months ago

@ymchen7 @Zheng-Chong I trained with both settings, computing the loss only on the person latents and on the concatenated latents, and found that the results are similar. Because the garment is provided as a condition, it is very easy for the model to learn it; the part that really makes a difference is the loss on the person latents.
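franciszzj's person-only variant would be a one-line change to the sketch above: slice out the person half of the prediction before the MSE. The tensor layout (person occupying the top half of the height axis) carries over the earlier assumption.

```python
import torch.nn.functional as F

def person_only_loss(noise_pred, noise):
    # noise_pred / noise: (B, C, 2H, W), with the person in the top half
    # of the height axis, matching the concatenation order above.
    h = noise_pred.shape[2] // 2
    return F.mse_loss(noise_pred[:, :, :h], noise[:, :, :h])
```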

Zheng-Chong commented 3 months ago

> @ymchen7 @Zheng-Chong I trained with both settings, computing the loss only on the person latents and on the concatenated latents, and found that the results are similar. Because the garment is provided as a condition, it is very easy for the model to learn it; the part that really makes a difference is the loss on the person latents.

That makes sense.

Abhij-ma commented 2 months ago

@ymchen7 Hello, could you tell me how I can train this model on my machine?

HamnaAkram commented 2 months ago

@ymchen7 How did you create the training pipeline, considering no details are given in the code?

awais-nayyar commented 2 months ago

@ymchen7 I also want to know. Could you please guide us on how to train this model on our machines?

catsled commented 1 week ago

> Good work on the design of such a simple VTON pipeline.
>
> I have tried to train CatVTON on the VITON-HD dataset, but the results are a little blurry, as shown below (38k iterations, batch size 8×32, 512×384 resolution input, only the attention parameters trained). [image: blurry try-on results]
>
> I'm wondering whether there is any specific setting or trick in the loss part, for example how the loss is computed (i.e., on the latents of the person image or on the concatenated latents).
>
> I also noticed that the training loss is relatively small at the beginning of training; is this normal?
>
> Epoch 0, step 0, step_loss: 0.06322, data_time: 2.104, time: 4.421
> Epoch 0, step 1, step_loss: 0.04681, data_time: 0.058, time: 2.126
> Epoch 0, step 2, step_loss: 0.06814, data_time: 0.058, time: 2.124
> Epoch 0, step 3, step_loss: 0.03120, data_time: 0.064, time: 2.139
> Epoch 0, step 4, step_loss: 0.02966, data_time: 0.059, time: 2.132
> Epoch 0, step 5, step_loss: 0.03977, data_time: 0.059, time: 2.132
> Epoch 0, step 6, step_loss: 0.05645, data_time: 0.059, time: 2.133

I have the same issue: the colors are not correct for many garments, e.g., light red becomes deep red. Did you find any way to solve it?