Zheng-Chong / CatVTON

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a Lightweight Network (899.06M parameters in total), 2) Parameter-Efficient Training (49.57M trainable parameters), and 3) Simplified Inference (< 8 GB VRAM at 1024×768 resolution).

Details about training setting #14

Open ymchen7 opened 1 month ago

ymchen7 commented 1 month ago

Nice work on the design of such a simple VTON pipeline.

I have tried to train CatVTON on the VITON-HD dataset, but the result is a little blurry, as shown below (38k iterations, batch size 8×32, 512×384 resolution input; only the attention parameters are trained).

[attached image: blurry try-on result]

I'm wondering whether there is any specific setting or trick in the loss part, for example in how the loss is computed (i.e., is the loss computed on the latents of the person image only, or on the concatenated latents?).

I also noticed that the training loss is relatively small at the beginning of training. Is this normal?

Epoch 0, step 0, step_loss: 0.06322, data_time: 2.104, time: 4.421
Epoch 0, step 1, step_loss: 0.04681, data_time: 0.058, time: 2.126
Epoch 0, step 2, step_loss: 0.06814, data_time: 0.058, time: 2.124
Epoch 0, step 3, step_loss: 0.03120, data_time: 0.064, time: 2.139
Epoch 0, step 4, step_loss: 0.02966, data_time: 0.059, time: 2.132
Epoch 0, step 5, step_loss: 0.03977, data_time: 0.059, time: 2.132
Epoch 0, step 6, step_loss: 0.05645, data_time: 0.059, time: 2.133

Zheng-Chong commented 1 month ago

It looks like there is nothing wrong with your visualization results or loss. The loss is calculated on the concatenated latents. The training details are explained in the paper; there are no other special tricks in the loss part.
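To make the "loss on the concatenated latents" setup concrete, here is a minimal sketch of a diffusion training step under that setting. All names (`diffusion_step_loss`, `dummy_unet`, the latent shapes, the concatenation axis) are illustrative assumptions, not CatVTON's actual code; the one fixed point from the thread is that the MSE is taken over the full concatenated latent, person and garment halves alike.

```python
import torch
import torch.nn.functional as F

def diffusion_step_loss(unet, person_latents, garment_latents,
                        alphas_cumprod, timesteps):
    # Concatenate person and garment latents along the height axis
    # (CatVTON conditions by concatenation rather than extra modules).
    x0 = torch.cat([person_latents, garment_latents], dim=2)
    noise = torch.randn_like(x0)
    # Standard DDPM forward process: x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps
    a_t = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise
    noise_pred = unet(x_t, timesteps)
    # MSE over the entire concatenated latent, not just the person half.
    return F.mse_loss(noise_pred, noise)

# Toy check with a trivial "UNet" stub and random latents.
dummy_unet = lambda x, t: torch.zeros_like(x)
alphas = torch.linspace(0.999, 0.01, 1000)
person = torch.randn(2, 4, 64, 48)
garment = torch.randn(2, 4, 64, 48)
t = torch.randint(0, 1000, (2,))
loss = diffusion_step_loss(dummy_unet, person, garment, alphas, t)
```

With a real UNet, `loss.backward()` would then update only the trainable (attention) parameters.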

BarryyyDrew commented 1 month ago

Is it trainable? Could you please tell me how to train this model?

franciszzj commented 1 month ago

@ymchen7 @Zheng-Chong I trained with both the loss on only the person latents and the loss on the concatenated latents, and found that the results are similar. Because the garment is provided as a condition, it is very easy for the model to learn it. The part that really makes a difference is the loss on the person latents.
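The two variants being compared above differ only in whether the garment half of the latent is included in the MSE. A minimal sketch of that choice, assuming (as an illustration, not CatVTON's actual code) that the person latent occupies the first `person_h` rows of the concatenated height axis:

```python
import torch
import torch.nn.functional as F

def noise_loss(noise_pred, noise, person_h, person_only=False):
    """MSE noise-prediction loss, optionally restricted to the person half.

    The first `person_h` rows along the height axis are assumed to be the
    person latent; the rest is the garment latent, which is already given
    as a condition.
    """
    if person_only:
        noise_pred = noise_pred[:, :, :person_h]
        noise = noise[:, :, :person_h]
    return F.mse_loss(noise_pred, noise)

# Both variants on random stand-in tensors (two 64-row latents stacked).
pred = torch.randn(2, 4, 128, 48)
target = torch.randn(2, 4, 128, 48)
full_loss = noise_loss(pred, target, person_h=64)
person_loss = noise_loss(pred, target, person_h=64, person_only=True)
```

The observation in the comment is that, since the garment part is an easy reconstruction target, including it mostly dilutes the loss rather than changing what is learned.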

Zheng-Chong commented 1 month ago

> @ymchen7 @Zheng-Chong I trained with both the loss on only the person latents and the loss on the concatenated latents, and found that the results are similar. Because the garment is provided as a condition, it is very easy for the model to learn it. The part that really makes a difference is the loss on the person latents.

That makes sense.

Abhij-ma commented 3 weeks ago

@ymchen7 Hello, could you tell me how I can train this model on my machine?

HamnaAkram commented 3 weeks ago

@ymchen7 How did you create the training pipeline, considering no details are given in the code?

awais-nayyar commented 1 week ago

@ymchen7 I also want to know. Could you please guide us on how to train this model on our machines?