Zheng-Chong / CatVTON

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8G VRAM at 1024×768 resolution).

About the first conv layer of UNet #72

Closed: qiuzidian closed this issue 1 month ago

qiuzidian commented 1 month ago

Hello. According to the paper, the UNet takes 8 input latent channels rather than 4, is that right? I would like to know how you handle the dimension mismatch. From the code, it seems that only the attention layers were modified. Could you help explain?

Zheng-Chong commented 1 month ago

The first version on ArXiv contains some errors in the presentation of certain formulas. The number of input channels for the UNet model is 9. We are currently preparing the second version of the paper.
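
For readers following along, here is a minimal sketch of how a 9-channel UNet input could be accounted for. It assumes the usual Stable Diffusion inpainting layout (4 noisy-latent channels, 1 mask channel, 4 masked-image latent channels) and a first conv layer with 320 output channels and a 3x3 kernel, as in Stable Diffusion 1.x. None of these specifics are confirmed in this thread; `ASSUMED_INPUT_PARTS` and `ConvSpec` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# ASSUMPTION: channel layout borrowed from the Stable Diffusion inpainting
# convention; the thread only states that the total is 9.
ASSUMED_INPUT_PARTS = {
    "noisy_latent": 4,         # latent being denoised
    "mask": 1,                 # inpainting mask, resized to latent resolution
    "masked_image_latent": 4,  # VAE latent of the masked image
}


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical description of a 2D conv layer (shape bookkeeping only)."""
    in_channels: int
    out_channels: int
    kernel_size: int

    def weight_shape(self):
        # Same ordering PyTorch uses for Conv2d weights: (out, in, kH, kW).
        return (self.out_channels, self.in_channels,
                self.kernel_size, self.kernel_size)

    def num_params(self):
        out_ch, in_ch, k_h, k_w = self.weight_shape()
        return out_ch * in_ch * k_h * k_w + out_ch  # weights + bias


def total_input_channels(parts):
    """Channel count after concatenating all parts along the channel axis."""
    return sum(parts.values())


in_ch = total_input_channels(ASSUMED_INPUT_PARTS)

# ASSUMPTION: 320 output channels and a 3x3 kernel, as in SD 1.x `conv_in`.
conv_in = ConvSpec(in_channels=in_ch, out_channels=320, kernel_size=3)

print("input channels:", in_ch)
print("conv_in weight shape:", conv_in.weight_shape())
print("conv_in parameters:", conv_in.num_params())
```

Under these assumptions, a pretrained inpainting UNet would already accept 9 channels at its first conv layer, so no change to that layer would be needed to resolve the mismatch asked about above.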

qiuzidian commented 1 month ago