CatVTON is a simple and efficient virtual try-on diffusion model with 1) Lightweight Network (899.06M parameters in total), 2) Parameter-Efficient Training (49.57M trainable parameters) and 3) Simplified Inference (< 8 GB VRAM at 1024×768 resolution).
Hello, according to the paper, the UNet takes 8 input latent channels rather than 4? I would like to know how you handle the dimension mismatch. From the code, it seems that only the attention layers were modified. Could you help explain?
The first version on arXiv contains errors in the presentation of some formulas. The number of input channels for the UNet model is 9. We are currently preparing the second version of the paper.
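
For anyone trying to see where a 9-channel input could come from, here is a shape-only sketch. It assumes the standard Stable Diffusion inpainting layout (4 noisy-latent channels + 1 mask channel + 4 masked-image-latent channels) and an 8x VAE downscale; this thread does not confirm that CatVTON composes its input this way, and the function names are made up for illustration.

```python
# Shape-only sketch: no tensors are allocated, only (C, H, W) tuples.
# Assumption: SD-inpainting-style layout, not confirmed for CatVTON here.

VAE_DOWNSCALE = 8    # assumption: the VAE shrinks height and width by 8
LATENT_CHANNELS = 4  # assumption: the VAE emits 4 latent channels


def concat_channels(*shapes):
    """Return the (C, H, W) shape after concatenating along the channel axis."""
    spatial_sizes = {(h, w) for _, h, w in shapes}
    if len(spatial_sizes) != 1:
        raise ValueError(f"spatial sizes differ: {sorted(spatial_sizes)}")
    h, w = next(iter(spatial_sizes))
    return (sum(c for c, _, _ in shapes), h, w)


def unet_input_shape(height, width):
    """Hypothetical helper: UNet input shape for an image of the given size."""
    h, w = height // VAE_DOWNSCALE, width // VAE_DOWNSCALE
    noisy_latent = (LATENT_CHANNELS, h, w)
    mask = (1, h, w)
    masked_image_latent = (LATENT_CHANNELS, h, w)
    return concat_channels(noisy_latent, mask, masked_image_latent)


print(unet_input_shape(1024, 768))  # (9, 128, 96)
```

Under these assumptions the first convolution of the UNet already expects 9 channels, so no extra layer would be needed to absorb a channel mismatch.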