Hi!
I have found your work very interesting and inspiring ever since the first VAR release. It would be nice for the project to support the widely used image-conditional generation, in a manner similar to ControlNet.
Transformer architectures, especially autoregressive ones, differ significantly from the U-Net models that ControlNet was designed for, but there have been successful efforts to adapt the idea.
The PixArt project achieved this effect by building a mirror transformer of half the depth and delivering the processed conditions to the base model through zero-initialized linear layers.
https://github.com/PixArt-alpha/PixArt-alpha/blob/a9f400f09614e212f5f4ace0162d0106d10a5ec8/asset/docs/pixart_controlnet.md
You can see some of the results in the PixArt-δ technical report; they look quite promising: https://arxiv.org/pdf/2401.05252
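To make the request more concrete, here is a minimal toy sketch of the idea as I understand it: a control branch that copies the first half of the base blocks and feeds its output back through zero-initialized linear layers. It is plain Python with a made-up stand-in for a transformer block, so `block`, `build_control_branch`, and `forward` are hypothetical names, and the exact wiring in PixArt (or whatever would suit VAR) may well differ.

```python
import copy
import math
import random


def make_linear(dim, rng=None):
    """Return (weight, bias) for a dim x dim layer; all zeros when rng is None."""
    if rng is None:
        return [[0.0] * dim for _ in range(dim)], [0.0] * dim
    weight = [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(dim)]
    return weight, [0.0] * dim


def linear(params, x):
    """Apply y = W x + b to a single token vector."""
    weight, bias = params
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def block(params, x):
    """Toy stand-in for a transformer block (assumption): x + tanh(W x + b)."""
    return [xi + math.tanh(hi) for xi, hi in zip(x, linear(params, x))]


def build_control_branch(base_blocks):
    """Hypothetical helper: copy the first half of the base blocks and pair
    each copy with a zero-initialized output projection."""
    half = len(base_blocks) // 2
    dim = len(base_blocks[0][1])
    copies = [copy.deepcopy(p) for p in base_blocks[:half]]
    zero_projs = [make_linear(dim) for _ in range(half)]
    return copies, zero_projs


def forward(base_blocks, x, control=None, cond=None):
    """Run the frozen base stack, optionally adding control residuals."""
    h = x
    if control is not None:
        copies, zero_projs = control
        # The control branch sees the input plus the condition embedding.
        c = [xi + ci for xi, ci in zip(x, cond)]
    for i, params in enumerate(base_blocks):
        h = block(params, h)
        if control is not None and i < len(copies):
            c = block(copies[i], c)
            # Zero-initialized projection: contributes nothing before training.
            h = [hi + ri for hi, ri in zip(h, linear(zero_projs[i], c))]
    return h


rng = random.Random(0)
dim, depth = 4, 6
base = [make_linear(dim, rng) for _ in range(depth)]
control = build_control_branch(base)
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
cond = [rng.gauss(0.0, 1.0) for _ in range(dim)]

plain = forward(base, x)
controlled = forward(base, x, control=control, cond=cond)
```

Because the projections start at zero, `controlled` matches `plain` before any training, so the pretrained model's behavior is untouched at the start and the condition only gains influence as the control branch is trained. How the condition should be tokenized for VAR's scale-wise autoregressive setup is an open question that this sketch does not address.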
I think this would help the popularity, adoption, and community around the concept.