FoundationVision / LlamaGen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
https://arxiv.org/abs/2406.06525
MIT License

[Feature] ControlNet support via a process similar to PixArt's ControlNet-Transformer #24

Open kabachuha opened 3 months ago

kabachuha commented 3 months ago

Hi!

I have found your work very interesting and inspiring ever since the first VAR release. It would be nice for a project like this to support the widely used image-conditional generation in a manner similar to ControlNet.

Transformer architectures, especially autoregressive ones, differ significantly from UNet-based ControlNets; however, successful efforts have been made.

The PixArt project achieves this by mirroring the first half of the transformer's blocks and injecting the processed conditions back into the base model through zero-initialized linear layers. A rough sketch is shown after the link below.

https://github.com/PixArt-alpha/PixArt-alpha/blob/a9f400f09614e212f5f4ace0162d0106d10a5ec8/asset/docs/pixart_controlnet.md
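Here is a minimal PyTorch sketch of that ControlNet-Transformer idea, not LlamaGen's or PixArt's actual code; the `Block`, `ControlNetTransformer`, and `zero_projs` names, the dimensions, and the way the condition is fused are all hypothetical placeholders:

```python
# Sketch of a ControlNet-style mirror transformer with zero-initialized
# output projections (assumptions only, not the PixArt or LlamaGen code).
import copy
import torch
import torch.nn as nn


class Block(nn.Module):
    """Stand-in for one transformer block (attention/MLP details omitted)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class ControlNetTransformer(nn.Module):
    """Copy the first half of the base blocks as a trainable control branch;
    inject its output into the (frozen) base via zero-initialized linears."""
    def __init__(self, base_blocks: nn.ModuleList, dim: int):
        super().__init__()
        half = len(base_blocks) // 2
        # Trainable copies of the first half of the base transformer.
        self.control_blocks = nn.ModuleList(copy.deepcopy(b) for b in base_blocks[:half])
        # Zero-initialized projections, so training starts from the base model's behavior.
        self.zero_projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(half))
        for proj in self.zero_projs:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)
        self.base_blocks = base_blocks  # kept frozen during control training

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        c = x + cond  # condition tokens fused with the input sequence
        for i, block in enumerate(self.base_blocks):
            if i < len(self.control_blocks):
                c = self.control_blocks[i](c)
                x = block(x) + self.zero_projs[i](c)  # residual control signal
            else:
                x = block(x)
        return x


# Usage: a 12-block "base" model, 8 tokens of width 256, e.g. edge-map condition tokens.
dim, depth = 256, 12
base = nn.ModuleList(Block(dim) for _ in range(depth))
model = ControlNetTransformer(base, dim)
out = model(torch.randn(2, 8, dim), torch.randn(2, 8, dim))
print(out.shape)  # torch.Size([2, 8, 256])
```

Because the projections start at zero, the combined model initially reproduces the base model exactly, and the control branch learns to steer it as training proceeds.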

You can see some of the results in PixArt-δ's technical report; they look quite promising: https://arxiv.org/pdf/2401.05252

I think this would be useful for the popularity, adoption, and community around the project.

lxa9867 commented 1 month ago

You may try this repo: https://github.com/lxa9867/ControlVAR