[Feature] ControlNet support via process similar to PixArt's ControlNet-Transformer

Hi!

I have found your work very interesting and inspiring ever since the first VAR release. It would be nice for this project to support the widely used image-conditional generation, in a manner similar to ControlNet.

Transformer architectures, especially autoregressive ones, differ significantly from the UNet backbones that ControlNet was originally built for. However, there have been successful efforts to adapt the idea to transformers.

The PixArt project achieved this by building a half-depth mirror of the transformer and feeding the processed conditions back into the base model through zero-initialized linear layers.

https://github.com/PixArt-alpha/PixArt-alpha/blob/a9f400f09614e212f5f4ace0162d0106d10a5ec8/asset/docs/pixart_controlnet.md
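
To make the idea concrete, here is a minimal NumPy sketch of the mechanism described above. It is only an illustration under my own assumptions, not PixArt's or LlamaGen's actual code: `Block` and `ControlledStack` are hypothetical names, the "transformer block" is a toy residual layer, and the exact wiring between the copied and base blocks is simplified.

```python
import numpy as np


class Block:
    """Toy stand-in for a transformer block: residual + tanh(linear)."""

    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=(dim, dim))

    def copy(self):
        clone = Block.__new__(Block)
        clone.w = self.w.copy()
        return clone

    def __call__(self, x):
        return x + np.tanh(x @ self.w)


class ControlledStack:
    """Base stack plus a half-depth control branch (hypothetical sketch)."""

    def __init__(self, dim, depth, rng):
        # Base blocks: these would stay frozen during control training.
        self.base = [Block(dim, rng) for _ in range(depth)]
        # Control branch: trainable copies of the first half of the blocks.
        self.n_ctrl = depth // 2
        self.ctrl = [b.copy() for b in self.base[: self.n_ctrl]]
        # Zero-initialized projections: the branch contributes nothing at start.
        self.zero_proj = [np.zeros((dim, dim)) for _ in range(self.n_ctrl)]

    def __call__(self, x, cond=None):
        c = None
        for i, block in enumerate(self.base):
            if cond is not None and i < self.n_ctrl:
                # The first copy sees hidden state + condition; later copies
                # continue from the previous control activation.
                c = self.ctrl[i](x + cond if i == 0 else c)
                x = block(x) + c @ self.zero_proj[i]
            else:
                x = block(x)
        return x


rng = np.random.default_rng(0)
model = ControlledStack(dim=8, depth=4, rng=rng)
tokens = rng.normal(size=(5, 8))     # (sequence length, hidden dim)
condition = rng.normal(size=(5, 8))  # condition already embedded to the same shape

# With zero-initialized projections, conditioning does not change the output yet.
assert np.allclose(model(tokens), model(tokens, condition))
```

The point of the zero initialization is that training starts from the unmodified base model, so the control branch can only gradually learn to steer it. How the condition tokens would be aligned with an autoregressive token sequence is left open here.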

You can see some of the results in the PixArt-δ technical report; they look quite promising: https://arxiv.org/pdf/2401.05252

I think it would be useful for the popularity, adoption, and community around the concept.

FoundationVision / LlamaGen