As current VLM support mainly focuses on encoder-decoder style models (which encode and decode multimodal information as a hidden state tensor), we need to support auto-regressive VLMs including Chameleon & Anole (which encode and decode multimodal information as tokens).
Required prerequisites
Motivation
As current VLM support mainly focuses on encoder-decoder style models (which encode and decode multimodal information as a hidden state tensor), we need to support auto-regressive VLMs including Chameleon & Anole (which encode and decode multimodal information as tokens).
Solution
No response
Alternatives
No response
Additional context
No response