Open · win10ogod opened this issue 2 weeks ago

Can I replace the Phi-3 LLM? For example, could I use Qwen2 or Llama 3.2 instead?
I believe you'll need to retrain the model from scratch. The paper seems to imply that they initialized the backbone's parameters from Phi-3 and then trained it on their data with the VAE and text encoding frozen. Furthermore, the attention mechanism of a pure-text LLM might need to be modified to work with image tokens. BTW, the paper includes a figure illustrating their approach that's worth checking out.
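To make the point above concrete, here is a minimal sketch of what swapping the backbone entails, using generic Hugging Face calls. The model IDs (`Qwen/Qwen2-1.5B`, `stabilityai/sdxl-vae`), the patch size, the projection layers, and the mask construction are illustrative assumptions, not the repo's actual code:

```python
# Minimal sketch of "replace the backbone, but retrain".
# Assumptions: model IDs, patch size, and projection shapes are illustrative.
import torch
from torch import nn
from transformers import AutoModelForCausalLM
from diffusers import AutoencoderKL

# 1. Swap the backbone: initialize from a different pretrained LLM
#    (the paper initializes from Phi-3; here we try Qwen2 instead).
backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")

# 2. Load the VAE and freeze it, as the paper describes.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae.requires_grad_(False)

# 3. New projections must map VAE latents into the new backbone's hidden
#    size. They are randomly initialized, which is one reason a simple
#    weight swap can't work and retraining is unavoidable.
hidden = backbone.config.hidden_size
latent_channels = vae.config.latent_channels  # typically 4
patch = 2  # hypothetical patchification factor
img_in = nn.Linear(latent_channels * patch * patch, hidden)
img_out = nn.Linear(hidden, latent_channels * patch * patch)

# 4. Pure-text LLMs use a causal mask; image tokens typically need
#    bidirectional attention among themselves. Sketch of a mixed mask:
T, n_img = 16, 8  # total tokens, trailing image tokens (illustrative)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # causal base
mask[T - n_img:, T - n_img:] = True  # image tokens attend to each other

# Only the backbone and the new projections receive gradients.
trainable = list(backbone.parameters()) + list(img_in.parameters()) + list(img_out.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Since the projections start from random weights and the attention semantics change, the released checkpoint can't just be loaded into a new backbone; the whole thing has to be trained again.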
As @able2608 mentioned, you can replace this LLM, but it requires retraining.
@staoxiao How long does each training stage take (on 104 A800 GPUs)?