How is post-training conducted for the two tasks of multimodal understanding and image generation? Are they trained jointly, as in Show-O, or separately? Also, what is the approximate total number of training samples, and what is the ratio between the two tasks?