jianzongwu opened 5 days ago
Hi! We have ablated the 3rd stage; here is a comparison, and we will update these results on arXiv. We find that images generated before the de-tokenizer adaptation stage exhibit semantic relevance, with consistent backgrounds and characters, thanks to the MLLM's context preservation. However, they suffer from texture distortion and inconsistent style. After de-tokenizer adaptation, the images show improved consistency in style and character appearance. The FID scores in Table 4 confirm that de-tokenizer adaptation improves image quality.
Thank you very much for the quick reply. I want to ask another question. In your paper, you mention that you feed a story of sequence length 10 into the MLLM, but only supervise the last element of the sequence. It seems possible to train on every element instead, which could reduce the number of steps needed to converge. Can you tell me why you only train the MLLM on the last element?
Hi! We actually train on sequences with a maximum length of 10. For example, during training we may sample a sequence of 6 text-image pairs, and we compute the loss only on the 6th text-image pair.
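A minimal sketch of this last-pair supervision (not the authors' code; the function and tensor layout are illustrative assumptions): the loss is masked so that only tokens belonging to the final text-image pair in the sampled sequence contribute.

```python
import torch
import torch.nn.functional as F

def last_pair_loss(logits, targets, pair_ids):
    """Cross-entropy over only the tokens of the last text-image pair.

    logits:   (T, V) per-token vocabulary logits from the MLLM
    targets:  (T,)   ground-truth token ids
    pair_ids: (T,)   index of the text-image pair each token belongs to
    """
    last = pair_ids.max()
    mask = (pair_ids == last).float()            # supervise only the final pair
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * mask).sum() / mask.sum() # mean over supervised tokens

# Toy example: 3 pairs of 4 tokens each; only pair index 2 is supervised.
T, V = 12, 8
logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
pair_ids = torch.arange(T) // 4
loss = last_pair_loss(logits, targets, pair_ids)
```

Training on every pair in the sequence would instead use an all-ones mask (or per-pair weights), which is the alternative being asked about.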
Understood.
So why do you supervise only the 6th text-image pair, rather than all of the text-image pairs?
Have you ablated the three-stage training setting?
If I train the model using only stages 1 and 2, will the performance change a lot?