TencentARC / SEED-Story

SEED-Story: Multimodal Long Story Generation with Large Language Model
https://arxiv.org/abs/2407.08683

Ablations about the three-stage training setting #25

Open jianzongwu opened 5 days ago

jianzongwu commented 5 days ago

Have you ablated the three-stage training setting?

If I only train the model using stages 1 and 2, will the performance change a lot?

AndysonYs commented 3 days ago
[comparison images attached]

Hi! We have ablated the 3rd stage. Here is a comparison result; we will update these results on arXiv. We find that the images generated before the de-tokenizer adaptation stage exhibit semantic relevance, with consistent backgrounds and characters thanks to the MLLM's context preservation. However, they suffer from texture distortion and inconsistency in style. After de-tokenizer adaptation, the images show improved consistency in style and character appearance. The FID scores in Table 4 confirm that de-tokenizer adaptation enhances image quality.

jianzongwu commented 12 hours ago

Thank you very much for the quick reply. I want to ask another question. In your paper, you mention that you feed a story sequence of length 10 into the MLLM, but only supervise the last text-image pair. I think it may be possible to supervise every pair in the sequence, which could reduce the steps needed to converge. Can you tell me why you only train the MLLM on the last pair?

AndysonYs commented 11 hours ago


Hi! We actually train on sequences with a maximum length of 10. For example, during training we might sample a sequence of 6 text-image pairs, and we only compute the loss on the 6th text-image pair.
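
To make that supervision scheme concrete, here is a minimal sketch (not the actual SEED-Story training code) of how a label mask could restrict the next-token loss to the final text-image pair of a sampled sequence. The function name, the `pair_spans` bookkeeping, and all shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions labeled -100 are ignored by cross_entropy

def mask_all_but_last_pair(input_ids: torch.Tensor,
                           pair_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Build labels that supervise only the last text-image pair.

    input_ids:  (seq_len,) token ids of the interleaved story sequence.
    pair_spans: (start, end) index ranges, one per text-image pair, in story
                order (hypothetical bookkeeping kept outside the model).
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    last_start, last_end = pair_spans[-1]          # only the final pair
    labels[last_start:last_end] = input_ids[last_start:last_end]
    return labels

# Usage: a story sampled as 6 text-image pairs flattened into one sequence.
# Only the 6th pair's tokens receive supervision; earlier pairs act as context.
# (Vocabulary size, sequence length, and span layout are made up.)
input_ids = torch.randint(0, 32000, (600,))
pair_spans = [(i * 100, (i + 1) * 100) for i in range(6)]
labels = mask_all_but_last_pair(input_ids, pair_spans)

# Standard causal-LM loss: shift by one position, skip masked targets.
logits = torch.randn(600, 32000)                   # stand-in for MLLM outputs
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```

Under this setup the earlier pairs still condition the prediction through attention; they simply contribute no gradient themselves.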

jianzongwu commented 9 hours ago

Understood.

So why do you supervise only the last (e.g., 6th) text-image pair, and not all text-image pairs?