Which of these parts has the greatest impact on the final performance?
1. 2D RoPE at each layer of the model;
2. Class-conditional image embedding;
3. Text-conditional image embedding;
4. CFG
Hi~ It is hard to say which part has the greatest impact, because every part is important to the final performance :) I can share our exploration process for each part:
- 2D RoPE: We first used 2D sincos PE and then changed to 2D RoPE. We found that 2D RoPE leads to faster convergence at the beginning of training than 2D sincos PE, but their final performance is basically the same (see the sketch after this list).
- Class-conditional embedding: We directly use the simplest implementation: the class-conditional embedding serves as the start token (also illustrated in the CFG sketch after this list).
- Text-conditional embedding: Similarly, we use the simplest implementation.
- CFG: This is very, very important to performance, both for the FID number and for visual quality (a sampling sketch follows this list).
- stage1 data filter: Honestly, we don't have enough resources to ablate the filter configuration, such as the CLIP similarity score threshold. We just set a desired number, for example about 50M, and filter the total set of images down to that number.
- stage2 prompt rewriting by LLaVA: The long captions generated by LLaVA help our models handle image generation conditioned on long text, versus the short captions of LAION-COCO. Two other factors are also important: (1) image aesthetic quality and (2) image resolution.
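For the 2D RoPE point, here is a minimal PyTorch sketch of one common 2D RoPE formulation (the helper names `rope_1d`/`rope_2d` are hypothetical, and the exact implementation in the repo may differ): split the head dimension in half, then apply standard 1D RoPE to one half using each token's row index on the 2D grid and to the other half using its column index.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Apply 1D rotary embedding. x: (..., n, d) with d even, pos: (n,)."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * freqs[None, :]   # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # rotate channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """x: (batch, heads, h*w, d); first d/2 channels encode rows, last d/2 columns."""
    d = x.shape[-1]
    rows = torch.arange(h).repeat_interleave(w)      # row index of each grid token
    cols = torch.arange(w).repeat(h)                 # column index of each grid token
    x_row, x_col = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat([rope_1d(x_row, rows), rope_1d(x_col, cols)], dim=-1)
```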
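And a hedged sketch of CFG sampling that also shows the "class embedding as start token" idea: the conditional branch starts from the class embedding, the unconditional branch from a learned null embedding, and the guided logits are formed by pushing the conditional logits away from the unconditional ones. The `model` interface here (takes embeddings, returns logits; exposes `tok_emb`) is an assumption for illustration, not the repo's actual API.

```python
import torch

@torch.no_grad()
def sample_with_cfg(model, class_emb, null_emb, steps: int, cfg_scale: float = 4.0):
    cond, uncond = class_emb, null_emb        # (1, 1, dim) start tokens
    tokens = []
    for _ in range(steps):
        logits_c = model(cond)[:, -1]         # conditional next-token logits
        logits_u = model(uncond)[:, -1]       # unconditional next-token logits
        # classifier-free guidance: extrapolate away from the unconditional branch
        logits = logits_u + cfg_scale * (logits_c - logits_u)
        next_tok = torch.multinomial(logits.softmax(-1), 1)   # (1, 1)
        tokens.append(next_tok)
        emb = model.tok_emb(next_tok)         # embed and append to both branches
        cond = torch.cat([cond, emb], dim=1)
        uncond = torch.cat([uncond, emb], dim=1)
    return torch.cat(tokens, dim=1)           # (1, steps) sampled image tokens
```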
Hope these can help you.
I am very surprised at how quickly you replied. @PeizeSun Regarding the threshold for selecting image aesthetic quality, did you determine it directly from the aesthetic scores in LAION-COCO?
In addition, is the image resolution not fixed in the tokenizer stage, as long as it matches the aspect ratio?
Also, why use a T5 model instead of a GPT-style end-to-end model from text tokens to image tokens?
Thank you~
Hi~ Training an end-to-end model of language and image requires much more training data and resources, which is beyond our capabilities. Sadly…
The LAION-COCO metadata already provides an aesthetic score, and we use it to filter out low-quality images in training stage 1 (a sketch of this kind of filtering follows below).
Training tokenizers can be resolution-variant, but in our experiments we use a fixed resolution for simplicity.
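A minimal sketch of a metadata-driven filter like the one described above: rank LAION-COCO entries by the aesthetic score already present in the metadata and keep the top-N (e.g. ~50M). The file path and column name here are assumptions, not the actual dataset schema.

```python
import pandas as pd

def filter_by_aesthetic(meta_path: str, target_n: int) -> pd.DataFrame:
    meta = pd.read_parquet(meta_path)                      # metadata shard (assumed parquet)
    meta = meta.sort_values("aesthetic", ascending=False)  # assumed score column name
    return meta.head(target_n)                             # keep the best-scoring images

# e.g. kept = filter_by_aesthetic("laion_coco_meta.parquet", 50_000_000)
```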