fudan-zvg / meta-prompts


Some questions on the paper #6

Open Darkbblue opened 6 months ago

Darkbblue commented 6 months ago
  1. Why do the features and the meta prompts share the same dimension D? The meta prompts must match the dimension of ordinary text-encoder embeddings, but the dimension of the features is arbitrary, since the blocks used for feature extraction can be chosen freely.
  2. Can you give an interpretation of the step-by-step refinement? Is there any physical explanation of what happens when we feed a feature map into the UNet, which is designed to process image latents compressed by the VAE?
wwqq commented 6 months ago

Q1: We feed the features generated by the UNet through a 1x1 convolution to adjust them to the same dimension as the meta prompts.

Q2: The step-by-step refinement is a recurrent refinement training strategy: the initial output of the UNet is fed back into the same UNet for multiple loops. Physically, this can be understood as an iterative enhancement of the feature representation. Each loop refines and enriches the feature maps, letting the model capture more nuanced and complex patterns in the data, so the representation ultimately used for the visual perception tasks becomes progressively more detailed and accurate.
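To make both points concrete, here is a minimal, self-contained PyTorch sketch. It is not the repository's code: `TinyUNet`, the channel and prompt dimensions, and the module names are placeholders for illustration.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Stand-in for the diffusion UNet, only to keep the sketch runnable.
    It maps a feature map to a feature map of the same shape."""

    def __init__(self, channels=320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class MetaPromptSketch(nn.Module):
    """Illustrative only: dimensions and names are assumptions, not the paper's API."""

    def __init__(self, feat_channels=320, prompt_dim=768, num_prompts=64, num_loops=3):
        super().__init__()
        self.unet = TinyUNet(feat_channels)
        self.num_loops = num_loops
        # Q1: a 1x1 convolution aligns the UNet feature channels with the
        # meta-prompt embedding dimension so the two share the same D.
        self.proj = nn.Conv2d(feat_channels, prompt_dim, kernel_size=1)
        # Learnable meta prompts standing in for text-encoder embeddings.
        self.meta_prompts = nn.Parameter(torch.randn(num_prompts, prompt_dim))

    def forward(self, feats):
        # Q2: recurrent refinement -- the output of the (shared) UNet is fed
        # back into the same UNet for several loops, progressively enriching
        # the feature representation.
        for _ in range(self.num_loops):
            feats = self.unet(feats)
        # Project the refined features into the meta-prompt dimension.
        tokens = self.proj(feats)                   # (B, prompt_dim, H, W)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, H*W, prompt_dim)
        # Features and prompts now share dimension D, so they can interact,
        # e.g. via attention between feature tokens and meta prompts.
        sim = tokens @ self.meta_prompts.t()        # (B, H*W, num_prompts)
        return tokens, sim


# Quick shape check
model = MetaPromptSketch()
feats = torch.randn(2, 320, 32, 32)
tokens, sim = model(feats)
print(tokens.shape, sim.shape)  # (2, 1024, 768) and (2, 1024, 64)
```

The 1x1 convolution only changes the channel dimension, which is why features of arbitrary width can be matched to the prompt dimension, and the refinement loop simply reuses the same network on its own output.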

Darkbblue commented 6 months ago

Thanks for the reply! Now I understand the first question, but I'm still not sure about the second one. I mean, the UNet is trained to process compressed images. Should it also be able to process features? Does that mean the features contain meaningful visual structure, making them somehow similar to a compressed image?

wwqq commented 6 months ago

Yes. The features are extracted to capture the critical information needed to reconstruct or understand the original image content. They contain meaningful visual structures such as edges, textures, colors, and more abstract patterns in the data, so they can be seen as akin to a compressed form of the image.
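If you want to check this yourself, a rough sketch for pulling intermediate feature maps out of a Stable Diffusion UNet with forward hooks is below. The checkpoint id, the zeroed text conditioning, and the choice of decoder blocks are placeholders for illustration, not the paper's exact setup.

```python
import torch
from diffusers import UNet2DConditionModel

# Substitute whichever Stable Diffusion checkpoint you actually use.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.eval()

features = {}

def save_output(name):
    def hook(module, inputs, output):
        # Some blocks return tuples; keep only the hidden-states tensor.
        features[name] = output[0] if isinstance(output, tuple) else output
    return hook

# Hook the decoder (up) blocks, whose multi-scale outputs are the kind of
# features typically handed to a perception head.
for i, block in enumerate(unet.up_blocks):
    block.register_forward_hook(save_output(f"up_{i}"))

with torch.no_grad():
    latents = torch.randn(1, 4, 64, 64)    # VAE-compressed "image" latent
    text_emb = torch.zeros(1, 77, 768)     # placeholder conditioning
    unet(latents, timestep=0, encoder_hidden_states=text_emb)

for name, feat in features.items():
    print(name, tuple(feat.shape))  # decoder feature shapes, coarse to fine
```

Plotting channel slices of these maps usually shows edge- and region-like structure, which is the sense in which the features behave like a compressed form of the image.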