Closed — kirin-from-16 closed this issue 4 months ago
Thank you for your questions. Answers are below; please excuse any omissions:
A1: The model was trained from the weights of SD 1.5, and we restricted the input to 512x512 to match the original SD model's parameters. We did not try other resolutions, since that would have required retraining some of the frozen components, such as the VAE encoder.
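For context on why 512 matches SD 1.5: its VAE downsamples spatially by a factor of 8, so a 512x512 input yields the 64x64 latent grid that the pretrained UNet weights expect. A quick sanity check (the helper name here is ours, not from the repo):

```python
# SD 1.5's VAE downsamples each spatial dimension by a factor of 8, so a
# 512x512 image maps to the 64x64 latent grid the pretrained UNet expects.
def latent_size(image_size, vae_downsample=8):
    # Resolutions must be divisible by the downsampling factor.
    assert image_size % vae_downsample == 0
    return image_size // vae_downsample

assert latent_size(512) == 64
```

Other multiples of 8 are geometrically valid, but as noted above they would move the model away from the resolution the frozen components were trained at.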
A2: Indeed, the results show that generation quality is better under the semantic segmentation map control condition than under the other conditions, likely because it carries richer semantic information. Performance also varies across different types of remote sensing images. If you are referring to training and evaluating on RSICD, you can refer to the Txt2Img-mhx strategy, where the same evaluation protocol was used: training on the training set and evaluating on the test and validation sets.
A3: Your subsequent questions seem to revolve around one main issue, so here is the explanation. The RSICD-fine-tuned UNet is obtained during the first phase; in CRS-Diff it serves both as the frozen SD block and as the trainable SD copy of the Local Control block during the training phase. The training loss still uses the noise-prediction objective, which differs little from ControlNet's. The statement "During the training process, only the denoising model is updated" refers specifically to the first phase.
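The noise-prediction objective referred to here is the standard diffusion training loss: add noise to a clean latent at a random timestep and regress the noise with an MSE loss. A minimal numpy sketch under assumed shapes and a standard linear schedule (not the authors' code; `eps_pred_fn` stands in for the frozen-SD-plus-trainable-copy UNet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule over T steps, as in standard DDPM training.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def noise_prediction_loss(x0, eps_pred_fn):
    """Noise-prediction objective: MSE between true and predicted noise.

    x0          -- clean latent batch, shape (B, C, H, W)
    eps_pred_fn -- stand-in for the denoising UNet, called as eps_pred_fn(x_t, t)
    """
    B = x0.shape[0]
    t = rng.integers(0, T, size=B)                    # random timestep per sample
    eps = rng.standard_normal(x0.shape)               # target noise
    a = alphas_cumprod[t].reshape(B, 1, 1, 1)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # forward diffusion step
    return np.mean((eps - eps_pred_fn(x_t, t)) ** 2)  # E ||eps - eps_theta||^2

# With a trivial predictor the loss is roughly the variance of the noise;
# a perfect predictor would drive it to zero.
x0 = rng.standard_normal((2, 4, 8, 8))
loss = noise_prediction_loss(x0, lambda x_t, t: np.zeros_like(x_t))
```

In the ControlNet-style second phase, the same objective is used while gradients flow only through the trainable copy and its zero-initialized connections, not the frozen block.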
A4: For the concatenation and integration of the caption's word embedding, please refer to the Global_adapter.py file. In fact, we also tried simple concatenation, which still provided effective control.
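The "simple concatenation" variant mentioned above can be sketched in a few lines: append the global condition embedding to the caption's word embeddings along the token axis, so cross-attention simply sees extra tokens. The dimensions below are assumed for illustration (77 CLIP tokens of width 768); see Global_adapter.py for the actual integration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 77 CLIP word tokens of dim 768, plus one global
# condition token projected to the same dimension.
word_emb = rng.standard_normal((77, 768))  # caption word embeddings
cond_emb = rng.standard_normal((1, 768))   # global condition embedding

# Simple concatenation along the token axis: the UNet's cross-attention
# then attends over 78 tokens instead of 77.
context = np.concatenate([word_emb, cond_emb], axis=0)
assert context.shape == (78, 768)
```

The trade-off is that concatenation leaves the fusion entirely to cross-attention, whereas a learned integration module can mix the condition into every token.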
As someone new to this area, I'd greatly appreciate any insights you can share on the following:
What is the rationale behind resizing images to 512? Have you tried other resolutions?
Were any trade-offs considered in this decision? Do some classes yield better-quality generated samples than others?
Can you provide the class distribution used during training and evaluation? Are the text prompts used for training and evaluation the same?
In Section 3.1, you mention that "During the training process, only the denoising model is updated", which is inherited from SD 1.5. What is the relationship between that model and the CLIP encoder of the Global Control block and the trainable SD copy of the Local Control block?
How is the output of the FFN in Eq. 2 concatenated and integrated with the caption's word embedding described in Fig. 2?
In Section 4.2, two fine-tuning phases are mentioned, which I assume occur during the training process. Does that contradict the statement "During the training process, only the denoising model is updated"? Can you provide the loss functions used?
Where in the CRS-Diff architecture in Fig. 2 is the RSICD-fine-tuned UNet located?
Thanks in advance for any insights you can share!