Sonettoo / CRS-Diff


Need clarification about training process and model architecture. #1

Closed · kirin-from-16 closed this issue 4 months ago

kirin-from-16 commented 6 months ago

As someone new to this area, I'd very much appreciate any insights you can share on the following:

Thanks in advance!

Sonettoo commented 6 months ago

Thank you for your questions. Here are the answers; please excuse any omissions:

A1: The model was trained from the SD 1.5 weights, and we restricted the input to 512×512 to match the parameters of the original SD model. We did not try other resolutions, since that would have required retraining some of the frozen components, such as the VAE encoder.
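For context, here is a minimal sketch (written against the diffusers API, not the authors' actual training code) of what initializing from SD 1.5 and fixing the resolution at 512×512 looks like; the model id and preprocessing pipeline are illustrative:

```python
# Hedged sketch: load SD 1.5 weights and force 512x512 inputs, since the
# frozen VAE encoder was trained at that resolution. Not CRS-Diff's code.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen VAE encoder from SD 1.5; supporting other resolutions would
# mean retraining frozen components like this one.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae").to(device)
vae.requires_grad_(False)

# UNet initialized from SD 1.5, later fine-tuned on RSICD (phase one).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet").to(device)

# All training images are resized/cropped to 512x512 before encoding.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1] as SD expects
])
```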

A2: Indeed, the results show that generation quality is better under the semantic segmentation map condition than under the other conditions, most likely because it carries richer semantic information. Performance also varies across different types of remote sensing images. If you are referring to training and evaluating on RSICD, you can follow the Txt2Img-MHN strategy, which uses the same evaluation protocol: train on the training set and evaluate on the test and validation sets.
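To make that split protocol concrete, here is a hedged sketch; the `RSICDDataset` class and the Karpathy-style JSON layout it assumes are illustrative and may differ from the actual RSICD release:

```python
# Hedged sketch of the evaluation protocol: train on the RSICD training
# split only, evaluate on the union of the test and validation splits.
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, ConcatDataset, DataLoader
from torchvision import transforms

TFM = transforms.Compose([transforms.Resize((512, 512)),
                          transforms.ToTensor()])

class RSICDDataset(Dataset):
    """Assumes a Karpathy-style dataset_rsicd.json with per-image
    'split', 'filename', and 'sentences' fields (layout may differ)."""
    def __init__(self, root, split):
        self.root = Path(root)
        meta = json.loads((self.root / "dataset_rsicd.json").read_text())
        self.items = [m for m in meta["images"] if m["split"] == split]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        item = self.items[i]
        image = Image.open(self.root / "images" / item["filename"]).convert("RGB")
        caption = item["sentences"][0]["raw"]
        return TFM(image), caption

train_loader = DataLoader(RSICDDataset("data/RSICD", "train"),
                          batch_size=16, shuffle=True)

# Metrics (e.g. FID) are computed on test + val, never seen in training.
eval_loader = DataLoader(
    ConcatDataset([RSICDDataset("data/RSICD", "test"),
                   RSICDDataset("data/RSICD", "val")]),
    batch_size=16, shuffle=False)
```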

A3: Your remaining questions seem to revolve around one main issue, so here is the explanation: the RSICD fine-tuned UNet is obtained during the first phase; in CRS-Diff it then serves both as the frozen SD block and as the trainable SD copy in the Local Control block during the second training phase. The training loss still uses the predictive-noise objective, which is essentially the same as in ControlNet. The statement "during the training process, only the denoising model is updated" refers specifically to the first phase.
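For clarity, here is a minimal sketch of that second-phase objective written with the standard diffusers ControlNet API; CRS-Diff's Local Control block is analogous but not identical, and the SD 1.5 id below stands in for the RSICD fine-tuned UNet:

```python
# Hedged sketch using diffusers' ControlNet API as a stand-in for the
# Local Control block. The frozen base UNet plays the role of the RSICD
# fine-tuned UNet from phase one; only the trainable copy is updated.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, ControlNetModel, DDPMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.requires_grad_(False)                    # frozen SD block

controlnet = ControlNetModel.from_unet(unet)  # trainable SD copy
scheduler = DDPMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

def training_step(latents, text_emb, cond_image):
    # latents: (B, 4, 64, 64) VAE latents; text_emb: (B, 77, 768) CLIP
    # embeddings; cond_image: (B, 3, 512, 512) control condition.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)

    # The trainable copy encodes the condition and emits residuals...
    down_res, mid_res = controlnet(
        noisy, t, encoder_hidden_states=text_emb,
        controlnet_cond=cond_image, return_dict=False)

    # ...which the frozen UNet consumes. The loss is the standard
    # predictive-noise (epsilon) objective, as in ControlNet.
    eps_pred = unet(
        noisy, t, encoder_hidden_states=text_emb,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res).sample
    return F.mse_loss(eps_pred, noise)
```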

A4: For the concatenation and integration of the caption's word embeddings, please refer to Global_adapter.py. In fact, we also tried simple concatenation, which still provided effective control.
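As a hedged illustration of that "simple concatenation" variant (the dimensions and module names below are mine, not those in Global_adapter.py):

```python
# Hedged sketch of injecting a global condition by simple concatenation
# with the caption's word embeddings; dimensions are illustrative.
import torch
import torch.nn as nn

class GlobalAdapter(nn.Module):
    def __init__(self, cond_dim=512, text_dim=768, n_tokens=4):
        super().__init__()
        # Project the global condition into n_tokens pseudo word embeddings.
        self.proj = nn.Linear(cond_dim, text_dim * n_tokens)
        self.n_tokens, self.text_dim = n_tokens, text_dim

    def forward(self, text_emb, global_cond):
        # text_emb:    (B, 77, 768) CLIP word embeddings of the caption
        # global_cond: (B, cond_dim) global condition embedding
        extra = self.proj(global_cond).view(-1, self.n_tokens, self.text_dim)
        # Concatenate along the token axis; the UNet's cross-attention
        # then attends over caption tokens and condition tokens jointly.
        return torch.cat([text_emb, extra], dim=1)

adapter = GlobalAdapter()
out = adapter(torch.randn(2, 77, 768), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 81, 768])
```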