ngsitrong26 closed this issue 5 months ago
@JackAILab I am also concerned about this issue. You have text embeddings and image embeddings. Do you use a single cross-attention, or a decoupled one?
First, the hidden states coming out of self-attention are used for cross-attention with the text embeddings. Then, the same hidden states are used for a second cross-attention with the image embeds. Finally, the results of the two attentions are added together, so there are two decoupled cross-attention operations.
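For reference, here is a minimal PyTorch sketch of that decoupled structure. The names `to_k_ip`/`to_v_ip` follow the IP-Adapter convention and are illustrative only; this is not the exact ConsistentID code.

```python
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttentionSketch(nn.Module):
    """Illustrative decoupled cross-attention: one attention over text tokens,
    a second attention over image tokens, and the two results summed."""

    def __init__(self, hidden_size, cross_attention_dim, num_heads=8, scale=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = scale  # weight of the image branch, as in IP-Adapter
        self.to_q = nn.Linear(hidden_size, hidden_size, bias=False)
        # text-branch projections
        self.to_k = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        self.to_v = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        # separate (decoupled) projections for the image embeds
        self.to_k_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        self.to_v_ip = nn.Linear(cross_attention_dim, hidden_size, bias=False)
        self.to_out = nn.Linear(hidden_size, hidden_size)

    def _attn(self, q, k, v):
        b, n, d = q.shape
        h = self.num_heads
        q, k, v = (x.reshape(b, -1, h, d // h).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, hidden_states, text_embeds, image_embeds):
        q = self.to_q(hidden_states)
        # 1) cross-attention between the hidden states and the text embeds
        text_out = self._attn(q, self.to_k(text_embeds), self.to_v(text_embeds))
        # 2) cross-attention between the same hidden states and the image embeds
        image_out = self._attn(q, self.to_k_ip(image_embeds), self.to_v_ip(image_embeds))
        # 3) add the two attention results together
        return self.to_out(text_out + self.scale * image_out)
```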
So it may be the same as IP-Adapter, right?
Hi @vuongminh1907, yes, ConsistentID uses the same decoupled attention structure as IP-Adapter.
@ngsitrong26, the inference demo also uses decoupled cross-attention. Consistent_AttProcessor defines the attention scores over the multimodal text, which is the important part: the attention score matrix of the multimodal text needs to be returned from Consistent_AttProcessor (attention_L157).
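To make that concrete, below is a hypothetical sketch of a processor that also hands back the text attention score matrix, written against the standard diffusers `Attention` API. The actual `Consistent_AttProcessor` signature, and how the pipeline consumes the returned scores, may differ.

```python
import torch


class AttnProcessorWithScores:
    """Sketch: vanilla attention that additionally returns the attention
    score matrix over the (multimodal) text tokens."""

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        batch_size, sequence_length, _ = (
            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
        )
        if attention_mask is not None:
            attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)

        # encoder_hidden_states carries the multimodal text embeddings for cross-attention
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        # attention score matrix over the multimodal text tokens
        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)  # linear projection
        hidden_states = attn.to_out[1](hidden_states)  # dropout

        # return the scores as well, so downstream modules can reuse them
        return hidden_states, attention_probs
```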
@JackAILab If I'm not using the ID-Preservation network functionality, can I directly use the Fine-grained Multimodal text prompts in the IP-Adapter model without redefining the attention mechanism in it?
@JackAILab I went deeper into the source code and found that only Consistent_IPAttProcessor handles decoupled cross-attention, while Consistent_AttProcessor does not. So, do you use Consistent_IPAttProcessor or Consistent_AttProcessor? I also see the decoupled attention weights in the checkpoint.
@JackAILab In Consistent_AttProcessor the LoRA parameters are updated, but in Consistent_IPAttProcessor the LoRA weights are deliberately frozen. Could you explain the reason and purpose behind this decision?

```python
for module in [self.to_q_lora, self.to_k_lora, self.to_v_lora, self.to_out_lora, self.to_k_ip, self.to_v_ip]:
    for param in module.parameters():
        param.requires_grad = False
```
@ngsitrong26 hi, sorry for the late reply, I've been busy with some other projects recently.
Both Consistent_AttProcessor and Consistent_IPAttProcessor are used. You can check the model weights with convert_weights.py: the weights whose names start with odd numbers come from Consistent_AttProcessor, and the weights whose names start with even numbers come from Consistent_IPAttProcessor.
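If you want to verify this yourself, here is a small sketch that groups the saved processor weights by the parity of their leading index. The checkpoint path is a placeholder, and the exact key layout depends on how convert_weights.py names things.

```python
import torch

# Hypothetical path; point this at your converted ConsistentID attention weights.
state_dict = torch.load("consistentid_attention_weights.bin", map_location="cpu")

odd_keys, even_keys = [], []
for key in state_dict:
    prefix = key.split(".")[0]  # keys are assumed to start with a numeric index
    if prefix.isdigit():
        (odd_keys if int(prefix) % 2 == 1 else even_keys).append(key)

# Per the comment above: odd -> Consistent_AttProcessor, even -> Consistent_IPAttProcessor
print(f"odd-indexed keys  (Consistent_AttProcessor):   {len(odd_keys)}")
print(f"even-indexed keys (Consistent_IPAttProcessor): {len(even_keys)}")
```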
This is how it is set up during training; refer to train.py. If you have any questions or ideas, please feel free to raise them or open a PR.
@gaoyixuan111 Great observation, this was just for debugging purposes. We have updated attention.py.
```python
if cross_attention_dim is None:
    # self-attention (attn1) layers have no encoder context
    attn_procs[name] = Consistent_AttProcessor(
        hidden_size=hidden_size,
        cross_attention_dim=cross_attention_dim,
        rank=self.lora_rank,
    ).to(self.device, dtype=self.torch_dtype)
else:
    # cross-attention (attn2) layers get the decoupled text/image attention
    attn_procs[name] = Consistent_IPAttProcessor(
        hidden_size=hidden_size,
        cross_attention_dim=cross_attention_dim,
        scale=1.0,
        rank=self.lora_rank,
        num_tokens=self.num_tokens,
    ).to(self.device, dtype=self.torch_dtype)
```
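For context, this snippet normally sits inside the usual IP-Adapter-style loop over `unet.attn_processors`. Below is a hedged sketch of how `hidden_size` and `cross_attention_dim` are typically derived; the model id and details are assumptions, not copied from the repo.

```python
from diffusers import UNet2DConditionModel

# assumes a standard SD 1.5 UNet; swap in whatever base model you actually use
unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet"
)

for name in unet.attn_processors.keys():
    # attn1 = self-attention (no text/image context), attn2 = cross-attention
    cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
    if name.startswith("mid_block"):
        hidden_size = unet.config.block_out_channels[-1]
    elif name.startswith("up_blocks"):
        block_id = int(name[len("up_blocks.")])
        hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
    else:  # down_blocks
        block_id = int(name[len("down_blocks.")])
        hidden_size = unet.config.block_out_channels[block_id]
    # cross_attention_dim is None     -> Consistent_AttProcessor (self-attention layers)
    # cross_attention_dim is not None -> Consistent_IPAttProcessor (decoupled cross-attention layers)
    print(name, hidden_size, cross_attention_dim)
```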