JackAILab / ConsistentID

Customized ID Consistent for human
MIT License

Does the checkpoint on huggingface use decoupled cross attention? #41

Closed ngsitrong26 closed 5 months ago

ngsitrong26 commented 5 months ago

```python
if cross_attention_dim is None:
    attn_procs[name] = Consistent_AttProcessor(
        hidden_size=hidden_size,
        cross_attention_dim=cross_attention_dim,
        rank=self.lora_rank,
    ).to(self.device, dtype=self.torch_dtype)
else:
    attn_procs[name] = Consistent_IPAttProcessor(
        hidden_size=hidden_size,
        cross_attention_dim=cross_attention_dim,
        scale=1.0,
        rank=self.lora_rank,
        num_tokens=self.num_tokens,
    ).to(self.device, dtype=self.torch_dtype)
```
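For context: in the usual diffusers / IP-Adapter registration loop that a snippet like this typically sits in, `cross_attention_dim` is `None` exactly for the self-attention (`attn1`) layers, so `Consistent_AttProcessor` lands on self-attention and `Consistent_IPAttProcessor` on the text/image cross-attention (`attn2`) layers. A quick illustrative way to see which layer would get which processor (this is a sketch, not ConsistentID's code; the SD1.5 model id is an assumption, swap in the base model you actually use):

```python
import torch
from diffusers import UNet2DConditionModel

# Load a plain SD1.5 UNet just to inspect the attention-layer names.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
)

for name in unet.attn_processors.keys():
    # attn1.* are self-attention layers  -> cross_attention_dim is None
    # attn2.* are cross-attention layers -> would get the decoupled processor
    cross_attention_dim = (
        None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
    )
    kind = (
        "Consistent_AttProcessor (self-attention)"
        if cross_attention_dim is None
        else "Consistent_IPAttProcessor (decoupled cross-attention)"
    )
    print(f"{name}: {kind}")
```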

vuongminh1907 commented 5 months ago

@JackAILab I am also concerned about this issue. You have both a text embedding and an image embedding. Do you use a single cross-attention or decoupled cross-attention?

gaoyixuan111 commented 5 months ago

First, the hidden states after self-attention are used with the text for cross-attention. Then, the same hidden states are used with image embeds for cross-attention calculation. Finally, the results of the two attentions are added together, so there are two decoupled cross-attention operations.
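Concretely, the idea looks roughly like this. Below is a minimal PyTorch sketch of decoupled cross-attention, not ConsistentID's actual code; the class name, projection names, and single-head layout are simplifications/assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Sketch: one query projection, separate K/V for text and image tokens."""
    def __init__(self, hidden_size, cross_dim, scale=1.0):
        super().__init__()
        self.scale = scale
        self.to_q = nn.Linear(hidden_size, hidden_size, bias=False)
        # K/V for the (multimodal) text embeddings
        self.to_k = nn.Linear(cross_dim, hidden_size, bias=False)
        self.to_v = nn.Linear(cross_dim, hidden_size, bias=False)
        # Separate K/V for the image (ID) embeddings, as in IP-Adapter
        self.to_k_ip = nn.Linear(cross_dim, hidden_size, bias=False)
        self.to_v_ip = nn.Linear(cross_dim, hidden_size, bias=False)

    def forward(self, hidden_states, text_embeds, image_embeds):
        q = self.to_q(hidden_states)
        # First cross-attention: hidden states attend to the text tokens
        text_out = F.scaled_dot_product_attention(
            q, self.to_k(text_embeds), self.to_v(text_embeds)
        )
        # Second cross-attention: the same queries attend to the image/ID tokens
        image_out = F.scaled_dot_product_attention(
            q, self.to_k_ip(image_embeds), self.to_v_ip(image_embeds)
        )
        # The two results are added together ("decoupled" cross-attention)
        return text_out + self.scale * image_out

# Tiny usage example with made-up shapes:
attn = DecoupledCrossAttention(hidden_size=320, cross_dim=768)
out = attn(
    torch.randn(1, 64, 320),   # hidden states after self-attention
    torch.randn(1, 77, 768),   # text (multimodal prompt) embeddings
    torch.randn(1, 4, 768),    # image / ID embeddings
)
```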

vuongminh1907 commented 5 months ago

So it may be the same as IP-Adapter, right?

JackAILab commented 5 months ago

Hi @vuongminh1907, yes, ConsistentID uses the same attention-decoupling structure as IP-Adapter.

@ngsitrong26, the inference demo also uses decoupled cross-attention. Consistent_AttProcessor computes the attention scores for the multimodal text, and that part is important: the multimodal-text attention score matrix needs to be returned from Consistent_AttProcessor (see attention.py, L157).
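To illustrate what "returning the attention score matrix" means, here is a rough sketch (not the repo's code at attention.py L157; the function name and shapes are assumptions):

```python
import torch

def attention_with_scores(q, k, v):
    # q: (B, heads, Nq, d); k, v: (B, heads, Nk, d)
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    out = scores @ v
    # Returning the score matrix alongside the output is what allows other
    # parts of the pipeline to consume the multimodal-text attention map.
    return out, scores
```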

gaoyixuan111 commented 5 months ago

@JackAILab If I'm not using the ID-Preservation network functionality, can I directly use the fine-grained multimodal text prompts in the IP-Adapter model without redefining its attention mechanism?

ngsitrong26 commented 5 months ago

@JackAILab I went deeper into the source code and found that only Consistent_IPAttProcessor handles decoupled cross-attention, while Consistent_AttProcessor does not. So, do you use Consistent_IPAttProcessor or Consistent_AttProcessor? I also see the decoupled cross-attention weights in the checkpoint.

gaoyixuan111 commented 5 months ago

@JackAILab In class Consistent_AttProcessor, the LoRA parameters are updated. However, in class Consistent_IPAttProcessor, the LoRA weights are frozen. Could you explain the reason and purpose behind this decision?

```python
for module in [self.to_q_lora, self.to_k_lora, self.to_v_lora, self.to_out_lora, self.to_k_ip, self.to_v_ip]:
    for param in module.parameters():
        param.requires_grad = False
```

JackAILab commented 5 months ago

@ngsitrong26 Hi, sorry for the late reply; I've been busy with some other projects recently.

Both the Consistent_AttProcessor and Consistent_IPAttProcessor attention processors are used. You can check the model weights with convert_weights.py: the weights whose keys start with odd numbers come from Consistent_AttProcessor, and the weights whose keys start with even numbers come from Consistent_IPAttProcessor.
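For anyone who wants to verify this, here is a rough way to group the checkpoint keys by module index (the checkpoint path and the flat `<index>.<param>` key layout are assumptions; adjust to what convert_weights.py actually produces):

```python
import torch
from collections import defaultdict

# Path is a placeholder; point it at the converted ConsistentID checkpoint.
state_dict = torch.load("path/to/consistentid_checkpoint.bin", map_location="cpu")

groups = defaultdict(list)
for key in state_dict.keys():
    prefix = key.split(".")[0]          # e.g. "1" in "1.to_k_ip.weight"
    if prefix.isdigit():
        groups["odd" if int(prefix) % 2 else "even"].append(key)

# Per the comment above: odd-indexed modules come from Consistent_AttProcessor,
# even-indexed modules from Consistent_IPAttProcessor.
print(len(groups["odd"]), "odd-indexed keys;", len(groups["even"]), "even-indexed keys")
```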

This is determined by how training is set up; refer to train.py. If you have any questions or ideas, please feel free to raise them or open a PR.

JackAILab commented 5 months ago

@gaoyixuan111 Great observation, this was just for debugging purposes. We have updated attention.py.