JackAILab / ConsistentID

Customized ID Consistent for human
MIT License

key_parsing_mask_list_align: why is only one face part being used? #30

Closed: trofimovaolga closed this issue 1 month ago

trofimovaolga commented 1 month ago

Hi, I'm trying to understand the code and I'd really appreciate it if someone could help me understand how the face parts are used. You use 5 parts: face, eyes, ears, nose, and lips (mouth). Inside the get_prepare_facemask method you obtain key_parsing_mask_list, a list of mask images for each face part, but it contains only one eye out of two (right or left), one lip out of two, and one ear out of two. Could you please help me understand why? Further on in the code these parts are used as inputs to the CLIPEncoder, and I wonder why not all parts are used.
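For context, here is roughly how I'm inspecting the list (a minimal sketch; get_prepare_facemask and key_parsing_mask_list follow the repo, everything else is illustrative, and I'm assuming the returned list behaves like a mapping of part name to mask):

```python
# Minimal sketch of how I inspect the per-part masks ("pipeline" stands
# in for the loaded ConsistentID pipeline; only get_prepare_facemask is
# the repo's own API, the rest is illustrative).
from PIL import Image

def inspect_face_parts(pipeline, image_path: str):
    image = Image.open(image_path).convert("RGB")
    key_parsing_mask_list, _ = pipeline.get_prepare_facemask(image)
    for part_name in key_parsing_mask_list:
        print(part_name)
    # I see e.g. Face, Right_Ear, Right_Eye, Nose, Upper_Lip --
    # one entry per paired part, which is what I'm asking about.
```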

JackAILab commented 1 month ago

This is because, most of the time, LLaVA does not distinguish between paired facial features when describing a face (for example, its descriptions of the left and right ears are identical). Therefore, we merge the descriptions of paired facial features and retain only one mask from each pair.
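A minimal sketch of the dedup idea (the real logic lives in get_prepare_facemask; the part names below are the face-parsing labels, while keep_one_per_pair itself is made up for this example):

```python
# Illustrative sketch: keep only one mask per paired facial part.
KEY_LIST = ["Face", "Left_Ear", "Right_Ear", "Left_Eye", "Right_Eye",
            "Nose", "Upper_Lip", "Lower_Lip"]

def keep_one_per_pair(parsing_masks: dict) -> dict:
    kept, seen_bases = {}, set()
    for name, mask in parsing_masks.items():
        if name not in KEY_LIST:
            continue
        # "Left_Eye"/"Right_Eye" share the base part "Eye", and
        # "Upper_Lip"/"Lower_Lip" share "Lip": keep the first one seen.
        base = name.split("_")[-1] if "_" in name else name
        if base not in seen_bases:
            seen_bases.add(base)
            kept[name] = mask
    return kept
```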

In addition, given the large amount of training data, not every facial feature can be extracted for every image. If all facial feature components were treated as inputs to the CLIPEncoder, there would be a large number of redundant None values. Therefore, we send only the retained facial feature images to the CLIPEncoder.
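A sketch of that filtering step (hedged: crop_part and clip_image_encoder are stand-ins for the repo's cropping code and CLIP image encoder, not its actual functions):

```python
# Illustrative: encode only the parts that were actually parsed, so no
# placeholder None values ever reach the CLIP image encoder.
def encode_retained_parts(image, key_parsing_mask_list, crop_part, clip_image_encoder):
    crops = [crop_part(image, mask)                  # hypothetical cropping helper
             for mask in key_parsing_mask_list.values()
             if mask is not None]                    # skip parts the parser missed
    return [clip_image_encoder(crop) for crop in crops]
```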

Of course, you can also try injecting more facial features into the model; ID consistency might improve further. If you have any further ideas, please feel free to ask questions and submit a PR.

trofimovaolga commented 1 month ago

I tried skipping this whole segmentation step during inference and used only the text-encoding features, without the FacialEncoder. The results are pretty good; it's hard to say whether they are better or worse, I'd say they are about the same.

[image: comparison of results]
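Roughly what I did, as a sketch (the attribute names follow standard diffusers pipelines, not necessarily the exact ConsistentID API):

```python
# Sketch of my ablation: condition only on the text embeddings and skip
# the FacialEncoder fusion step entirely.
import torch

@torch.no_grad()
def encode_prompt_text_only(pipeline, prompt: str):
    tokens = pipeline.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipeline.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    prompt_embeds = pipeline.text_encoder(tokens.input_ids.to(pipeline.device))[0]
    # The full pipeline would now splice the facial-part CLIP embeddings
    # into prompt_embeds via the FacialEncoder; I skipped that and passed
    # the plain text embeddings to the denoising loop instead.
    return prompt_embeds
```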

JackAILab commented 1 month ago

Your observation is accurate. In fact, both FaceID and the FacialEncoder can maintain ID consistency on their own, and using them together ensures more stable ID consistency. For more details, you can refer to the ablation experiments in the paper.

Specifically, our model introduces two types of ID-consistency information. The first is FaceID, which maintains the global consistency of the ID structure; however, FaceID extraction can be unstable. The second is the FacialEncoder, which enhances the consistency of fine-grained facial features. According to experimental results on a large-scale dataset, the FacialEncoder contributes more consistently to ID preservation. This means that even when FaceID is removed, the model can still achieve relatively stable ID consistency with the help of the FacialEncoder.
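Conceptually, the two branches are combined along the conditioning sequence. This is a hedged sketch of that idea, not the repo's actual modules; all names here are illustrative:

```python
# Conceptual sketch of fusing the two ID branches into one conditioning
# sequence; every name here is illustrative, not the repo's API.
from typing import Optional
import torch

def build_id_conditioning(prompt_embeds: torch.Tensor,       # (B, L, D) text tokens
                          facial_part_embeds: torch.Tensor,  # (B, P, D) per-part features
                          faceid_embed: Optional[torch.Tensor] = None,  # (B, D) global FaceID
                          ) -> torch.Tensor:
    # FacialEncoder branch: fine-grained per-part features that keep
    # facial details consistent; extracted reliably from the parsed crops.
    cond = torch.cat([prompt_embeds, facial_part_embeds], dim=1)
    # FaceID branch: one global embedding of the ID structure; its
    # extraction can fail (e.g. no face detected), hence the None case.
    if faceid_embed is not None:
        cond = torch.cat([cond, faceid_embed.unsqueeze(1)], dim=1)
    return cond
```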

If you have any further thoughts, feel free to share and submit a PR.