ToTheBeginning / PuLID

[NeurIPS 2024] Official code for PuLID: Pure and Lightning ID Customization via Contrastive Alignment
Apache License 2.0

Can PuLID-FLUX use multi-image input? #109

Open sipie800 opened 1 month ago

sipie800 commented 1 month ago

Hi, I read through your code roughly, and the pipeline combines the ID embedding from InsightFace with an image embedding from EVA. I'm wondering whether it's possible to use multi-image input in the pipeline without retraining the model. IMO even a suboptimal solution (training-free feature fusion, etc.) is worth trying because FLUX is so capable. PuLID-FLUX already gives good results one-shot; we may be able to make it better few-shot.

If it's theoretically possible, could you suggest where in the pipeline is the best place to inject the multiple embeddings, or something along those lines?

ToTheBeginning commented 4 weeks ago

Multi-image input is feasible. There are several methods for this:

  1. Fusion after the encoder. Each image yields 32 tokens after passing through the encoder; you can fuse these tokens by concatenation or averaging.

  2. Fusion during the encoder stage. PuLID-FLUX uses the IDFormer, a transformer structure, as its encoder, so you can feed the ViT features and ArcFace features of multiple images into the IDFormer.
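
As a rough illustration of method 1, here is a minimal sketch of the two fusion options applied to per-image ID tokens. It is not taken from the PuLID code: `fuse_id_tokens` is a hypothetical helper, the `(B, 32, D)` layout and the small `D = 8` are assumptions made for the demo, and concatenating along the token axis is only one possible reading of "concatenation". NumPy is used to keep the example dependency-light; `torch.cat` and `torch.stack(...).mean(dim=0)` would be the analogous calls on tensors.

```python
import numpy as np

def fuse_id_tokens(per_image_tokens, mode="concat"):
    """Fuse a list of per-image ID token arrays, each assumed to be (B, 32, D).

    Hypothetical helper for illustration only; not part of the PuLID repo.
    """
    if mode == "concat":
        # Join along the token axis: N images -> (B, N * 32, D).
        return np.concatenate(per_image_tokens, axis=1)
    if mode == "mean":
        # Element-wise average across images: shape stays (B, 32, D).
        return np.mean(np.stack(per_image_tokens, axis=0), axis=0)
    raise ValueError(f"unknown fusion mode: {mode!r}")

# Random stand-ins for encoder outputs of N reference images.
rng = np.random.default_rng(0)
B, T, D, N = 1, 32, 8, 3  # D is an arbitrary small value for the demo
tokens = [rng.standard_normal((B, T, D)).astype(np.float32) for _ in range(N)]

fused_concat = fuse_id_tokens(tokens, mode="concat")
fused_mean = fuse_id_tokens(tokens, mode="mean")

print(fused_concat.shape)  # (1, 96, 8)
print(fused_mean.shape)    # (1, 32, 8)
```

Averaging keeps the same 32-token length that a single image produces, while concatenation makes the token sequence N times longer, so whichever module consumes the tokens has to accept a variable sequence length.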

sipie800 commented 4 weeks ago

In method 1, if I concatenate, should it be along dim 0 (batch size becomes N×B) or dim 1 (token count becomes N×32)?

And what are the semantics of the 32 tokens? Do they refer to facial parts such as the eyes and nose, or are they just hierarchical features from 32 ViT layers?