sipie800 opened this issue 1 month ago
Hi, I read your code roughly: the pipeline combines the ID embedding from insightface with an image embedding from EVA. I'm wondering if it's possible to use multi-image input in the pipeline without retraining the model. IMO even a suboptimal solution (training-free feature fusion, etc.) is worth pursuing, because FLUX possesses great capability: PuLID-FLUX already promises good results one-shot, and we may make it better few-shot.
If it's theoretically possible, where in the pipeline would you suggest injecting the multi-image embedding?
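For context, a rough sketch of the two embedding branches the question refers to. The helper names, image size, and feature shapes below are placeholders, not PuLID's actual API:

```python
import torch

def arcface_embedding(face: torch.Tensor) -> torch.Tensor:
    # Stand-in for the insightface/ArcFace branch: one 512-d ID vector per face.
    return torch.randn(face.size(0), 512)

def eva_vit_features(face: torch.Tensor) -> torch.Tensor:
    # Stand-in for the EVA-CLIP ViT branch: a sequence of patch features.
    return torch.randn(face.size(0), 257, 1024)

face = torch.randn(1, 3, 336, 336)   # one aligned face crop (illustrative size)
id_emb = arcface_embedding(face)     # (1, 512)
vit_feats = eva_vit_features(face)   # (1, 257, 1024)
# The encoder (IDFormer) consumes both and emits 32 ID tokens per image.
```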
Multi-image input is feasible. There are several methods for this:
- Fusion after the encoder: after each image passes through the encoder, it yields 32 tokens. You can fuse these tokens by concatenation or averaging (see the first sketch after this list).
- Fusion during the encoder stage: since PuLID-FLUX uses IDFormer, a transformer structure, as the encoder, you can feed the ViT features and ArcFace features of multiple images into the IDFormer together (second sketch below).
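A minimal sketch of the first option, assuming the encoder output is a `(B, 32, D)` tensor per reference image (`D = 2048` here is purely illustrative; the real hidden size depends on the checkpoint):

```python
import torch

def fuse_id_tokens(per_image_tokens: list[torch.Tensor], mode: str = "avg") -> torch.Tensor:
    """Fuse ID tokens from N images of the same identity.

    Each element of per_image_tokens has shape (B, 32, D).
    """
    if mode == "avg":
        # Averaging keeps the token count at 32, so the downstream
        # attention sees exactly the sequence length it was trained on.
        return torch.stack(per_image_tokens, dim=0).mean(dim=0)  # (B, 32, D)
    elif mode == "concat":
        # Concatenating along the token axis yields N*32 tokens; attention
        # accepts variable-length keys/values, but the model never saw
        # this length during training, so results may degrade.
        return torch.cat(per_image_tokens, dim=1)                # (B, N*32, D)
    raise ValueError(f"unknown mode: {mode}")

tokens = [torch.randn(1, 32, 2048) for _ in range(4)]  # 4 reference images
print(fuse_id_tokens(tokens, "avg").shape)     # torch.Size([1, 32, 2048])
print(fuse_id_tokens(tokens, "concat").shape)  # torch.Size([1, 128, 2048])
```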
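And a sketch of the second option. `IDFormerStub`, the feature shapes, and the single attention layer are all invented stand-ins for the real IDFormer; the point is only where the multi-image concatenation would happen:

```python
import torch
import torch.nn as nn

class IDFormerStub(nn.Module):
    # Stand-in for PuLID's IDFormer: learnable queries cross-attend over
    # whatever feature sequence they are given and emit 32 ID tokens.
    def __init__(self, dim: int = 1024, num_tokens: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)
        return out                                    # (B, 32, dim)

B, dim = 1, 1024
# Per-image feature sequence: ViT patch features plus a projected ArcFace vector.
per_image = [torch.cat([torch.randn(B, 257, dim),     # ViT features, image i
                        torch.randn(B, 1, dim)],      # ArcFace embedding, image i
                       dim=1) for _ in range(3)]
multi_image = torch.cat(per_image, dim=1)             # (B, 3*258, dim)
print(IDFormerStub(dim)(multi_image).shape)           # torch.Size([1, 32, 1024])
```

Because the queries are fixed at 32, the output shape stays the same no matter how many reference images are concatenated into the key/value sequence.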
In method 1, if I do concatenation, will it be along dim 0 (batch becomes N×B) or dim 1 (token count becomes N×32)?
And what are the semantics of the 32 tokens? Do they refer to semantic parts such as eyes and noses, or are they just hierarchical features from 32 ViT layers?
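For concreteness, here is what the two concatenation choices would do to the shapes (dims illustrative): dim 0 just enlarges the batch into N independent conditionings, while dim 1 keeps one sample but lengthens its ID-token sequence:

```python
import torch

N, B, D = 4, 1, 2048                      # 4 reference images, illustrative dims
toks = [torch.randn(B, 32, D) for _ in range(N)]

dim0 = torch.cat(toks, dim=0)             # (N*B, 32, D): N separate batch
print(dim0.shape)                         # entries, i.e. N separate generations
dim1 = torch.cat(toks, dim=1)             # (B, N*32, D): one conditioning with
print(dim1.shape)                         # a longer ID-token sequence
```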