Which face descriptor did you use?
I tried a few... we could run an average maybe? dlib, MTCNN, and RetinaFace are decent and pretty fast. InsightFace seems to be biased since you trained with it.
The metric is "1 - cos similarity"? In fact, I used a different InsightFace model (not the one used for training) to evaluate.
I tried both euclidean and 1-cos. The numbers are of course different but the result is more or less the same.
This is euc vs 1-cos. The final result doesn't change much.
Do you get vastly different results?
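For reference, a minimal sketch of the two distance measures on a pair of (hypothetical) embedding vectors, assuming NumPy; the embedding extraction itself depends on the descriptor library:

```python
import numpy as np

# Hypothetical embeddings extracted from the reference face and the generated face.
ref = np.random.rand(512)
gen = np.random.rand(512)

euclidean = np.linalg.norm(ref - gen)
cos_distance = 1.0 - np.dot(ref, gen) / (np.linalg.norm(ref) * np.linalg.norm(gen))

print(f"euclidean: {euclidean:.3f}  1-cos: {cos_distance:.3f}")
```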
FaceNet?
Yes, FaceNet. Again, I've tried a few options but the result seems more or less the same. FaceID Plus v2 at weight=2 is always at the top.
Interestingly, FaceID Plus followed by a second pass with PlusFace or FullFace is also very effective. That makes me think there are more combinations that we haven't explored.
You seem very interested, I'm glad about that. Please feel free to share your experience/ideas if you want.
Yes, I am very interested, because a good metric is important for developing a good model.
You are right, you can also try FaceID + FaceID Plus.
```python
thresholds = {
    "VGG-Face":   {"cosine": 0.40,  "euclidean": 0.60,   "euclidean_l2": 0.86},
    "Facenet":    {"cosine": 0.40,  "euclidean": 10,     "euclidean_l2": 0.80},
    "Facenet512": {"cosine": 0.30,  "euclidean": 23.56,  "euclidean_l2": 1.04},
    "ArcFace":    {"cosine": 0.68,  "euclidean": 4.15,   "euclidean_l2": 1.13},
    "Dlib":       {"cosine": 0.07,  "euclidean": 0.6,    "euclidean_l2": 0.4},
    "SFace":      {"cosine": 0.593, "euclidean": 10.734, "euclidean_l2": 1.055},
    "OpenFace":   {"cosine": 0.10,  "euclidean": 0.55,   "euclidean_l2": 0.55},
    "DeepFace":   {"cosine": 0.23,  "euclidean": 64,     "euclidean_l2": 0.64},
    "DeepID":     {"cosine": 0.015, "euclidean": 45,     "euclidean_l2": 0.17},
}
```
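For reference, a minimal sketch of how per-model thresholds like these are typically applied to a computed distance (the helper and the example distances below are hypothetical):

```python
# A distance below the per-model threshold counts as "same person"
# (lower distance = more similar).
def is_same_person(distance, model_name, metric):
    return distance <= thresholds[model_name][metric]

print(is_same_person(0.35, "Facenet", "cosine"))     # True  (0.35 <= 0.40)
print(is_same_person(0.35, "Facenet512", "cosine"))  # False (0.35 >  0.30)
```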
Is that the minimum threshold? You set it very high. Almost only FaceID alone performs that low, at least in my testing.
By the way, do you have any ideas or suggestions for improving the results? That would be very helpful to me.
Yes, from the deepface repo.
In fact, I found the face ID embedding is very powerful; I think I should find better training tricks.
I have tried FaceID Plus v2 + FaceID and it generally outperforms everything else.
Also tried FaceID Plus v2 at weight=2.5; some checkpoints react well to it, but in general it's not a big difference.
what do you think of this https://twitter.com/multimodalart/status/1742575121057841468 (multi image)
SDXL FaceID preview
In my benchmark, the cosine similarity is a little better than SD 1.5 FaceID.
I've seen people send multiple images trying to increase the likeness. I'm not convinced it actually works; there's a lot of bias in "face" recognition. I will run some tests, but honestly I think it's laziness. I was able to reach 0.27 likeness with a good combination of IPAdapter models at low resolution.
Combining two IPAdapter models is, I think, more effective than sending multiple images to the same model. But I'll run some tests.
PS: looking forward to the SDXL model!
@xiaohu2015 do you already have the code for SDXL? So I can update it and we are ready at launch :smile:
It is the same as SD 1.5 FaceID: face embedding + LoRA.
But I am not sure if the SDXL version is really better than the SD 1.5 version, because evaluation metrics are often unreliable.
Okay, I ran more tests: any combination of Plus v2 with any other model is definitely a winner.
These are all good:
The only other NOT v2 combination that seems to be working well is FaceIDPlus+FaceID.
I'll update the first post when I have more data
PS: I got a 0.26 today at low resolution! Looking forward to doing some high-resolution tests :smile:
I will update the SDXL model now; you can also test it.
@cubiq updated at https://huggingface.co/h94/IP-Adapter-FaceID#ip-adapter-faceid-sdxl
but you should convert the LoRA part
great thanks!
I just updated the first post with new info. Data for round 2 is here: https://docs.google.com/spreadsheets/d/1Mi2Pu9T3Hqz3Liq9Fdgs953fOD1f0mieBWUI6AN-kok/edit?usp=sharing
I'll check SDXL later :smile: and run dedicated tests on it too.
I just had a look at the key structure of the SDXL LoRA and it's a darn mess :smile: Do you have a conversion mapping @xiaohu2015?
https://github.com/cubiq/ComfyUI_IPAdapter_plus/issues/145#issuecomment-1865495779
I think we can refer to this. You can find a normal SDXL LoRA weight and load it, print its keys, and then you can get diff2ckpt for SDXL.
In a future version, the LoRA should not be needed.
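A minimal sketch of that inspection step, assuming the safetensors package and a hypothetical LoRA filename:

```python
from safetensors.torch import load_file

# Load an ordinary SDXL LoRA (hypothetical filename) and print its key layout,
# so the FaceID LoRA tensors can be renamed to match the same scheme.
state_dict = load_file("some_normal_sdxl_lora.safetensors")
for key in sorted(state_dict):
    print(key, tuple(state_dict[key].shape))
```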
The structure is pretty different and I couldn't find a relationship at first sight. I'll check better later; I'm a bit busy this week, but I might be able to work on it next Monday.
0.to_q_lora.down.weight
0.to_q_lora.up.weight
0.to_k_lora.down.weight
0.to_k_lora.up.weight
0.to_v_lora.down.weight
0.to_v_lora.up.weight
0.to_out_lora.down.weight
0.to_out_lora.up.weight
1.to_q_lora.down.weight
1.to_q_lora.up.weight
1.to_k_lora.down.weight
1.to_k_lora.up.weight
1.to_v_lora.down.weight
1.to_v_lora.up.weight
1.to_out_lora.down.weight
1.to_out_lora.up.weight
1.to_k_ip.weight
1.to_v_ip.weight
2.to_q_lora.down.weight
2.to_q_lora.up.weight
2.to_k_lora.down.weight
2.to_k_lora.up.weight
2.to_v_lora.down.weight
2.to_v_lora.up.weight
...
139.to_v_ip.weight
On SDXL
lora_unet_input_blocks_1_0_emb_layers_1.alpha
lora_unet_input_blocks_1_0_emb_layers_1.lora_down.weight
lora_unet_input_blocks_1_0_emb_layers_1.lora_up.weight
lora_unet_input_blocks_1_0_in_layers_2.alpha
lora_unet_input_blocks_1_0_in_layers_2.lora_down.weight
lora_unet_input_blocks_1_0_in_layers_2.lora_up.weight
lora_unet_input_blocks_1_0_out_layers_3.alpha
lora_unet_input_blocks_1_0_out_layers_3.lora_down.weight
lora_unet_input_blocks_1_0_out_layers_3.lora_up.weight
lora_unet_input_blocks_2_0_emb_layers_1.alpha
lora_unet_input_blocks_2_0_emb_layers_1.lora_down.weight
lora_unet_input_blocks_2_0_emb_layers_1.lora_up.weight
lora_unet_input_blocks_2_0_in_layers_2.alpha
lora_unet_input_blocks_2_0_in_layers_2.lora_down.weight
lora_unet_input_blocks_2_0_in_layers_2.lora_up.weight
lora_unet_input_blocks_2_0_out_layers_3.alpha
lora_unet_input_blocks_2_0_out_layers_3.lora_down.weight
lora_unet_input_blocks_2_0_out_layers_3.lora_up.weight
lora_unet_input_blocks_3_0_op.alpha
lora_unet_input_blocks_3_0_op.lora_down.weight
lora_unet_input_blocks_3_0_op.lora_up.weight
lora_unet_input_blocks_4_0_emb_layers_1.alpha
lora_unet_input_blocks_4_0_emb_layers_1.lora_down.weight
lora_unet_input_blocks_4_0_emb_layers_1.lora_up.weight
lora_unet_input_blocks_4_0_in_layers_2.alpha
lora_unet_input_blocks_4_0_in_layers_2.lora_down.weight
lora_unet_input_blocks_4_0_in_layers_2.lora_up.weight
lora_unet_input_blocks_4_0_out_layers_3.alpha
...
lora_unet_output_blocks_8_0_skip_connection.lora_up.weight
So it looks a little more complicated than that :smile:
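If it helps, here is a hedged sketch of how the numeric prefixes on the diffusers-style keys could be traced back to UNet module paths, assuming diffusers is installed and assuming the indices follow the `unet.attn_processors` ordering (which would still need verifying):

```python
from diffusers import UNet2DConditionModel

# Assumption: the "0.to_q_lora...", "1.to_k_ip..." prefixes follow the order of
# unet.attn_processors, so enumerating them maps each index to a module path.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
for idx, name in enumerate(unet.attn_processors.keys()):
    print(idx, name)
```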
@laksjdjf can you help?
OK, I will also upload a LoRA weight next week.
It seems to be working pretty well together with PlusFace, but the results are a bit random (either very good or very bad). I'll run some stats on that too.
reference image:
This is really great work! I heard that a lot of people complain about similarity for double-chin faces, big faces, wearing glasses, etc. Is there any test for these? Or some solution for these face shapes?
@cubiq https://huggingface.co/h94/IP-Adapter-FaceID/blob/main/ip-adapter-faceid_sdxl_lora.safetensors I use https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_sdxl_lora_to_webui.py to convert, it should work
Unfortunately that is not correct @xiaohu2015. I'll see if I can fix it in the coming days.
OK
The LoRA weight file has been updated: https://huggingface.co/h94/IP-Adapter-FaceID/blob/main/ip-adapter-faceid_sdxl_lora.safetensors. It should work.
Can confirm that it works now. Thanks!
Maybe give some examples? 😄
Input image:
Output, LoRA weight 1, FaceID weight 1:
Output with LoRA disabled, FaceID weight 1 (just to demonstrate that the LoRA works and has a big impact):
In general, results do not seem to be at the level of SD1.5 FaceID Plus, having tested maybe 20 different input images now. This example output (the first one, with the LoRA enabled) is better than the average output.
You should compare with SD 1.5 FaceID. In fact, the face consistency should be better than SD 1.5.
Editing because the SDXL LoRA wasn't set up properly in my workflow:
Some quick examples of what I've been getting: SDXL FaceID + SDXL Plus Face seems to work a little better than SD1.5 FaceID + SD1.5 Plus Face (both running with their respective LoRAs).
Input Image:
SD1.5 FaceID + SD1.5 Plus Face:
SDXL FaceID + SDXL Plus Face:
SDXL FaceID on its own:
And then for reference an SD1.5 FaceID Plus V2:
Here's another couple with a different model
Input Image:
SDXL FaceID:
SDXL FaceID + SDXL Plus Face:
Could you share your example workflow? And can you feed an already existing image into the workflow as the target?
Very simple workflow here - https://pastebin.com/9n66qNg9
You can use an existing image to do img2img as normal, or you could use inpainting and only inpaint the face; I don't have an inpainting workflow though.
My best results are with FaceID SDXL (with LoRA) and Plus Face.
Very nice! What checkpoint/LoRAs do you use? And what was your example prompt? I still don't get those kinds of convincing images with Juggernaut; they always look kind of "synthetic". I have a post in the issues tab with my idea/problem. Maybe you have a hint.
Thank you! Gonna try it tomorrow when I am back home. The urge to try it and find out is very strong…
The FaceID images are all Juggernaut XL 7, no LoRAs (except for the FaceID LoRA), as in the example workflow. Juggernaut XL 8 does not work as well; weights need to be a lot higher.
But Realism Engine SDXL v2 also worked well with FaceID. Version 3 just came out, so I haven't tried it yet. https://civitai.com/models/152525/realism-engine-sdxl
The negative prompt is the one used on civitai by the Juggernaut XL creator. https://civitai.com/images/2612019
The positive prompt is from a Midjourney 6 vs 5.2 comparison video (at 3:35): https://www.youtube.com/watch?v=Zl_4V0ks7CE
I think FaceID makes the SDXL results closer to v6 than to v5.2. FaceID also improves skin tone and texture, and gives more complexity/realism to the facial features.
Without FaceID, these are the best SDXL checkpoints for natural portraits: https://civitai.com/models/189109/photopedia-xl https://civitai.com/models/139565/realistic-stock-photo
As for Juggernaut XL, my favorite one is still version 5. It works well with this LoRA: https://civitai.com/models/170395/black-and-color
I've run preliminary benchmarks on SDXL. I've updated the original post.
Best checkpoints: Juggernaut XL, Realism Engine.
SDXL FaceID is better than SD1.5 FaceID: the average is 0.37 vs 0.41 for SD1.5.
@cubiq can you please describe the process of running these tests?
What do you need to know? The general process is explained in the "Methodology" paragraph above.
@cubiq I will release the first version of SDXL Plus v2, maybe you can do some comparisons.
I will! looking forward!
Model: https://huggingface.co/h94/IP-Adapter-FaceID/resolve/main/ip-adapter-faceid-plusv2_sdxl.bin
It is the same as FaceID Plus v2 for SD 1.5, but for SDXL.
Yes, I can already tell that it's a lot better. Top: FaceID; bottom: FaceID Plus V2, both with PlusFace added on top. I will launch some benchmarks later.
:warning: Preliminary Data :warning:
Face Models Comparison
I started collecting data about all the face models available for IPAdapter. I'm generating thousands of images and comparing them with a face descriptor model against the original reference image. A value of 0 means 100% the same person, 1.0 means completely different.
BIAS! Important: please read!
The comparison is meant just as an overall help in choosing the right models. They are just numbers, they do not represent the actual image quality let alone the artistic value.
The face descriptor can be skewed by many factors and a face that is actually very good could get a low score for a number of reasons (head position, a weird shadow, ...). Don't take the following data as gospel, you still need to experiment.
Additionally, the images are generated in a single pass of 30 steps. Better results could probably be achieved with a second pass and upscaling, but that would require a lot more time.
I think this data still has value to at least remove the worst offenders from your tests.
Round 1: Skimming the data
The first step is to find the best performing checkpoints and IPAdapter face models (and face model combinations). With that established, we can move to the second phase, which is running even more data concentrated on the best performers.
These are all the IPAdapter models that I've tested, in random order; best performers are bold and will go to the next round.
These are the checkpoints, in random order; best performers are :trophy: bold.
Dreamshaper will be excluded from the photo-realistic models, but I will run it again with other "illustration" style checkpoints.
The preliminary data is available in a google sheet: https://docs.google.com/spreadsheets/d/1NhOBZbSPmtBY9p52PRFsSYj76XDDc65QjcRIhb8vfIE/edit?usp=sharing
Round 2: Refining the data
In this phase I took the best performers from the previous round and ran more tests. Best results in bold.
Basically, more embeds, better results.
realisticVisionV51_v51VAE (NOT V6) is overall the best performer, but LifeLikeDiffusion often has the single best result; meaning that its average is not as good as Realistic Vision's, but sometimes you get that one result that is really good.
I tested both euclidean and 1-cosine and the results are surprisingly the same.
Since it seems that more embeddings give better results, I'll also try to send multiple images of the same person to each model. I don't think it will help, but I'm happy to be proven wrong.
The data for round 2 can be found here: https://docs.google.com/spreadsheets/d/1Mi2Pu9T3Hqz3Liq9Fdgs953fOD1f0mieBWUI6AN-kok/edit?usp=sharing
Preliminary SDXL
Combinations tested:
At the moment the best models seem to be:
Predictably, V2+PlusFace are again the best performers. The best average is still .36.
Interestingly TurboVision XL performs very well.
Data: https://docs.google.com/spreadsheets/d/1hjiGB-QnKRYXTS6zTAuacRUfYUodUAdL6vZWTG4HZyc/edit?usp=sharing
Round 3: Testing multiple reference images
Processing...
Round 4: Higher resolution
Upscaling SD1.5 512×512 images is not advisable if you want to keep the likeness as high as possible. Even using low denoise and a high IPAdapter weight, the base checkpoints are simply not good enough to keep the resemblance.
In my tests I lose about .5 likeness after every upscale.
Fortunately you can still upscale SD1.5 results with SDXL FaceID + PlusFace (I used Juggernaut, which is the best performer in the SDXL round). The results are very good. LifeLikeDiffusion and RealisticVision5 are still the best performers.
The average is still around 0.35 (not as low as I'd like), but sometimes you get very good results (0.27), so it's worth running a few seeds and trying different reference images.
Result data here: https://docs.google.com/spreadsheets/d/1uVWJOcDxaEjRks-Lz0DE9A3DCCFX2qsvdpKi3bCSE2c/edit?usp=sharing
Methodology
I tried many libraries for feature extraction/face detection. In the aggregated results I find that the difference is relatively small, so at the moment I'm using Dlib and euclidean distance. I'm trying to keep the generated images as close as possible in color/position/contrast to the original to minimize skew.
I tried 1-cosine and the results don't differ much from what is presented here, so I take it that the data is pretty solid. I will keep testing and update if there are any noticeable differences.
All primary embedding weights are set at .8, all secondary weights are set at .4.
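For reference, a minimal sketch of that scoring step under those assumptions, using Dlib descriptors and euclidean distance (the model files are the standard ones from Dlib's model zoo, and the image paths are hypothetical):

```python
import dlib
import numpy as np

# Dlib's stock models (downloaded separately from the Dlib model zoo).
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_descriptor(path):
    """Return the 128-d Dlib descriptor of the first face found, or None."""
    img = dlib.load_rgb_image(path)
    faces = detector(img, 1)
    if not faces:
        return None
    shape = shape_predictor(img, faces[0])
    return np.array(face_encoder.compute_face_descriptor(img, shape))

ref = face_descriptor("reference.png")        # hypothetical reference image
gen = face_descriptor("generated_00001.png")  # hypothetical generated image
score = np.linalg.norm(ref - gen)             # euclidean distance, lower = more similar
print(f"{score:.3f}")
```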