Closed IntendedConsequence closed 9 months ago
this is smart but gradually lowers the resolution of the face. It's better to return an error and let the user find a better picture.
@cubiq
> this is smart but gradually lowers the resolution of the face. It's better to return an error and let the user find a better picture.
It does lower the resolution of the face, but only for the face detection model. The output of the face detection model is just the location of the face in the image: the bounding box and the landmark points. This data is then used to crop and align the detected face in the original image, which is then sent to another model, arcface, for recognition. And arcface always downsizes face crops to a fixed small resolution (112x112 or 128x128 pixels, depending on the model) before producing an embedding.
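To illustrate the point (a toy numpy sketch, not insightface code): the detector can run on a downscaled copy of the image, while the crop fed to recognition is taken from the original, full-resolution image by rescaling the detected coordinates.

```python
import numpy as np

# Toy illustration (not insightface code): detection runs at low resolution,
# but the crop is taken from the ORIGINAL image, so the detector's input size
# does not directly cap the detail available to the recognition model.
orig = np.zeros((1000, 1000, 3), dtype=np.uint8)   # stand-in high-res photo
small = orig[::4, ::4]                             # 250x250 detector input
x1, y1, x2, y2 = 10, 20, 60, 80                    # bbox found at low res
scale = 4                                          # factor back to original coords
face_crop = orig[y1 * scale:y2 * scale, x1 * scale:x2 * scale]
# face_crop keeps the full-resolution pixels of the original image
```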
I'll do some testing, thanks for the heads up
okay, white padding seems to be quite effective, even though it's not 100% fail-safe. Using a lower detection resolution seems to impact image quality (or at least the result) quite a bit.
with scaling:
With padding:
Since the only difference is the hair color, I can assume that padding introduces more margin for the crop logic in https://github.com/deepinsight/insightface/blob/01a34cd94f7b0f4a3f6c84ce4b988668ad7be329/python-package/insightface/model_zoo/arcface_onnx.py#L66
```python
aimg = face_align.norm_crop(img, landmark=kps, image_size=self.input_size[0])
```
If the image is padded, the crop is slightly bigger, which on the one hand slightly reduces the resolution of the face relative to the crop size, but on the other hand may include more hair, which can make the hair color more prominent in the arcface embedding.
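For reference, white padding can be done in a few lines of numpy. This is a minimal sketch under two assumptions: `pad_white` is a hypothetical helper (not an existing node or insightface function), and the image is assumed to already fit inside the target size.

```python
import numpy as np

def pad_white(img, size=640):
    """Center img on a size x size white canvas (assumes img already fits).
    Returns the canvas plus the offsets needed to map detections back."""
    h, w = img.shape[:2]
    canvas = np.full((size, size, img.shape[2]), 255, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas, top, left
```

Detected boxes and landmarks then need `left`/`top` subtracted to land back in original-image coordinates.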
This is the whole logic of it https://github.com/deepinsight/insightface/blob/01a34cd94f7b0f4a3f6c84ce4b988668ad7be329/python-package/insightface/utils/face_align.py#L27:
```python
arcface_dst = np.array(
    [[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
     [41.5493, 92.3655], [70.7299, 92.2041]],
    dtype=np.float32)


def estimate_norm(lmk, image_size=112, mode='arcface'):
    assert lmk.shape == (5, 2)
    assert image_size % 112 == 0 or image_size % 128 == 0
    if image_size % 112 == 0:
        ratio = float(image_size) / 112.0
        diff_x = 0
    else:
        ratio = float(image_size) / 128.0
        diff_x = 8.0 * ratio
    dst = arcface_dst * ratio
    dst[:, 0] += diff_x
    tform = trans.SimilarityTransform()
    tform.estimate(lmk, dst)
    M = tform.params[0:2, :]
    return M


def norm_crop(img, landmark, image_size=112, mode='arcface'):
    M = estimate_norm(landmark, image_size, mode)
    warped = cv2.warpAffine(img, M, (image_size, image_size), borderValue=0.0)
    return warped
```
The optimal solution would probably be to detect the face at any cost, so to speak, gradually lowering the detection size, but then allow growing the detected bounding box by some percentage and give the user control over how close a crop they want: do they wish to sacrifice a bit of facial detail by including the hair color, or vice versa?

ComfyUI Impact Pack has a very convenient crop factor widget that lets you control exactly that: how closely to crop the bounding box, with 1.0 being as-is and larger factors expanding the crop area. It is more challenging here, though, since estimate_norm returns the matrix that crops and aligns (rotates) the face, so introducing a scaled rotate+crop while keeping it centered becomes trickier.
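One way to sketch such a crop factor (a hypothetical tweak, not part of insightface): pull the `arcface_dst` template points toward the image center before `tform.estimate`, so the similarity transform maps a larger area around the face into the same output size while staying centered and keeping the rotation.

```python
import numpy as np

arcface_dst = np.array(
    [[38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
     [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def expand_template(dst, image_size=112, crop_factor=1.0):
    """Pull the template landmarks toward the image center by 1/crop_factor.
    Passing the result to tform.estimate() would make the warp cover a
    crop_factor-times wider area around the face (hypothetical tweak)."""
    center = np.array([image_size / 2.0] * 2, dtype=np.float32)
    return center + (dst - center) / crop_factor
```

A crop_factor of 1.0 reproduces the original template; 2.0 would include roughly twice the area around the landmarks.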
So it is not as simple as padding the input image. Padding does offer the advantage of a reasonable default that should almost always work, for photos of any size, not just face crops, and it doesn't break the creative flow by erroring out and requiring the user to fiddle with padding nodes.
The embeddings change quite a bit and the likeness seems to drop at lower resolutions (it's not "just the hair"). Not sure why. Maybe insightface internally uses a low-quality interpolation?
The perfect strategy would be to send a high-res image of the full body (or half-bust) and start detecting tentatively from a super close-up, slowly "zooming out". Honestly, I believe this is a task for a dedicated node, not for IPAdapter itself.
Personally I prefer to get an error and work with a better source; otherwise I wouldn't know what caused the low-quality generation. A solution would be to give the user the option to trigger an aggressive detection strategy, but again, if you start from a super close-up it will just be a hack.
okay I applied the following strategy:
From my preliminary testing, if the source is good quality, 448x448 seems to be the sweet spot. If you add padding to an image that failed at 640x640, it will be detected at 512 instead of 448 (with very little quality difference). Only a few images were detected below 448.
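The strategy above could be sketched as a simple fallback ladder (both `detect_fn` and the size list are placeholders, not the actual implementation):

```python
def detect_with_fallback(img, detect_fn, sizes=(640, 512, 448, 384, 320)):
    """Try detection at progressively smaller det_size; return the first hit.
    detect_fn(img, size) -> list of faces (placeholder signature)."""
    for size in sizes:
        faces = detect_fn(img, size)
        if faces:
            return faces, size
    return [], None
```

With white padding applied first, detection tends to succeed higher up the ladder (512 instead of 448 in the tests above).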
Ideally this would be an option for the user. Maybe I'll make a node for that in the future.
This will appear in the next commit! Thanks for the insight!
Solves 95% of faces too close not being detected for me.