facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0
9.32k stars 829 forks

Is it possible to use DINOv2 for Facial Recognition/Search? #455

Open adalinama opened 3 months ago

adalinama commented 3 months ago

I have my own dataset with hundreds of thousands of photos of people's faces, and I was wondering whether anyone has used DINO for facial recognition. I'm hoping to input a photo of a face and use DINO as a backend to gather all photos that contain the same person, without having to do extensive labeling of the dataset. I also hope DINO can help differentiate faces better when clustering similar images together: the solutions I have tried so far either fail to identify the same person accurately across different pictures, or think that everyone with glasses and white hair is Steven Spielberg. Not sure if I explained this well, but as an example: sometimes I want to find a picture I may have taken with a person several years ago, and I only have one picture of them. Can I use DINO for this, and what would be a good direction to get started on implementing something like it?

odusseys commented 3 months ago

You are going to have a much easier time using models dedicated to face embeddings. DINOv2 is quite slow and gives extremely high-dimensional outputs, which are not really conducive to vector search without some kind of dimensionality reduction.
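Whichever embedding model is chosen, the search step looks the same: embed every photo once, embed the query photo, and rank the gallery by cosine similarity. A minimal sketch of that step follows. The helper names (`l2_normalize`, `top_k_matches`, `embed_images`) are hypothetical, the preprocessing is a common ImageNet-style pipeline assumed here rather than taken from this thread, and the thread does not establish whether DINOv2 embeddings separate identities well. The `dinov2_vits14` hub model returns a 384-dimensional class-token embedding per image; the larger variants return more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_matches(query, gallery, k=5):
    """Return (gallery index, cosine similarity) pairs, best match first."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(gallery):
        g = l2_normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def embed_images(paths, model_name="dinov2_vits14"):
    """Embed image files with a DINOv2 hub model (hypothetical helper).

    Heavy imports are kept local so the search helpers above stay
    dependency-free. The resize/crop sizes are an assumption; any size
    that is a multiple of the 14-pixel patch size would work.
    """
    import torch
    from PIL import Image
    from torchvision import transforms

    model = torch.hub.load("facebookresearch/dinov2", model_name)
    model.eval()
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.485, 0.456, 0.406),
                             std=(0.229, 0.224, 0.225)),
    ])
    batch = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in paths]
    )
    with torch.no_grad():
        feats = model(batch)  # one class-token embedding per image
    return feats.tolist()
```

The brute-force loop in `top_k_matches` is only meant to show the idea. For hundreds of thousands of photos, the embeddings would normally be stored in an approximate nearest-neighbor index, optionally after dimensionality reduction as suggested above.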

zdaiot commented 1 week ago

@odusseys Do you have any models to recommend? Thanks.

1921134176 commented 6 days ago

A simple method is to extract features with backbones of different sizes and then run k-means directly on them for clustering. If everything goes smoothly, you will see how the backbones differ, and you can then pick one with a smaller parameter count to improve efficiency.
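One possible way to make "see the differences" concrete is to cluster the same images with each backbone and score the clusters against a small hand-labeled subset. The sketch below assumes that such a subset exists and uses cluster purity as the score. Neither the metric nor the helper names (`cluster_purity`, `kmeans_assignments`, `compare_backbones`) come from this thread. The k-means call assumes scikit-learn is installed.

```python
from collections import Counter


def cluster_purity(assignments, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    if len(assignments) != len(labels) or not labels:
        raise ValueError("assignments and labels must be non-empty and aligned")
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(labels)


def kmeans_assignments(features, n_clusters, seed=0):
    """Cluster feature vectors with scikit-learn's KMeans (assumed installed)."""
    from sklearn.cluster import KMeans

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features).tolist()


def compare_backbones(features_by_backbone, labels, n_clusters):
    """Map each backbone name to the purity of its k-means clusters.

    features_by_backbone: dict of name -> list of feature vectors, all
    computed on the same labeled images in the same order as `labels`.
    """
    return {
        name: cluster_purity(kmeans_assignments(feats, n_clusters), labels)
        for name, feats in features_by_backbone.items()
    }
```

Purity rises trivially as the number of clusters grows, so `n_clusters` should be held fixed when comparing backbones. If a smaller backbone scores close to a larger one, that is an argument for the cheaper model.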

zdaiot commented 6 days ago

@1921134176 Thanks a lot. I'd also like to ask: DINOv2 blurred identifiable faces in its training data. Does this mean that DINOv2 is not suitable for face tasks?

1921134176 commented 6 days ago

In my personal opinion, DINOv2 mainly aims to provide a universal, task-independent feature extraction backbone, and domain-specific downstream adjustments are still needed. I am not very familiar with facial recognition, but for remote sensing segmentation we freeze the DINOv2 backbone and use the features it extracts to fine-tune downstream applications, which has achieved good results. We use ViT-g.

zdaiot commented 6 days ago

Thanks a lot. How many pictures did you use for fine-tuning?

1921134176 commented 5 days ago

We trained models for many different tasks; the smallest task used 500 samples at 518px and the largest used 500,000 samples at 518px. Generally, training with the backbone frozen for about 10 epochs yields a preliminary usable result. We found that frozen ViT-g has high accuracy from the start and converges quickly during downstream training.
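The frozen-backbone recipe described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual code: it trains a plain linear classification head rather than a segmentation decoder, the helper names (`freeze`, `train_linear_head`) are hypothetical, and the optimizer, learning rate, and loss are placeholder choices. It also assumes the backbone, the head, and the data all live on the same device.

```python
def freeze(module):
    """Disable gradients for every parameter and switch to eval mode."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()
    return module


def train_linear_head(backbone, embed_dim, num_classes, loader,
                      epochs=10, lr=1e-3):
    """Train only a linear head on top of frozen backbone features.

    backbone: a module mapping an image batch to (batch, embed_dim) features,
              for example a DINOv2 model loaded through torch.hub.
    loader:   an iterable of (images, integer_labels) batches.
    """
    import torch
    from torch import nn

    freeze(backbone)
    head = nn.Linear(embed_dim, num_classes)
    # Only the head's parameters are handed to the optimizer.
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():  # the backbone is never updated
                feats = backbone(images)
            loss = loss_fn(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```

Because the backbone is never updated, its features could also be computed once and cached to disk, which makes each of the roughly 10 epochs much cheaper on large datasets.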

zdaiot commented 3 days ago

Thanks a lot~