Training DINO on human images for human image retrieval gives bad performance

facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO

Apache License 2.0

6.22k stars, 904 forks

Training DINO on human images for human image retrieval gives bad performance #203

Open guangdongliang opened 2 years ago

guangdongliang commented 2 years ago

When I train DINO on human images that contain faces, similar-image retrieval performance is very bad. When I try facebook-demo-link with a face image like img, the website says: "Unable to process this image. Please make sure your image does not contain faces".

I wonder: can we not train DINO on images that contain large faces?

woctezuma commented 2 years ago

It is to avoid people looking up other people, e.g. as in PimEyes. See the Terms of Service:

By submitting Materials to the Demo, you represent that: (a) you are at least 18 years old (or the age of majority in the jurisdiction from which you are accessing the Demo, if such age is higher than age 18); (b) the Materials do not contain an image of your face or likeness, or an image of any other person’s face or likeness; and (c) you have the legal right to submit the Materials to the Demo and to grant the license that you grant in such Materials pursuant to these Supplemental Terms and the Terms of Service. https://ssl-demos.metademolab.com/tos

Specifically:

the Materials do not contain an image of your face or likeness, or an image of any other person’s face or likeness

guangdongliang commented 2 years ago

@woctezuma Thank you for the reply! Got it. The search results in the demo are good. Does that ability come from the DINO model? If so, is there any big difference in training compared to ImageNet?

I get good performance for objects and scenery. But it is bad when the image contains a large face, like img and img1, where I want to retrieve similar clothing or facial features. I am so confused.

Do you have any idea about that?

woctezuma commented 2 years ago

Yes, I believe the good performance comes from DINO (self-distillation with no labels). It seems that features extracted by self-supervision are good at describing semantic content.
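For reference, retrieval on top of such features usually comes down to a nearest-neighbour search by cosine similarity. Below is a minimal sketch, assuming the embeddings have already been extracted (for example, from a DINO backbone). The function names `l2_normalize` and `top_k_similar` and the toy vectors are hypothetical and not part of the DINO repository; in practice this would likely be done with PyTorch or NumPy rather than plain Python.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_similar(query, gallery, k=3):
    """Return (index, cosine similarity) pairs for the k gallery vectors closest to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        # The dot product of two unit vectors is their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-d vectors standing in for real image embeddings (hypothetical data).
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
results = top_k_similar([1.0, 0.05, 0.0], gallery, k=2)
for idx, score in results:
    print(f"gallery[{idx}] cosine similarity = {score:.4f}")
```

The ranking is only as good as the embeddings: if the features mostly encode overall scene content, two images with similar clothing or faces will not necessarily score as neighbours.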

Sorry, I cannot help with your issues with faces outside of the web demo. Maybe Mathilde will pop in to check the GitHub issues one day.

mathildecaron31 commented 2 years ago

Hi @guangdongliang, have you managed to fix your issue?

@woctezuma thank you so much for replying to the issues, you're doing a great job :D