Alright, so this will be a one-sided dump of a conversation and it might be a completely stupid idea, but it's so stupid it might work.
When it comes to searching for things/people/faces...
One fun thing that's probably not super useful, but seems easy with the AI embeddings, would be text/image arithmetic. For example, searching for lion -male +female would return images of lionesses. Or img:[photo of a bike]+person would return photos of people riding bikes. 🤷 Seems fun 😄
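A minimal sketch of what that arithmetic could look like, assuming a CLIP model loaded through the sentence-transformers wrapper (the model name, image paths, and queries are just placeholders):

```python
# Sketch of embedding arithmetic for search (assumes the sentence-transformers
# CLIP wrapper; image paths are made up).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# "lion -male +female": combine text embeddings with plain vector arithmetic.
lion, male, female = model.encode(["lion", "male", "female"])
query = lion - male + female
# The img:[photo of a bike]+person variant would be the same idea,
# just swapping one of the text embeddings for an image embedding.

# Library of image embeddings to search against.
paths = ["img/001.jpg", "img/002.jpg", "img/003.jpg"]
library = model.encode([Image.open(p) for p in paths])

# Rank images by cosine similarity to the composed query vector.
scores = util.cos_sim(query, library)[0]
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {path}")
```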
Face recognition/detection, but what if it was different? Related to #46
I agree that face recognition is a hard and error-prone problem. I've been thinking about how to tackle it, so if you don't mind indulging me for a moment.
So what I've usually seen is that face detection is a separate process from face recognition. That is, with detection you know you have a million faces, but you have no names and only some confidence about which unique people those faces belong to. Recognition is what tells those faces apart.
Usually what many apps then do is show you all the presumably unique faces and let you name them. And since recognition is not infallible, they also let you accept or reject individual instances of a face to better train the model on that person. Now this is pretty standard and there are existing solutions for it, so it's a safe way to go.
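For reference, that conventional detect-then-cluster-then-name pipeline could look roughly like this (a sketch assuming the face_recognition library and scikit-learn; photo paths and clustering parameters are placeholders):

```python
# Sketch of the conventional detect-then-cluster pipeline (assumes the
# face_recognition library and scikit-learn; paths are placeholders).
import face_recognition
from sklearn.cluster import DBSCAN

paths = ["img/001.jpg", "img/002.jpg"]  # made-up photo paths
encodings, owners = [], []
for path in paths:
    image = face_recognition.load_image_file(path)
    for enc in face_recognition.face_encodings(image):  # detection + embedding
        encodings.append(enc)
        owners.append(path)

# Cluster the face embeddings; each cluster is a presumably unique person
# that the user would then be asked to name.
labels = DBSCAN(metric="euclidean", eps=0.5, min_samples=2).fit_predict(encodings)
for label, path in zip(labels, owners):
    print(f"person cluster {label}: {path}")
```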
However! Integrating all that sounds a bit boring and I'm here to have fun, so I've been thinking of something else, which is so crazy it might work, or be a complete waste of weeks of development... But hear me out.
What if you think of naming a face (i.e. creating a person) as creating an "auto" person tag? Say you take a reference image of the person's face and then compute the tag by using the "related images" functionality, tagging any images that pass a similarity threshold. Maybe that would already be pretty good as a first try, but since there is only one reference image, it would probably also find all kinds of unrelated stuff.
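In code, that first try could be as small as this (a sketch; the embed() helper, the embedding store, and the 0.3 threshold are all made up):

```python
# Sketch of the one-reference-image "auto tag": tag everything whose embedding
# is close enough to the reference photo.
import numpy as np

def auto_tag(reference: np.ndarray, library: dict[str, np.ndarray], threshold: float = 0.3) -> list[str]:
    """Return the photo ids whose cosine similarity to the reference passes the threshold."""
    ref = reference / np.linalg.norm(reference)
    tagged = []
    for photo_id, emb in library.items():
        similarity = float(np.dot(ref, emb / np.linalg.norm(emb)))
        if similarity >= threshold:
            tagged.append(photo_id)
    return tagged

# Hypothetical usage: embed() and all_photo_embeddings are not real APIs here.
# tagged = auto_tag(embed("alice_reference.jpg"), all_photo_embeddings)
```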
So what if we take it one step further? Let's still have the one output auto tag, but also have two "input" tags, one for "accepted" images and one for "rejected" ones, the same way face recognition systems record accepted and rejected faces. Then you could pick a model (e.g. logistic regression) to "train" on these positive and negative examples, and at the end apply it to all images to get a potentially more accurate output auto face tag. Now this is probably just reinventing face recognition badly, however...
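A rough sketch of that train-and-apply step with scikit-learn (the function names, the embedding dict, and the probability threshold are placeholders, not an existing API):

```python
# Sketch of "training" a per-person auto tag from accepted/rejected examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_auto_tag(accepted: list[np.ndarray], rejected: list[np.ndarray]) -> LogisticRegression:
    """Fit a tiny classifier on the user's positive/negative example embeddings."""
    X = np.vstack(accepted + rejected)
    y = np.array([1] * len(accepted) + [0] * len(rejected))
    return LogisticRegression(max_iter=1000).fit(X, y)

def apply_auto_tag(model: LogisticRegression, library: dict[str, np.ndarray], threshold: float = 0.5) -> list[str]:
    """Return the photo ids the model believes contain the person."""
    ids = list(library)
    probs = model.predict_proba(np.vstack([library[i] for i in ids]))[:, 1]
    return [photo_id for photo_id, p in zip(ids, probs) if p >= threshold]
```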
None of what I said is even specific to faces. If the CLIP AI embeddings are "expressive enough", you could theoretically have trained auto tags for your partner, your dog, a specific bridge you often take photos of, a certain type of cloud, or food pics, as long as you provide enough examples. Presumably the model would pick up on many cues beyond the face, like clothes and so on, so perhaps it could even detect people with obscured faces. It'd be like training (or fine-tuning) small dumb AI models, but more interactively, by the user directly, and without the overhead usually associated with it. Or like "few-shot detection" in ML lingo.
But I'm not an AI scientist, so it could also be a complete trash fire that works like shit. 🤷 Only one way to find out
Linear regression for tags or something 🤷
With accept/reject I meant providing the ground truth: by tagging a photo with e.g. person:alice:accept (could also be "in" or "+") you would say that the photo definitely contains Alice. With alice:reject or alice:out or alice:- you would say that the photo definitely does NOT have Alice in it. These would otherwise just be normal manual tags.
Then you could have a training process that takes e.g. (alice:+, alice:-, threshold:0.3) as input parameters, removes the person:alice tag from all photos and re-adds it based on the new result. So, as you say, you could tune the threshold and the ground-truth examples in case there are too many armpits or siblings detected :)
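Tying it together, that retagging step might look something like this (reusing the hypothetical train_auto_tag/apply_auto_tag helpers sketched above, plus a completely made-up tag store API):

```python
# Sketch of the retraining/retagging process; `store` and its methods are a
# hypothetical tag database API, and "alice"/0.3 are placeholder parameters.
def retrain_person_tag(store, embeddings: dict, person: str = "alice", threshold: float = 0.3) -> None:
    # Collect the ground truth from the manual accept/reject tags.
    accepted = [embeddings[i] for i in store.photos_with_tag(f"person:{person}:+")]
    rejected = [embeddings[i] for i in store.photos_with_tag(f"person:{person}:-")]

    model = train_auto_tag(accepted, rejected)

    # Drop the old auto tag everywhere, then re-add it from the fresh result.
    store.remove_tag_from_all(f"person:{person}")
    for photo_id in apply_auto_tag(model, embeddings, threshold):
        store.add_tag(photo_id, f"person:{person}")
```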
I agree that the UX would need to be slick for this to be usable; nobody will do it if you have to add the tags manually yourself. But an interactive, auto-refreshing results page that updates as you click to accept/reject candidates would be sweet. If you really wanted to gamify it, you could even do a Tinder-like swipe left/right to say whether it's a picture of your dog or not lol.