CLIP can be used to enable search for concepts that haven't been explicitly tagged, or to find images that may match existing tags. I can do some dev work, but I want to make sure I'd be able to be productive in a big project like this.
To be clear, everything I'm talking about would run locally, be very performant, and even fit into the tags metaphor.
CLIP is a comparatively small, "low end" AI model that generates vector representations of both text and images in the same latent space. In other words, a trained model has learned to treat an image and a chunk of text as the same kind of thing, effectively learning a shared representation between the two. (CLIP's text encoder is also what powers the "text" side of Stable Diffusion.) These are small models that are very good at things like image search WITHOUT needing a full LLM.
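As a rough sketch of what that looks like in code, here is how an image and a text query could both be embedded into the same space. This assumes the sentence-transformers wrapper around CLIP and a placeholder file name; the actual model and loading code would be whatever we settle on:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Any CLIP checkpoint works; "clip-ViT-B-32" produces 512-dimensional vectors.
model = SentenceTransformer("clip-ViT-B-32")

# Images and text are encoded by the same model into the same vector space.
img_emb = model.encode(Image.open("photo.jpg"))       # shape: (512,)
txt_emb = model.encode("a dog playing in the snow")   # shape: (512,)

# Cosine similarity measures how well the text describes the image.
print(util.cos_sim(img_emb, txt_emb))
```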
Generating vectors takes less than a second per image on CPU (much less with accelerators like a GPU or the AI cores in Apple Silicon), and the vectors can then be reused for any future search at well over 100k images per second. (It's actually faster than comparing a string against a list of filenames in Python.) The generated vectors are anywhere from 512 to 1,024 floats per image (roughly 2-4 KB at 32-bit precision, or half that if stored as 16-bit floats) depending on the model we choose, so they do increase the database size, but the increase is manageable.
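To make the storage cost concrete, here is one way the vectors could sit alongside the existing data. The database path, table, and column names are just placeholders, since I don't know the real schema:

```python
import sqlite3

import numpy as np

conn = sqlite3.connect("library.db")  # placeholder path
conn.execute(
    "CREATE TABLE IF NOT EXISTS clip_vectors (image_id INTEGER PRIMARY KEY, vector BLOB)"
)

# Stand-in for a real CLIP embedding (512 float32 values, about 2 KB).
vec = np.random.rand(512).astype(np.float32)
conn.execute(
    "INSERT OR REPLACE INTO clip_vectors (image_id, vector) VALUES (?, ?)",
    (42, vec.tobytes()),
)
conn.commit()

# Loading everything back into one matrix for search is a single pass.
rows = conn.execute("SELECT image_id, vector FROM clip_vectors").fetchall()
ids = [r[0] for r in rows]
matrix = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])
```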
Once generated, the vectors can be compared against a text query to find matches within the set very quickly using optimized libraries such as Faiss.
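A minimal sketch of that search path, assuming 512-dimensional vectors; the index type, query string, and k value are just examples:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

# Placeholder for the (N, 512) float32 matrix of stored image embeddings.
image_vectors = np.random.rand(1000, 512).astype(np.float32)
faiss.normalize_L2(image_vectors)

# Inner product on unit-length vectors is cosine similarity.
index = faiss.IndexFlatIP(512)
index.add(image_vectors)

# Embed the text query with the same model and search the index.
query = model.encode(["sunset over a lake"]).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 20)  # top 20 matches
```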
You could also search by other images, allowing "similarity" searches to follow the exact same code path.
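Sketching that out, an image query really is just a different input to the same index. This reuses `model` and `index` from the previous sketch, and the query file name is a placeholder:

```python
from PIL import Image

# Embed the query image with the same CLIP model used for text queries.
query_img = model.encode([Image.open("query.jpg")]).astype(np.float32)
faiss.normalize_L2(query_img)

# Identical search call; only where the query vector came from has changed.
scores, ids = index.search(query_img, 20)
```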
The image-to-image capability also opens up the opportunity to suggest tags automatically. Once a tag has a number of images, we could compute a vector shared by those images (e.g. their average embedding), compare other images against it, and offer the tag as a suggestion when they are substantially similar. It could also work the other way: a new image could be compared against all the tags and the best-matching tags listed as suggestions. (This functionality would require other work on an autotagging interface, but the groundwork would already be there.)
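A sketch of the comparison itself, with a made-up similarity threshold that would need tuning:

```python
import numpy as np

def suggest_tag(tag_vectors: np.ndarray, candidate: np.ndarray, threshold: float = 0.25) -> bool:
    """Decide whether an untagged image should be offered an existing tag.

    tag_vectors: (N, D) CLIP embeddings of the images already carrying the tag.
    candidate:   (D,) CLIP embedding of the untagged image.
    threshold:   cosine-similarity cutoff; 0.25 is an arbitrary starting point.
    """
    centroid = tag_vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    candidate = candidate / np.linalg.norm(candidate)
    return float(centroid @ candidate) >= threshold
```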
Implementation is also relatively easy and straightforward. There are well-used, stable, and robust libraries for all the tasks involved that take away a lot of the heavy lifting, letting us focus only on the integration and usage.