Open Tejas-ChandraShekarRaju opened 8 months ago
@Thanmay47, with the current implementation of our storage service, can you come up with a POC for V1 of the search functionality? Think about product use cases: what are we searching for? Let's keep the discussion going in the comments.
@Thanmay47 Can you add links to references we discussed for search?
Search will come in handy when we want to group images by date/time, location, camera type, etc. Other use cases might arise later. For the time being, search filters are the minimum requirement. Exploring vector embeddings might be helpful, but it will take time to implement. If we want a quick solution, we can build a basic document-index search layer that searches over a dynamic metadata document (JSON?).
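To make the document-index idea concrete, here is a minimal sketch of an in-memory inverted index over flat metadata documents. The class name `MetadataIndex`, the field names, and the object ids are all hypothetical; it assumes metadata values are hashable (strings, numbers) and only supports exact-match filters.

```python
from collections import defaultdict


class MetadataIndex:
    """Tiny inverted index over flat JSON-like metadata documents."""

    def __init__(self):
        self.docs = {}                  # object_id -> metadata dict
        self.index = defaultdict(set)   # (field, value) -> set of object_ids

    def add(self, object_id, metadata):
        # Re-adding an object replaces its previous index entries.
        self.remove(object_id)
        self.docs[object_id] = dict(metadata)
        for field, value in metadata.items():
            self.index[(field, value)].add(object_id)

    def remove(self, object_id):
        old = self.docs.pop(object_id, None)
        if old is None:
            return
        for field, value in old.items():
            self.index[(field, value)].discard(object_id)

    def search(self, **filters):
        """Return sorted ids of objects matching every field=value filter."""
        if not filters:
            return sorted(self.docs)
        matches = [self.index.get((f, v), set()) for f, v in filters.items()]
        return sorted(set.intersection(*matches))


if __name__ == "__main__":
    index = MetadataIndex()
    # Hypothetical metadata documents; real fields would come from the images.
    index.add("img-001", {"date": "2024-03-28", "camera": "cam-a", "location": "site-1"})
    index.add("img-002", {"date": "2024-03-28", "camera": "cam-b", "location": "site-1"})
    index.add("img-003", {"date": "2024-03-29", "camera": "cam-a", "location": "site-2"})
    print(index.search(date="2024-03-28", location="site-1"))
```

Because each document is just a dict, new fields can show up at any time without a schema change, which is what makes the metadata document "dynamic". Range queries (for example a date window) would need a sorted structure on top of this.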
Open source projects that we can either use or draw inspiration from:
Do we run inference on the images, or do we expect metadata during object creation? This should be as easy as possible for the user. I'm thinking about the ways users can provide metadata. Does JPEG embed location, date/time, etc. inside the image?
Afaik, metadata is embedded in the image and can be extracted fairly easily. We can programmatically extract image metadata and store it in a db. This extraction can happen async when object storage is requested. Generally, JPEG does store image metadata. Certain compression methods can strip the metadata or rewrite it, so we need to check the extracted metadata before storing it to prevent pollution.
@johnkaramchand @suryamurugan In case this interests you.
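A rough sketch of the async flow described above, using only the standard library. The extractor is passed in as a plain function and is a stand-in here: a real one would parse the image bytes. The table layout, `ALLOWED_FIELDS`, and the sanity checks are assumptions for illustration, not part of the current storage service.

```python
import asyncio
import json
import sqlite3

# Hypothetical whitelist of metadata fields we are willing to index.
ALLOWED_FIELDS = {"datetime", "camera", "latitude", "longitude"}


def clean_metadata(raw):
    """Keep only known fields and drop values that fail basic sanity checks."""
    cleaned = {k: v for k, v in raw.items()
               if k in ALLOWED_FIELDS and v not in (None, "")}
    lat, lon = cleaned.get("latitude"), cleaned.get("longitude")
    coords_ok = (
        isinstance(lat, (int, float)) and isinstance(lon, (int, float))
        and -90 <= lat <= 90 and -180 <= lon <= 180
    )
    if not coords_ok:
        # A half-present or out-of-range coordinate pair is treated as pollution.
        cleaned.pop("latitude", None)
        cleaned.pop("longitude", None)
    return cleaned


async def index_metadata(db, object_id, image_bytes, extractor):
    # Run the (possibly slow) extractor in a worker thread (Python 3.9+).
    raw = await asyncio.to_thread(extractor, image_bytes)
    db.execute(
        "INSERT OR REPLACE INTO metadata (object_id, doc) VALUES (?, ?)",
        (object_id, json.dumps(clean_metadata(raw))),
    )
    db.commit()


async def demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE metadata (object_id TEXT PRIMARY KEY, doc TEXT)")

    def fake_extractor(_image_bytes):
        # Stand-in output: one bad latitude and one field we do not index.
        return {"camera": "cam-1", "latitude": 999, "longitude": 10, "software": "x"}

    # The upload request could return right after scheduling this task;
    # it is awaited here only so the stored result can be shown.
    task = asyncio.create_task(index_metadata(db, "img-001", b"...", fake_extractor))
    await task
    row = db.execute(
        "SELECT doc FROM metadata WHERE object_id = ?", ("img-001",)
    ).fetchone()
    return json.loads(row[0])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the invalid coordinate pair and the unknown field are dropped, so only the camera entry is stored. In a real service the scheduled task would more likely be a job on a queue than an in-process asyncio task.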
Also found this - https://www.trychroma.com/
Played around with Meilisearch. Looks like it's great for text search.
@Thanmay47 we need something like this. Given the provided metadata, we automatically engineer features from the image and create indexes for them. Meilisearch does the job, but who will do the feature engineering on the images to provide Meilisearch with the required JSON/CSV?
@xrehpicx you were mentioning Facebook had something, yeah?
Okay, so we extract the features ourselves (features like elephant present/not present, tuskless elephant/elephant with tusks, etc.) using classification models that automatically populate a buffer, and then we update the doc. This entire process can be automated and run async. This will give us basic search/filter functionality over most images. To implement semantic search, we will have to store an autoencoder-generated embedding for each image. At search time, we will have to run some version of similarity search or clustering over those embeddings.
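For the semantic-search part, here is a minimal sketch of brute-force similarity search over stored embeddings using cosine similarity. The vectors below are toy values, not real autoencoder output, and the function names are hypothetical. A linear scan like this is only reasonable for small collections; a vector index would be needed at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Return up to k (object_id, score) pairs, most similar first."""
    scored = [(oid, cosine_similarity(query, vec)) for oid, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional embeddings keyed by hypothetical object ids.
    embeddings = {
        "img-001": [0.9, 0.1, 0.0],
        "img-002": [0.0, 1.0, 0.2],
        "img-003": [0.8, 0.3, 0.1],
    }
    query = [1.0, 0.0, 0.0]  # would be the embedding of the query image
    for object_id, score in top_k_similar(query, embeddings, k=2):
        print(object_id, round(score, 3))
```

The classifier-derived tags (elephant present, tusks, etc.) would go into the metadata document for filtering, while the embeddings stay in a separate store keyed by the same object id, so both kinds of search can be combined.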
Any design suggestion for a fully functional v1?
I don't have full context. Can we discuss this in college tomorrow?
I'm not coming today, but we can get on a call in the evening @xrehpicx. @Tejas-ChandraShekarRaju I'll try to make a complete poc for whatever I've described above.
@Thanmay47 That's great news. Looking forward to the POC. We'll try to have everyone on the call.
CC : @xrehpicx @johnkaramchand @suryamurugan @ravi-ks @yashashav_dk
https://diamond.cs.cmu.edu/whatisdiamond.html
cc : @johnkaramchand @suryamurugan @Thanmay47
The expectation is to come up with the changes required to the system to enable efficient search.