cia-labs / Storage-service

General purpose storage service
GNU General Public License v3.0
Search Design. #35

Open Tejas-ChandraShekarRaju opened 8 months ago

Tejas-ChandraShekarRaju commented 8 months ago

The expectation is to come up with the changes required to the system to enable efficient search.

Tejas-ChandraShekarRaju commented 8 months ago

@Thanmay47, with the current implementation of our storage service, can you come up with a POC for V1 of the search functionality? Think about product use cases: what are we searching for? Let's keep the discussion going in the comments.

Tejas-ChandraShekarRaju commented 8 months ago

@Thanmay47 Can you add links to the references we discussed for search?

Thanmay47 commented 8 months ago

Search will come in handy when we are trying to group images by date/time, location, camera type, etc. Other use cases might arise later. For the time being, search filters are the minimum requirement. Exploring vector embeddings might be helpful, but it will take time to implement. If we want a quick solution, we can build a basic document-index search layer that searches over a dynamic metadata document (JSON?).
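A minimal sketch of the filter-style search described above, assuming each stored object has one flat metadata document. The `MetadataIndex` class, the field names, and the sample values are all hypothetical and only illustrate the idea; this is not the service's actual design.

```python
from collections import defaultdict


class MetadataIndex:
    """Tiny in-memory index over per-object metadata documents (hypothetical)."""

    def __init__(self):
        # (field, value) -> set of object ids holding exactly that value
        self._postings = defaultdict(set)
        self._docs = {}

    def add(self, object_id, metadata):
        """Index one metadata document; fields are dynamic, values must be hashable."""
        self._docs[object_id] = metadata
        for field, value in metadata.items():
            self._postings[(field, value)].add(object_id)

    def search(self, **filters):
        """Return sorted ids of objects that match every field=value filter."""
        if not filters:
            return sorted(self._docs)
        matches = [self._postings.get((f, v), set()) for f, v in filters.items()]
        return sorted(set.intersection(*matches))


# Illustrative data only.
index = MetadataIndex()
index.add("img-1", {"date": "2024-03-28", "camera": "Canon EOS R5", "location": "Bandipur"})
index.add("img-2", {"date": "2024-03-28", "camera": "Nikon Z6", "location": "Bandipur"})
index.add("img-3", {"date": "2024-03-29", "camera": "Canon EOS R5"})

print(index.search(date="2024-03-28"))
print(index.search(camera="Canon EOS R5", location="Bandipur"))
```

Exact-match filters like these cover grouping by date, location, or camera type; range queries and free-text search would need a real engine or a richer index.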

Open source projects that we can either use or draw inspiration from:

  1. Meilisearch - Rust-based search engine. Very fast, and has support for vector search as well. https://www.meilisearch.com/
  2. Weaviate - Vector-based database and search engine. Built in Go, reasonably fast. https://weaviate.io/

Tejas-ChandraShekarRaju commented 8 months ago

Do we run inference on the images, or do we expect metadata during object creation? This should be as easy as possible for the user. I'm thinking about the ways users can provide metadata. Does JPEG embed location, date/time, etc. inside the image?

Thanmay47 commented 8 months ago

AFAIK, metadata is embedded in the image and can be extracted fairly easily. We can programmatically extract image metadata and store it in a DB. This extraction can happen asynchronously when object storage is requested. Generally, JPEG does store image metadata. Certain compression methods can strip or rewrite the metadata, so we need to examine it to prevent pollution.
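A rough sketch of that flow, assuming an in-process queue: the object is stored right away, and a background worker extracts and sanity-checks the metadata later. Everything here is hypothetical (`OBJECTS`, `METADATA_DB`, `extract_metadata`, `clean_metadata`, the field names); the extractor is a hard-coded placeholder standing in for a real EXIF parser.

```python
import asyncio
from datetime import datetime

# Hypothetical in-memory stand-ins for the object store and the metadata DB.
OBJECTS = {}
METADATA_DB = {}


def extract_metadata(data):
    """Placeholder extractor: a real one would parse EXIF from the image bytes."""
    # Hard-coded values for illustration, including two deliberately bad fields.
    return {"DateTimeOriginal": "2024:03:28 10:15:00", "Model": "", "GPSLatitude": 95.0}


def clean_metadata(raw):
    """Keep only fields that pass basic sanity checks, to avoid polluting the index."""
    clean = {}
    model = raw.get("Model")
    if isinstance(model, str) and model.strip():
        clean["camera"] = model.strip()
    try:
        # EXIF timestamps conventionally look like "YYYY:MM:DD HH:MM:SS".
        taken = datetime.strptime(raw.get("DateTimeOriginal", ""), "%Y:%m:%d %H:%M:%S")
        clean["taken_at"] = taken.isoformat()
    except (TypeError, ValueError):
        pass
    lat = raw.get("GPSLatitude")
    if isinstance(lat, (int, float)) and -90 <= lat <= 90:
        clean["latitude"] = float(lat)
    return clean


async def metadata_worker(queue):
    """Consume object ids and store cleaned metadata without blocking uploads."""
    while True:
        object_id = await queue.get()
        try:
            METADATA_DB[object_id] = clean_metadata(extract_metadata(OBJECTS[object_id]))
        finally:
            queue.task_done()


async def put_object(queue, object_id, data):
    """Store the object immediately and defer metadata extraction."""
    OBJECTS[object_id] = data
    await queue.put(object_id)


async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(metadata_worker(queue))
    await put_object(queue, "img-1", b"\xff\xd8\xff")
    await queue.join()  # wait for pending extraction jobs (demo only)
    worker.cancel()


asyncio.run(main())
print(METADATA_DB)
```

In this run the empty camera model and the out-of-range latitude are dropped, so only the timestamp reaches the metadata store.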

Tejas-ChandraShekarRaju commented 8 months ago

@johnkaramchand @suryamurugan in case this interests you.

Thanmay47 commented 8 months ago

Also found this - https://www.trychroma.com/

Tejas-ChandraShekarRaju commented 8 months ago

Played around with Meilisearch. Looks like it's great for text search.

  1. Basically, you create an index for a list of movies and their details, a list of companies and their details, etc.
  2. You can then search on that index; it is mostly super-fast text-based search.
  3. It also gives you a screen for searching, and has both cloud and on-prem support.

Tejas-ChandraShekarRaju commented 8 months ago

@Thanmay47 we need something like this. We should automatically engineer features from the image, combine them with the provided metadata, and create indexes for them. Meilisearch does the job, but who will do the feature engineering on the images to give Meilisearch the JSON/CSV it requires?
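One possible shape for that JSON, as a sketch only: merge the user-provided metadata with model-derived features into one flat document per object. The function `build_search_document`, the `feature_` prefix, and the sample fields are assumptions for illustration; the unique `id` field is there because a search index such as Meilisearch needs a primary key per document.

```python
import json


def build_search_document(object_id, provided_metadata, extracted_features):
    """Merge user-supplied metadata with model-derived features into one flat document."""
    doc = {"id": object_id}
    # Prefix derived fields so they do not collide with user-supplied field names.
    for name, value in extracted_features.items():
        doc[f"feature_{name}"] = value
    # User-supplied fields are written last, so they win on any remaining collision.
    for name, value in provided_metadata.items():
        if name != "id":
            doc[name] = value
    return doc


# Illustrative values only.
doc = build_search_document(
    "img-1",
    {"location": "Bandipur", "camera": "Canon EOS R5"},
    {"elephant_present": True, "has_tusks": False},
)
payload = json.dumps([doc], sort_keys=True)
print(payload)
```

A list of such documents is what would be handed to the indexing layer, whichever engine ends up behind it.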

@xrehpicx you mentioned Facebook had something, right?

Thanmay47 commented 8 months ago

Okay, so we extract features ourselves (e.g. elephant present/not present, tuskless elephant/elephant with tusks) using classification models that automatically populate a buffer, then we update the doc (this entire process can be automated and run async). This gives us basic search/filter functionality over most images. To implement semantic search, we will have to store autoencoder-generated embeddings of each image. At search time, we then have to implement some version of similarity search or clustering.
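For the semantic-search part, a minimal brute-force sketch, assuming each image already has an embedding vector stored alongside it. The vectors below are made-up three-dimensional examples (real autoencoder embeddings would be much larger), and `most_similar` is a hypothetical helper, not an existing API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=2):
    """Return the ids of the k stored embeddings closest to the query vector."""
    scored = [(cosine_similarity(query, vec), object_id) for object_id, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [object_id for _, object_id in scored[:k]]


# Made-up embeddings for illustration.
EMBEDDINGS = {
    "img-1": [0.9, 0.1, 0.0],
    "img-2": [0.8, 0.2, 0.1],
    "img-3": [0.0, 0.1, 0.9],
}

print(most_similar([1.0, 0.0, 0.0], EMBEDDINGS, k=2))  # ['img-1', 'img-2']
```

A linear scan like this is fine for a POC; at larger scale a vector database or an approximate nearest-neighbour index would replace it.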

Tejas-ChandraShekarRaju commented 8 months ago

Any design suggestions for a fully functional v1?

xrehpicx commented 8 months ago

I don't have full context. Can we discuss this in college tomorrow?

Thanmay47 commented 8 months ago

I'm not coming in today, but we can get on a call in the evening @xrehpicx. @Tejas-ChandraShekarRaju I'll try to make a complete POC for whatever I've described above.

Tejas-ChandraShekarRaju commented 8 months ago

@Thanmay47 That's great news. Looking forward to the POC. We'll try to have everyone on the call.

CC : @xrehpicx @johnkaramchand @suryamurugan @ravi-ks @yashashav_dk

Tejas-ChandraShekarRaju commented 5 months ago

https://diamond.cs.cmu.edu/whatisdiamond.html

cc : @johnkaramchand @suryamurugan @Thanmay47