SciPhi-AI / R2R

The Supabase for RAG - R2R lets you build, scale, and manage user-facing Retrieval-Augmented Generation applications in production.
https://r2r-docs.sciphi.ai/
MIT License

Embed media like images, audio, 3D, video, etc.? #79

Open · fire opened this issue 4 months ago

fire commented 4 months ago

Hi,

I was wondering if it was in scope to embed media?

emrgnt-cmplxty commented 4 months ago

That's definitely in scope. The best way to approach this would be to introduce the necessary embedding providers and to modify or create a new pipeline that shows an example of this in action.

I'm happy to team up on this.
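One way to picture the "introduce the necessary embedding providers" idea is a small provider interface that each modality implements. This is a minimal, runnable sketch; the class and method names here are hypothetical and not R2R's actual API, and the image provider is a stub standing in for a real vision model.

```python
import hashlib
from abc import ABC, abstractmethod

# Hypothetical provider interface -- names are illustrative only,
# not taken from the R2R codebase.
class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, data: bytes) -> list[float]:
        """Return a fixed-size embedding vector for the input."""

class StubImageEmbeddingProvider(EmbeddingProvider):
    """Placeholder: a real implementation would call a vision model
    (e.g. a CLIP-style encoder); here we derive a deterministic
    vector from the raw bytes so the sketch is runnable."""
    DIM = 8

    def embed(self, data: bytes) -> list[float]:
        digest = hashlib.sha256(data).digest()
        return [b / 255.0 for b in digest[: self.DIM]]

provider = StubImageEmbeddingProvider()
vec = provider.embed(b"fake image bytes")
print(len(vec))  # 8
```

A text provider and a mesh provider would then be additional subclasses, and a pipeline could pick a provider per document without the core retrieval logic caring which modality produced the vector.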

fire commented 4 months ago

I have a few primary use cases:

  1. The basic use case is taking an image and producing an embedding, as with Stable Diffusion or the various combined vision-text models. A few models can also handle video.
  2. My pet emerging-technologies use case is to take a 3D mesh from https://github.com/lucidrains/meshgpt-pytorch and have it auto-complete vertices or search a database of other embedded meshes using the mesh token embedding.
  3. Someday, maybe: audio and speech. I am not familiar with these at all.

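Use case 1 boils down to nearest-neighbour search over embedded media. A toy sketch of that retrieval step, with hand-written vectors standing in for a real vision (or mesh) model's output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stub embeddings -- in practice these would come from an image or
# mesh encoder; filenames and vectors here are made up.
db = {
    "cat.png": [0.9, 0.1, 0.0],
    "dog.png": [0.8, 0.3, 0.1],
    "car.png": [0.0, 0.2, 0.9],
}

query = [0.95, 0.05, 0.0]  # embedding of a query image
best = max(db, key=lambda name: cosine(query, db[name]))
print(best)  # cat.png
```

The mesh case (use case 2) would use the same machinery, just with meshgpt-pytorch's mesh token embeddings populating the database.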
emrgnt-cmplxty commented 4 months ago

For image embedding, do you think we can fit it into the pipeline here [https://github.com/SciPhi-AI/R2R/blob/main/r2r/pipelines/basic/ingestion.py] with a specific embedding provider, or do you think we need to fundamentally rework the structure of the codebase in some way?

I think multi-modal is an important use case and I am very interested in figuring out how to best support this.
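One way this might slot into the existing pipeline without a fundamental rework is to dispatch on the document's media type, so the current text path stays untouched and new modalities just register a provider. All names below are hypothetical, not taken from `ingestion.py`; the embed functions are runnable stubs.

```python
# Hypothetical dispatch-by-media-type sketch. A real version would
# call actual embedding models; these stubs just return deterministic
# vectors so the example runs.
def stub_text_embed(data: bytes) -> list[float]:
    return [float(len(data) % 7), 1.0]

def stub_image_embed(data: bytes) -> list[float]:
    return [float(data[0]) if data else 0.0, 2.0]

PROVIDERS = {
    "text/plain": stub_text_embed,
    "image/png": stub_image_embed,
}

def ingest(data: bytes, media_type: str) -> list[float]:
    """Route a document to the embedding provider for its media type."""
    try:
        provider = PROVIDERS[media_type]
    except KeyError:
        raise ValueError(f"no embedding provider registered for {media_type}")
    return provider(data)

print(ingest(b"hello", "text/plain"))  # [5.0, 1.0]
```

The appeal of this shape is that adding video or mesh support is one new entry in the registry rather than a structural change.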

fire commented 4 months ago

I don't think I can drive multi-modal too much, but I'll see what spare time I can gather.

fire commented 4 months ago

The obvious question is: when we have two different embedding models, each with its own token and vector space, how do we keep them in sync?
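One common answer to that question is not to sync them at all: vectors from different models live in incompatible spaces and are not directly comparable, so the store can namespace vectors by model id and only ever search within one namespace. A minimal sketch of that safeguard (class and method names are hypothetical):

```python
from collections import defaultdict

class NamespacedVectorStore:
    """Toy vector store that keeps each embedding model's vectors in
    a separate namespace, so vectors from different models are never
    compared against each other."""

    def __init__(self):
        self._spaces = defaultdict(dict)  # model_id -> {doc_id: vector}

    def add(self, model_id: str, doc_id: str, vector: list[float]) -> None:
        self._spaces[model_id][doc_id] = vector

    def search(self, model_id: str, query: list[float], top_k: int = 1) -> list[str]:
        space = self._spaces[model_id]
        if not space:
            raise KeyError(f"no vectors stored for model {model_id!r}")
        # Naive L2 nearest-neighbour within a single model's space.
        def dist(doc_id: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, space[doc_id]))
        return sorted(space, key=dist)[:top_k]

store = NamespacedVectorStore()
store.add("clip-vit", "img1", [0.1, 0.9])
store.add("meshgpt", "mesh1", [5.0, 5.0])
print(store.search("clip-vit", [0.2, 0.8]))  # ['img1']
```

Cross-modal retrieval then needs a model that embeds both modalities into one shared space (CLIP-style), rather than any after-the-fact syncing of two independent models.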