earth-mover / community

Community support for Arraylake

Is ArrayLake a VectorDB? #6

Open alxmrs opened 6 months ago

alxmrs commented 6 months ago

Not that Earthmover would necessarily want to be known as yet another startup within this very competitive space. However, I think it would be cool to see a comparison between other vector DBs and what a managed Zarr dataset could do. It seems like it would be easy to put a proof of concept together with faiss or annoy.

I think approximate similarity search algorithms could be interesting for scientific use cases (can they provide better lookups than metadata-based search?). Further, I like that Arraylake + Zarr address the cloud- and state-management-shaped problems while stepping aside so the ML practitioner can choose their preferred tool for similarity search.

rabernat commented 6 months ago

Thanks for the suggestion Alex! You're correct that it's fairly easy to create a vector search interface on top of Xarray + Zarr. Here's an example: https://gist.github.com/rabernat/40f53bba3a81aeb420e14872388c6fc1

In contrast to most vector DBs on the market today, all of the index building and search happens on the client side; Arraylake doesn't provide any server-side implementation of either. So I'd be hesitant to characterize Arraylake as a VectorDB.
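To make "client side" concrete, here is a minimal sketch of a brute-force similarity search over an in-memory array of embeddings. It is not the code from the gist above: the function name `top_k_cosine` and the random `vectors` array are hypothetical, and the array simply stands in for embeddings that a client might have loaded from a Zarr store. A library such as faiss or annoy would replace the exhaustive scan with an approximate index.

```python
import numpy as np


def top_k_cosine(embeddings, query, k=5):
    """Return (indices, scores) of the k rows most similar to `query`.

    Hypothetical helper: exhaustive cosine-similarity search, run entirely
    on the client. Assumes no row (and not the query) has zero norm.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    q = np.asarray(query, dtype=np.float64)

    # Normalize so that a plain dot product equals cosine similarity.
    emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    scores = emb_unit @ q_unit

    # Select the k best rows without sorting everything, then order those k.
    k = min(k, scores.shape[0])
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return best, scores[best]


# Stand-in data: 1000 random 64-dimensional embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))

# Querying with a stored vector should return that vector first.
idx, sims = top_k_cosine(vectors, vectors[42], k=3)
```

Nothing here touches a server: the store only has to hand back array data, and the choice of index and distance metric stays with the user.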

alxmrs commented 6 months ago

You’re totally right, and that’s why I like it so much! You let the user bring in faiss themselves to tune an index, while making the hard stuff, like transactions and concurrency, easy. Your caution makes sense, but I do see the appeal of a more DIY VectorDB. Similar to how you’ve written that the best data API is a cloud-optimized store in a bucket, I like the appeal of a simple, “serverless” embedding store.

rabernat commented 6 months ago

😍

Would you like to help turn my gist into a proper Python package? Could be a good project for your sabbatical? 😉

alxmrs commented 6 months ago

That sounds like a fun project. I’ll consider it, but I don’t expect to have the time 😉.

ljstrnadiii commented 3 months ago

That would be a cool package. Lots to figure out, like how many chunks to put in each (potentially distributed/sharded) index and how we would reduce the results. I have had great success with the ResultHeap class in faiss to "reduce" searches over sharded index(es). I have thought, though, that xarray could be well suited to this type of problem.
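The "reduce" step can be sketched without any particular library. The snippet below is an illustration only, not faiss's implementation: `merge_top_k` is a hypothetical helper, and it assumes each shard has already returned its own nearest neighbours as `(distance, global_id)` pairs, where a smaller distance means a better match.

```python
import heapq
from itertools import chain


def merge_top_k(shard_results, k):
    """Merge per-shard search results into a single global top-k.

    Hypothetical helper. `shard_results` is an iterable of lists of
    (distance, global_id) tuples, one list per shard. Tuples compare by
    distance first, so nsmallest keeps the k closest matches overall.
    """
    return heapq.nsmallest(k, chain.from_iterable(shard_results))


# Made-up results from three shards, each searched independently.
shard_a = [(0.10, 7), (0.40, 2)]
shard_b = [(0.05, 105), (0.30, 101)]
shard_c = [(0.20, 203)]

top3 = merge_top_k([shard_a, shard_b, shard_c], k=3)
# top3 == [(0.05, 105), (0.10, 7), (0.20, 203)]
```

For the merged answer to be the exact global top-k, each shard has to return at least k candidates of its own (or everything it holds, if it has fewer than k vectors).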