0hq / tinyvector

A tiny nearest-neighbor embedding database built with SQLite and PyTorch. (In development!)
MIT License

High level: Rethink the embedding database structure #5

Open 0hq opened 1 year ago

0hq commented 1 year ago

tinyvector should aim to be as simple as possible while being as powerful as possible. What is the best abstraction for an embedding database that minimizes the complexity and size of the codebase?

We've made a few assumptions so far:

  1. Only one index can be tied to each table, since we're assuming most users won't need multiple indexes on the same data.
  2. Each index can only be tied to a single table; it cannot span multiple tables or use special clauses. This might need to change in the future? Do we want to allow indexes to be built on multiple tables or with complex filtering?
  3. Indexes should avoid being mutable and instead require manual deletion and recreation? We may want to offer a few mutable index types for compatibility, but it seems more straightforward (from both a performance and a user-experience perspective) for most indexes to be immutable.
  4. Holding all indexes in memory and relying on vertical scaling seems like the simplest way to build tinyvector. In most common use cases, it seems that vectors can easily be held in memory on reasonable hardware. If needed, you can apply dimensionality reduction to your vectors to reduce memory usage and improve performance. Is this the right direction?
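
To make the discussion concrete, here is a minimal sketch of how assumptions 1, 3, and 4 could fit together: one immutable, in-memory index built from one SQLite table and queried by brute-force cosine similarity. This is not tinyvector's actual API. The names `ImmutableIndex`, `pack`, `unpack`, and `float32_bytes`, the `docs` table, and its `id`/`embedding` columns are all hypothetical, and the sketch uses only the standard library to stay self-contained, where a real implementation would presumably do the math in PyTorch.

```python
import math
import sqlite3
import struct
from dataclasses import dataclass


def pack(vec):
    """Serialize a sequence of floats into a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    """Deserialize a float32 BLOB back into a tuple of floats."""
    return struct.unpack(f"{len(blob) // 4}f", blob)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return tuple(x / norm for x in vec)


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after build
class ImmutableIndex:
    """Hypothetical index tied to exactly one table and held fully in memory."""
    table: str
    ids: tuple
    vectors: tuple  # unit-length vectors, aligned with ids

    @classmethod
    def build(cls, conn, table):
        # Assumes the table has an integer `id` and a float32 BLOB `embedding`.
        # The table name is interpolated directly; real code should validate it.
        rows = conn.execute(f'SELECT id, embedding FROM "{table}"').fetchall()
        ids = tuple(row_id for row_id, _ in rows)
        vectors = tuple(normalize(unpack(blob)) for _, blob in rows)
        return cls(table, ids, vectors)

    def query(self, vec, k=1):
        """Return the k most similar rows as (id, cosine similarity) pairs."""
        q = normalize(vec)
        scored = [
            (sum(a * b for a, b in zip(v, q)), row_id)
            for row_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(reverse=True)
        return [(row_id, score) for score, row_id in scored[:k]]


def float32_bytes(num_vectors, dim):
    """Raw memory needed to hold the vectors as float32, ignoring overhead."""
    return num_vectors * dim * 4


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, embedding BLOB)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [(1, pack([1.0, 0.0])), (2, pack([0.0, 1.0])), (3, pack([0.7, 0.7]))],
)

index = ImmutableIndex.build(conn, "docs")
top = index.query([0.9, 0.1], k=2)  # ids 1 and 3 are the closest rows
```

Under this design, picking up new rows means calling `ImmutableIndex.build` again and dropping the old object rather than mutating it in place. As a rough sizing check for assumption 4, `float32_bytes(1_000_000, 1536)` comes to 6,144,000,000 bytes, so a million 1536-dimensional float32 vectors need about 6 GB before any overhead.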