0hq / tinyvector

A tiny nearest-neighbor embedding database built with SQLite and PyTorch. (In development!)
MIT License


tinyvector - the tiny, least-dumb, speedy vector embedding database.
No, you don't need a vector database. You need tinyvector.

In pre-release: still in development and not ready for production. Prod-ready by late July.

Features

Soon

Versions

🦀 tinyvector in Rust: tinyvector-rs
🐍 tinyvector in Python: tinyvector

We're better than ...

In most cases, most vector databases are overkill for something simple like:

  1. Using embeddings to chat with your documents. Most document collections are nowhere near large enough to justify accelerating search with HNSW or FAISS.
  2. Doing search for your website or store. Unless you're selling 1,000,000 items, you don't need Pinecone.
  3. Performing complex search queries on a very large database. Even with 2 million embeddings, tinyvector might still be the better option, because vector databases struggle with complex filtering. tinyvector doesn't support metadata filtering just yet, but it's easy to add yourself.
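
To make the "you don't need an index" argument concrete, here is a minimal sketch of brute-force search with an optional metadata filter. It is not tinyvector's API: `normalize`, `brute_force_search`, the `(id, vector, metadata)` tuple layout, and the `keep` predicate are all hypothetical names chosen for illustration, and it uses only the Python standard library rather than PyTorch.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def brute_force_search(query, items, k=3, keep=None):
    """Return the k most similar items as (score, id) pairs, best first.

    items: iterable of (id, unit_vector, metadata) tuples (hypothetical layout).
    keep:  optional predicate on metadata; rows that fail it are skipped.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), item_id)
        for item_id, vec, meta in items
        if keep is None or keep(meta)
    )
    return heapq.nlargest(k, scored)


# Toy data: 2-dimensional vectors, normalized once at insert time.
items = [
    ("doc-a", normalize([1.0, 0.0]), {"lang": "en"}),
    ("doc-b", normalize([0.9, 0.1]), {"lang": "fr"}),
    ("doc-c", normalize([0.0, 1.0]), {"lang": "en"}),
]

top = brute_force_search([1.0, 0.05], items, k=2)
english_only = brute_force_search(
    [1.0, 0.05], items, k=2, keep=lambda meta: meta["lang"] == "en"
)
```

Every row is scored on every query, so the cost grows linearly with the number of stored vectors. The filter runs before scoring, which is why arbitrary predicates are cheap to add in this style of search.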

Usage

# Run the server manually:
pip install -r requirements.txt
python -m server

# Run tests:
pip install pytest pytest-mock
pytest

Embeddings?

What are embeddings?

As simply as possible: embeddings are a way to compare things the way humans do, by converting text into a small list of numbers. Similar pieces of text get similar numbers; different ones get very different numbers.

Read OpenAI's explanation.
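
A small sketch of what "similar numbers" means in practice, using cosine similarity as the comparison. The three vectors below are made up by hand for illustration (real embeddings come from an embedding model), and `cosine_similarity` is a helper written for this example, not a tinyvector function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written 3-number "embeddings" (assumed values, not model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

cat_vs_kitten = cosine_similarity(cat, kitten)    # close to 1: similar
cat_vs_invoice = cosine_similarity(cat, invoice)  # close to 0: unrelated
```

Nearest-neighbor search is then just "find the stored vectors with the highest similarity to the query vector".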

Get involved

tinyvector is going to be growing a lot (don't worry, it will still be tiny). Feel free to make a PR and contribute. If you have questions, just mention @willdepue.

Some ideas for first pulls:

Known Issues

# Major bugs:
Possible SQLite data corruption: stored vectors end up changing. To reproduce, create a table, insert vectors, create an index, then keep using the database until an error happens. Dims end up mismatched (most likely the blob functions or the norm functions, but that doesn't explain why the database is changing).
PCA is not tested, and neither is the immutable Brute Force index.
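
If you want to help narrow down the blob suspicion, one possible starting point is a round-trip check like the sketch below. Everything in it is an assumption: the `vectors` table layout, the float32 little-endian encoding, and the `to_blob` / `from_blob` helpers are hypothetical and may not match how tinyvector actually serializes vectors.

```python
import sqlite3
import struct


def to_blob(vec):
    """Pack floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vec)}f", *vec)


def from_blob(blob, dim):
    """Unpack a blob, failing loudly if its size does not match dim."""
    if len(blob) != 4 * dim:
        raise ValueError(f"expected {4 * dim} bytes, got {len(blob)}")
    return list(struct.unpack(f"<{dim}f", blob))


conn = sqlite3.connect(":memory:")
# Hypothetical table layout, not tinyvector's real schema.
conn.execute(
    "CREATE TABLE vectors (id INTEGER PRIMARY KEY, dim INTEGER, data BLOB)"
)

original = [0.5, -1.25, 2.0, 0.0]  # values exactly representable in float32
conn.execute(
    "INSERT INTO vectors (dim, data) VALUES (?, ?)",
    (len(original), to_blob(original)),
)

dim, data = conn.execute(
    "SELECT dim, data FROM vectors WHERE id = 1"
).fetchone()
restored = from_blob(data, dim)
conn.close()
```

Storing the dimension next to the blob and checking the byte length on read turns a silent dimension mismatch into an immediate error, which should make it easier to see which step changes the data.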

License

MIT