DeployQL / LintDB

Vector Database with support for late interaction and token level embeddings.
https://www.lintdb.com/
Apache License 2.0

Add a schema to the database #38

Closed mtbarta closed 3 months ago

mtbarta commented 3 months ago

This is a major change.

Breaking changes

We introduce a schema to the database. We can now index, store, and filter fields of different data types, and we can compose different queries and scoring methods.

What problem does this solve?

ColBERT and other heavyweight retrieval mechanisms can be slow because there are more embeddings to compare per document. This makes it necessary to filter documents or iteratively reduce the number of documents scored.

How did we solve it?

Schemas enable more flexible queries. Filtering becomes an option, and we can choose to score documents based on each matched element.
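
As a rough illustration of the idea, here is a minimal sketch of a schema with typed fields and a filter-then-score pass. All names here (`DataType`, `Field`, `Schema`, `filter_then_score`) are hypothetical and are not LintDB's actual API; the point is only that a cheap filter on a typed field shrinks the set of documents that reach the expensive scoring step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class DataType(Enum):
    TEXT = "text"
    INTEGER = "integer"
    TENSOR = "tensor"


@dataclass(frozen=True)
class Field:
    # Hypothetical field description, not LintDB's real schema type.
    name: str
    data_type: DataType
    indexed: bool = False  # searchable through the index
    stored: bool = False   # returned with results


@dataclass
class Schema:
    fields: list[Field]

    def validate(self, doc: dict[str, Any]) -> None:
        """Reject documents carrying fields the schema does not declare."""
        known = {f.name for f in self.fields}
        unknown = set(doc) - known
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")


def filter_then_score(docs, predicate, score):
    """Score only the documents that pass the filter, best first."""
    survivors = [d for d in docs if predicate(d)]
    return sorted(survivors, key=score, reverse=True)


schema = Schema([
    Field("year", DataType.INTEGER, stored=True),
    Field("embedding", DataType.TENSOR, indexed=True),
])

docs = [
    {"year": 2019, "embedding": [0.1, 0.9]},
    {"year": 2023, "embedding": [0.8, 0.2]},
    {"year": 2024, "embedding": [0.3, 0.7]},
]
for d in docs:
    schema.validate(d)

query = [1.0, 0.0]
ranked = filter_then_score(
    docs,
    predicate=lambda d: d["year"] >= 2023,  # cheap filter on a typed field
    score=lambda d: sum(q * x for q, x in zip(query, d["embedding"])),
)
```

Only the two documents from 2023 onward are scored; the 2019 document never reaches the scoring function.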

DocumentProcessor

Our main new abstraction is document processing, which has been broken out from index writing. The DocProcessor branches on each supported data type and optionally quantizes tensors along the way.
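
The branching described above might look roughly like the following sketch. The function names and the record layout are hypothetical, and the quantizer is a simple symmetric int8 scalar quantizer used as a stand-in; it is not necessarily the scheme LintDB uses.

```python
from typing import Any


def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric scalar quantization to the int8 range [-127, 127]."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Approximate inverse of quantize_int8."""
    return [c * scale for c in codes]


def process_field(data_type: str, value: Any, quantize: bool = False) -> dict:
    """Turn one raw field value into a record ready for an index writer."""
    if data_type == "text":
        return {"type": "text", "value": str(value)}
    if data_type == "integer":
        return {"type": "integer", "value": int(value)}
    if data_type == "tensor":
        if quantize:
            codes, scale = quantize_int8(value)
            return {"type": "tensor", "codes": codes, "scale": scale}
        return {"type": "tensor", "value": list(value)}
    raise ValueError(f"unsupported data type: {data_type}")


def process_document(schema: dict[str, str], doc: dict[str, Any],
                     quantize: bool = False) -> dict:
    """Process every field of a document according to its declared type."""
    return {
        name: process_field(schema[name], value, quantize)
        for name, value in doc.items()
    }


record = process_document(
    {"title": "text", "vec": "tensor"},
    {"title": "a", "vec": [0.5, -1.0, 0.25]},
    quantize=True,
)
```

Keeping this step separate from index writing means the writer only ever sees already-typed, already-quantized records.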

ColBERT fields are a special case. ColBERT is both indexed and contextual, in that we search the index but don't retrieve data from that field. During scoring, we scan the context field to get all token embeddings at once.

Scoring

Retrievers have been generalized into scoring. It's still a WIP, but we have the concept of retrieval and ranking. Combined with different types of fields, we can think of ColBERT as indexed with contextual data and XTR as indexed only.
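
To make the retrieval/ranking split concrete, here is a toy sketch in plain Python. The function names and data layout are hypothetical and are not LintDB's API. The contextual ranker uses the standard MaxSim late-interaction score over all of a candidate's token embeddings; the index-only ranker is a simplification in the spirit of XTR that scores from retrieved tokens alone and counts unretrieved tokens as 0, omitting any imputation of missing similarities.

```python
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_tokens, token_index, k):
    """Stage 1: per query token, the k best (score, doc_id) hits from the token index."""
    return [
        sorted(((dot(q, vec), doc_id) for doc_id, vec in token_index), reverse=True)[:k]
        for q in query_tokens
    ]


def rank_with_context(query_tokens, hits, context):
    """ColBERT-like: re-read every token embedding of each candidate, then MaxSim."""
    candidates = {doc_id for per_token in hits for _, doc_id in per_token}
    scores = {
        doc_id: sum(max(dot(q, d) for d in context[doc_id]) for q in query_tokens)
        for doc_id in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_index_only(hits):
    """XTR-like: score from the retrieved tokens alone; unseen tokens count as 0."""
    scores = defaultdict(float)
    for per_token in hits:
        best = {}
        for score, doc_id in per_token:
            best[doc_id] = max(score, best.get(doc_id, score))
        for doc_id, score in best.items():
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: the "context" holds every token embedding per document.
context = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[1.0, 0.0], [1.0, 0.0]],
}
token_index = [(doc_id, vec) for doc_id, vecs in context.items() for vec in vecs]
query = [[1.0, 0.0], [0.0, 1.0]]

hits = retrieve(query, token_index, k=2)
contextual = rank_with_context(query, hits, context)
index_only = rank_index_only(hits)
```

In this toy data the contextual ranker puts document `a` first with a score of 2.0, because it sees both of `a`'s tokens. The index-only ranker ties the two documents at 1.0, since the top-k cutoff dropped `a`'s match for the first query token. That is the trade-off between the two field setups: the contextual scan costs an extra read but recovers similarities the index lookup missed.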

Collections

Collections have been removed. Collections made it easier to index data by passing in text and having LintDB embed it automatically. However, this blurs the main idea behind LintDB: storing and retrieving. We see collections coming back as extensions within the Python library.

Python bindings

The Python bindings previously used SWIG, whose interface files use a custom syntax to define which C++ is bound to Python. This became difficult to maintain: some of our data objects are most naturally translated to Python dictionaries, and that wasn't simple to accomplish.

We've migrated from SWIG to nanobind. With nanobind, bindings are declared directly in C++. There are still some growing pains, but it's much clearer how to define, override, or rename our bindings.

Documentation

Documentation is moving from Sphinx to MkDocs. The main problem was versioning our documentation: Sphinx did not have a clear enough way to handle this automatically, whereas MkDocs has mike to version docs.

We haven't figured out all of the bugs with translating our docstrings, but fixing this seems doable.