diennea / herddb

A JVM-embeddable Distributed Database
https://herddb.org
Apache License 2.0

Use JVector to index Vectors of floats - POC #814

Open eolivelli opened 1 year ago

eolivelli commented 1 year ago

This is a POC about using JVector to build an index over vectors of floats.

JVector is the most advanced library to build indexes over this data type and it will be used in Cassandra 5.0.

Please note that when using the index you won't be doing a full table scan, but on the other hand the results will be an "approximation". That is fine for most use cases, especially Vector Search for Generative AI.
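For comparison, here is a minimal sketch of the exact full-scan search that the index is meant to avoid. It is plain Java and does not use JVector or HerdDB; the class name `ExactVectorScan`, the choice of cosine similarity and the in-memory list of rows are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical baseline (not HerdDB or JVector code): exact nearest-neighbour
// search that scores every row, i.e. the "full table scan" case.
final class ExactVectorScan {

    /** Cosine similarity of two non-zero vectors of the same length. */
    static float cosine(float[] a, float[] b) {
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    /** Returns the positions of the k rows most similar to the query, best first. */
    static List<Integer> topK(List<float[]> rows, float[] query, int k) {
        float[] scores = new float[rows.size()];
        Integer[] ids = new Integer[rows.size()];
        // Full scan: each row is scored exactly once.
        for (int i = 0; i < ids.length; i++) {
            scores[i] = cosine(rows.get(i), query);
            ids[i] = i;
        }
        // Sort row positions by descending similarity.
        Arrays.sort(ids, (x, y) -> Float.compare(scores[y], scores[x]));
        return Arrays.asList(ids).subList(0, Math.min(k, ids.length));
    }

    public static void main(String[] args) {
        List<float[]> rows = List.of(
                new float[] {1f, 0f},
                new float[] {0f, 1f},
                new float[] {0.9f, 0.1f});
        System.out.println(topK(rows, new float[] {1f, 0f}, 2)); // prints [0, 2]
    }
}
```

The scan is exact, but its cost grows with the number of rows times the vector dimension. An approximate index trades a little recall to avoid scoring every row.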

This is currently a POC.

Easy things to implement:

Hard things:

The main issue is that, when the index is open for writing, it seems to be always fully stored in memory, and we can flush it to disk periodically.

I cannot find a good way to avoid flushing the index to disk; the only way I can see with the current version of JVector is to flush the index during a checkpoint. I guess that in Cassandra there is no problem, because they flush the index when the SSTable is flushed to disk and then it becomes immutable. In HerdDB we have long-lived, table-wide indexes and the paging mechanism is handled in another way: we still have immutable pages once they are flushed to disk, we have pages for indexes, and index pages are flushed next to the data pages.
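The "flush during a checkpoint" approach can be sketched roughly as follows. This is not the POC code and it does not call JVector: `CheckpointedVectorStore`, its binary layout and the `dirty` flag are hypothetical. It only illustrates the shape of the problem, where the whole memory-resident state is serialized into one immutable snapshot each time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: a writable structure that lives fully in memory and is
// written out as a whole, immutable snapshot when a checkpoint happens.
final class CheckpointedVectorStore {
    private final List<float[]> vectors = new ArrayList<>();
    private boolean dirty;

    /** Adds a vector to the in-memory state and marks it as needing a flush. */
    void add(float[] vector) {
        vectors.add(vector.clone());
        dirty = true;
    }

    /** Serializes the entire state if it changed since the last checkpoint. */
    Optional<byte[]> checkpoint() {
        if (!dirty) {
            return Optional.empty(); // nothing new to flush
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(vectors.size());
            for (float[] vector : vectors) {
                out.writeInt(vector.length);
                for (float value : vector) {
                    out.writeFloat(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        dirty = false;
        return Optional.of(bytes.toByteArray());
    }
}
```

Note that every checkpoint rewrites all vectors, not only the ones added since the previous snapshot, which is why this does not fit well with page-by-page flushing.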

We will have to be creative or work with JVector folks to have more support there.

It is also awkward that we need to store the mapping between a "nodeId" and the PK of the record outside the JVector data set. Currently we can do it with the usual BLink, as we do for the PK (the PK stores a mapping bytes -> long), but if we could store the PK inside JVector we would save some coordination (and very likely also disk accesses).
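A rough sketch of that side mapping, assuming node ids are dense `int` values assigned in insertion order. The `TreeMap` here is only a stand-in for the BLink structure, and `NodeIdMapping` with its methods is a hypothetical name, not a HerdDB or JVector API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: the vector index only knows node ids, so the table
// keeps its own mapping between node ids and primary keys.
final class NodeIdMapping {
    // Order keys by their unsigned byte values.
    private static final Comparator<byte[]> UNSIGNED_BYTES = Arrays::compareUnsigned;

    // Stand-in for the BLink-based mapping: PK bytes -> node id.
    private final TreeMap<byte[], Integer> pkToNode = new TreeMap<>(UNSIGNED_BYTES);
    // Reverse direction, indexed by node id, used to resolve search hits.
    private final List<byte[]> nodeToPk = new ArrayList<>();

    /** Assigns the next node id to a primary key, or returns the existing one. */
    int register(byte[] pk) {
        Integer existing = pkToNode.get(pk);
        if (existing != null) {
            return existing;
        }
        int nodeId = nodeToPk.size();
        byte[] copy = pk.clone(); // defensive copy, callers may reuse their buffer
        nodeToPk.add(copy);
        pkToNode.put(copy, nodeId);
        return nodeId;
    }

    /** Resolves a node id returned by the index back to the record's PK. */
    byte[] primaryKey(int nodeId) {
        return nodeToPk.get(nodeId).clone();
    }
}
```

Every search hit needs one extra lookup in this structure before the record can be read, which is the coordination (and possibly disk access) that storing the PK next to the vector would remove.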


eolivelli commented 1 year ago

This is the PR to add jvector in Cassandra https://github.com/apache/cassandra/pull/2673/files