This is a POC about using JVector to build an index over vectors of floats.
JVector is the most advanced library to build indexes over this data type and it will be used in Cassandra 5.0.
Please note that when using the index you won't be doing a full table scan; on the other hand, the results will be an "approximation", which is fine for most use cases, especially Vector Search for Generative AI.
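To make the trade-off concrete, here is a minimal sketch of the exact full-scan search that the index lets us avoid. It is plain Java and does not use JVector; the class and method names are hypothetical, and cosine similarity is only an assumed scoring function for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical baseline, not part of this patch: exact nearest neighbours by full scan.
class ExactVectorSearch {

    /** Cosine similarity between two vectors of the same length. */
    static float cosine(float[] a, float[] b) {
        float dot = 0f, normA = 0f, normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    /** Scores every row (the full table scan) and returns the positions of the k most similar rows. */
    static List<Integer> topK(List<float[]> rows, float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            ids.add(i);
        }
        // Sort by descending similarity to the query vector.
        ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(rows.get(i), query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }
}
```

This always returns the true top k but touches every row; an approximate index visits only a fraction of the vectors and may occasionally miss one of the true neighbours.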
This is currently a POC.
Easy things to implement:
integrate with the DDL (we need to add more space in the index metadata for all the side parameters of the index)
integrate with the Planner (detect ORDER BY ... and decide to use the index)
Hard things:
find a way to not have the whole JVector index in memory
implement persistent data storage
implement checkpoint
implement a mapping from the "nodeId" (integer) to the primary key (byte array)
implement DELETE (not supported yet in JVector)
The main issue is that, as far as I can see, an index that is open for writing is always fully held in memory, and all we can do is flush it to disk periodically.
I cannot find a good alternative to flushing the index to disk as a whole: the only way I can see with the current version of JVector is to flush the index during a checkpoint.
I guess that in Cassandra there is no problem because they flush the index when the SSTable is flushed to disk, and then it becomes immutable.
In HerdDB we have long-lived table-wide indexes and the paging mechanism is handled in another way: pages are still immutable once they are flushed to disk, indexes have pages of their own, and index pages are flushed next to the data pages.
We will have to be creative or work with JVector folks to have more support there.
Also, it is awkward that we need to store the mapping between a "nodeId" and the PK of the record outside the JVector data set. Currently we can do it with the usual BLink, as we do for the PK (the PK index stores a mapping bytes -> long), but if we could store the PK inside JVector we would save some coordination (and very likely also disk accesses).
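As an illustration of the side mapping described above, here is a minimal in-memory sketch. It is not the BLink-based implementation and uses no JVector API: the class `NodeIdMapping` and its methods are hypothetical, and it assumes that node ids are dense integers assigned in insertion order.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: two-way mapping between integer node ids and primary keys.
class NodeIdMapping {
    // nodeId -> primary key; the list position is the node id.
    private final List<byte[]> keysByNodeId = new ArrayList<>();
    // primary key -> nodeId; ByteBuffer gives content-based equals/hashCode for byte arrays.
    private final Map<ByteBuffer, Integer> nodeIdsByKey = new HashMap<>();

    /** Returns the node id for the key, assigning the next free id if the key is new. */
    int assign(byte[] primaryKey) {
        ByteBuffer wrapped = ByteBuffer.wrap(primaryKey.clone());
        Integer existing = nodeIdsByKey.get(wrapped);
        if (existing != null) {
            return existing;
        }
        int nodeId = keysByNodeId.size();
        keysByNodeId.add(wrapped.array());
        nodeIdsByKey.put(wrapped, nodeId);
        return nodeId;
    }

    /** Resolves a node id returned by a vector search back to the record's primary key. */
    byte[] primaryKey(int nodeId) {
        return keysByNodeId.get(nodeId).clone();
    }
}
```

A real implementation would have to be persistent and checkpointed together with the vector index, which is exactly the coordination cost mentioned above.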
To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.