Open gospodinbodurov opened 1 month ago
@gospodinbodurov thanks for the report! Initially I though UPSERT
wasn't support for virtual tables, as mentioned in the SQLite docs:
UPSERT does not currently work for virtual tables.
But now I think that's outdated. Other docs mention a few other APIs that may help:
So I'll try to include proper UPSERT support in an upcoming release!
You are welcome.
P.S. Not sure whether this is the right place for asking, but where I can find the different supported distance metrics? I guess at the moment cosine and l2 are the only ones?
cosine, l1, and l2. is there any other you'd be interested in?
At the moment I think the existing ones are good enough, but I am wondering how to use them - could not find their usage in the documentation.
P.S. I plan integrating this project in our platform and that's the reason I am asking about the upsert functionality and what distance functions are supported. If upsert will not be supported I will rebuild the whole database from scratch during updates.
docs are lacking but for vec0 tables you can specify a 'distance_metric=cosine' option after the vector column declaration, hoping to update the docs early next week with samples
i'll also try to get upsert support soon! dont think it'll take too long but i won't be at a computer for a few more days
@asg017 I'd also like to do query / searches based on Hamming distance of a binary quantized vector. Something like distance metric = hamming
@z3ugma if you use bit[N]
for your vector column type, then the MATCH
operator will automatically use hamming distance for you:
CREATE VIRTUAL TABLE vec_sample using vec0(
bit_embedding bit[8]
);
insert into vec_sample(rowid, bit_embedding) values
(1, vec_bit(X'FF')),
(2, vec_bit(X'00')),
(3, vec_bit(X'0F'));
select
rowid,
vec_to_json(bit_embedding),
distance
from vec_sample
where bit_embedding match vec_bit(X'FF') and k = 3;
/*
┌───────┬────────────────────────────┬──────────┐
│ rowid │ vec_to_json(bit_embedding) │ distance │
├───────┼────────────────────────────┼──────────┤
│ 1 │ '[1,1,1,1,1,1,1,1]' │ 0.0 │
│ 3 │ '[1,1,1,1,0,0,0,0]' │ 4.0 │
│ 2 │ '[0,0,0,0,0,0,0,0]' │ 8.0 │
└───────┴────────────────────────────┴──────────┘
*/
This was done because cosine/l2/l1 distance didn't make sense for binary vectors, and hamming distance was the only distance metric that makes sense for binary vectors.
Happy to support a distance_metric=
option for binary vectors if there are use-cases for other bit vector distance metrics!
This is great! Will it return simultaneous responses for an array of input query vectors?
Or does the query need to go one input embedding at a time
I tried the following query:
INSERT INTO vec_items(rowid, embedding) VALUES (?, ?) ON CONFLICT(rowid) DO UPDATE SET embedding = excluded.embedding;
and got:
UPSERT not implemented for virtual table "vec_items"
Is upsert supported in some form?