asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!
Apache License 2.0
4.3k stars 136 forks source link

Integrating sqlite-vec with FTS5 for Combined Search Capabilities #48

Closed yushaw closed 4 months ago

yushaw commented 4 months ago

Hello,

I'm currently using FTS5 for full-text search in my application and am interested in adding vector-based search capabilities using sqlite-vec. I add an external table as this part describe. https://www.sqlite.org/fts5.html#external_content_and_contentless_tables

I have a few questions regarding compatibility and integration:

Compatibility: Can sqlite-vec and FTS5 be used together within the same database? Specifically, how can I manage separate indexing for text and vector data, and are there any considerations I should be aware of?

Integration: What is the recommended method for linking sqlite-vec vector tables with FTS5 indexed data? Should I use identifiers like rowid for cross-referencing?

Many thanks

asg017 commented 4 months ago

Hey @yushaw! Yes it's possible to use FTS5 tables alongside sqlite-vec, I just haven't documented it (besides this tweet).

The strategy, based on this supbase article for postgres at a high level: You'll have one fts5 virtual table and one vec0 virtual table, that share the same rowid so you can do JOINs. Insert text you want to search in the fts5 table, and also store embeddings of that text in the vec0 table.

Then when you want to perform a search on both full text and vector search, you'll use "Reciprocal Ranked Fusion. " Basically you do both queries in separate CTE steps, then 'combine' the results, ranking results that are in both higher.

Here's a full example, again mostly from that supabase article. It uses sqlite-lembed to generate embeddings, but you can just bind your vectors as JSON/BLOBs instead of using lembed.

.bail on

.load ./dist/lembed0
.load ../sqlite-vec/dist/vec0
.echo on

select sqlite_version();

INSERT INTO temp.lembed_models(key, model)
  select 'all-MiniLM-L6-v2', lembed_model_from_file('models/all-MiniLM-L6-v2-44eb4044.gguf');

create table items as select value as sentence from json_each('[
    "The dog is barking",
    "The cat is purring",
    "The bear is growling",
    "The computer is buzzing",
    "halloween weekend was a success in long beach"
]');

create virtual table fts_items using fts5(sentence);
insert into fts_items(rowid, sentence)
  select rowid, sentence from items;

create virtual table vec_items using vec0(sentence_embedding float[384]);
insert into vec_items(rowid, sentence_embedding)
  select rowid, lembed('all-MiniLM-L6-v2', sentence) from items;

.mode box
.header on

.param set :query 'california mascot'
.param set :weight_fts 1.0
.param set :weight_vec 1.0
.param set :rrf_k 60

with vec_matches as (
  select
    rowid,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_items
  where sentence_embedding match  lembed('all-MiniLM-L6-v2', :query)
    and k = 3
  order by distance
),
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_items
  where sentence match :query
),
final as (
  select
    items.rowid,
    items.sentence,
    vec_matches.distance as vector_distance,
    fts_matches.score as fts_score,
    coalesce(1.0 / (:rrf_k + fts_matches.rowid), 0.0) * :weight_fts +
      coalesce(1.0 / (:rrf_k + vec_matches.rowid), 0.0) * :weight_vec
      as combined_score

  from fts_matches
  full outer join vec_matches on vec_matches .rowid = fts_matches.rowid
  join items on coalesce(fts_matches.rowid, vec_matches.rowid) = items.rowid
  order by combined_score
)
select * from final;

Again I'll document this soon, lmk if you need more samples

yushaw commented 4 months ago

Thank you for the detailed response and example! It's great to know that they can be easily used together. I'll start implementing this strategy. Looking forward to the upcoming documentation.

Thanks again for your clearance!