Upgrade mechanism for new embeddings table

datasette / datasette-embeddings

Store and query embedding vectors in Datasette tables

Apache License 2.0

4 stars 0 forks source link

Upgrade mechanism for new embeddings table #11

Closed simonw closed 7 months ago

simonw commented 7 months ago

Split from:

simonw commented 7 months ago

Options for this:

Documentation: tell people what commands to run to upgrade
Provide a tool (maybe CLI-based) for applying this upgrade
Upgrade automatically

I thought I'd deployed this to Datasette Cloud, but I just checked and it turns out I've only used this in demos there but not made it available to all users - so the need for automatic upgrade is greatly reduced.

simonw commented 7 months ago

I am going to document this exclusively in the release notes.

simonw commented 7 months ago

If you have previously calculated embedding vectors stored in columns on an existing table, you can migrate them to the new table structure by following these steps:

Create a new shadow table to store your embeddings. This table should have the name _embeddings_TABLE where TABLE is the name of your source table, and the same primary key (or compound primary keys). It should then have a emb_xxx column for each stored embedding vector.

One way to create this table is to select a single row and embed just that - this will create the table for you while only costing you one embedding run.

Run the following SQL query to copy across your vectors:

insert or replace into _embeddings_TABLE (id, emb_xxx)
select id, emb_xxx from TABLE where emb_xxx is not null

Drop the old column: alter table TABLE drop column emb_xxx

simonw commented 7 months ago

I'm going to try those instructions against this existing table:

CREATE TABLE [releases] (
   [html_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [author] INTEGER REFERENCES [users]([id]),
   [node_id] TEXT,
   [tag_name] TEXT,
   [target_commitish] TEXT,
   [name] TEXT,
   [draft] INTEGER,
   [prerelease] INTEGER,
   [created_at] TEXT,
   [published_at] TEXT,
   [body] TEXT,
   [repo] INTEGER REFERENCES [repos]([id])
, [reactions] TEXT, [mentions_count] INTEGER, [emb_text_embedding_3_small_512] BLOB)

simonw commented 7 months ago

Confirmed: I ran this and got the correct upgrade:

CREATE TABLE [_embeddings_releases] (
   [id] INTEGER PRIMARY KEY,
[emb_text_embedding_3_small_512] BLOB);

insert or replace into _embeddings_releases (id, emb_text_embedding_3_small_512)
  select id, emb_text_embedding_3_small_512 from releases where emb_text_embedding_3_small_512 is not null;

alter table releases drop column emb_text_embedding_3_small_512;

datasette / datasette-embeddings

Upgrade mechanism for new embeddings table #11

8

3