Closed simonw closed 7 months ago
Eventually this will support many embedding models, but I'll start with the new OpenAI ones that I first worked with here:
Tested this against a copy of https://datasette.io/content.db where I ran embeddings against the pypi_packages table:
select
  name,
  summary,
  vector_similarity(
    emb_text_embedding_3_small_512,
    (
      select
        emb_text_embedding_3_small_512
      from
        pypi_packages
      where
        name = :name
    )
  ) as score
from
  pypi_packages
where
  name != :name
order by
  score desc
That's using the vector_similarity() C function I created with ChatGPT Code Interpreter in https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/. A vector.dylib file for macOS is available here: https://static.simonwillison.net/static/2024/vector.dylib
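For anyone who can't load the compiled extension, here is a rough pure-Python stand-in registered through the standard library's sqlite3 module. It assumes vector_similarity() returns cosine similarity over BLOBs of packed little-endian 32-bit floats, which is an assumption about the extension and isn't confirmed in this issue. The `emb` column, the `encode()`/`decode()` helpers and the three sample rows are made up for illustration.

```python
import math
import sqlite3
import struct


def decode(blob):
    """Unpack a BLOB of little-endian float32 values (assumed storage format)."""
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


def encode(values):
    """Pack a list of floats into the same assumed BLOB format."""
    return struct.pack("<" + "f" * len(values), *values)


def vector_similarity(blob_a, blob_b):
    """Cosine similarity of two float32 BLOBs; None (SQL NULL) if undefined."""
    if blob_a is None or blob_b is None:
        return None
    a, b = decode(blob_a), decode(blob_b)
    if len(a) != len(b):
        return None
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)


def demo():
    """Run the same shape of query as above against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function under the name the SQL query expects
    conn.create_function("vector_similarity", 2, vector_similarity)
    conn.execute("create table pypi_packages (name text, emb blob)")
    conn.executemany(
        "insert into pypi_packages values (?, ?)",
        [
            ("alpha", encode([1.0, 0.0])),  # hypothetical sample rows
            ("beta", encode([0.9, 0.1])),
            ("gamma", encode([0.0, 1.0])),
        ],
    )
    sql = """
        select
          name,
          vector_similarity(
            emb,
            (select emb from pypi_packages where name = :name)
          ) as score
        from pypi_packages
        where name != :name
        order by score desc
    """
    return conn.execute(sql, {"name": "alpha"}).fetchall()
```

Calling `demo()` returns the other rows ordered by their similarity to "alpha". This will be much slower than the C extension on a real table, so treat it as a way to check query logic rather than a replacement.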
I ran Datasette like this:
datasette content.db \
  -p 8045 --root --secret 1 \
  -s plugins.datasette-embeddings.api_key "$OPENAI_API_KEY" \
  --load-extension vector.dylib
And did the embedding operation like this:
Also experimented with this query, to see if I could use CTE tricks to add sort by rank to the existing table pages:
with _filtered as (
  select
    name,
    summary,
    classifiers,
    description,
    author,
    author_email,
    description_content_type,
    home_page,
    keywords,
    license,
    maintainer,
    maintainer_email,
    package_url,
    platform,
    project_url,
    project_urls,
    release_url,
    requires_dist,
    requires_python,
    version,
    yanked,
    yanked_reason,
    dynamic,
    provides_extra,
    emb_text_embedding_3_small_512
  from
    pypi_packages
)
select
  vector_similarity(
    _filtered.emb_text_embedding_3_small_512,
    (
      select
        emb_text_embedding_3_small_512
      from
        _filtered
      where
        name = 'airtable-export'
    )
  ) as _score,
  _filtered.*,
  hex(_filtered.emb_text_embedding_3_small_512)
from
  _filtered
order by
  _score desc
This is called datasette-embeddings and not datasette-enrichments-embeddings because it's going to do more than just provide the enrichment: it will also provide utilities for running related-content and similar-to-text queries. I haven't decided quite how yet.