datasette / datasette-embeddings

Store and query embedding vectors in Datasette tables
Apache License 2.0

Initial plugin #1

Closed simonw closed 7 months ago

simonw commented 7 months ago

This is called datasette-embeddings and not datasette-enrichments-embeddings because it's going to do more than just provide the enrichment - it will also provide utilities for running related-content and similar-to-text queries. I haven't decided quite how yet.

simonw commented 7 months ago

Eventually this will support many embedding models, but I'll start with the new OpenAI ones that I first worked with here:

simonw commented 7 months ago

Tested this against a copy of https://datasette.io/content.db where I ran embeddings against the pypi_packages table:

[Screenshot: CleanShot 2024-03-24 at 21 59 59@2x]

select
  name,
  summary,
  vector_similarity(
    emb_text_embedding_3_small_512,
    (
      select
        emb_text_embedding_3_small_512
      from
        pypi_packages
      where
        name = :name
    )
  ) as score
from
  pypi_packages
where
  name != :name
order by
  score desc

That's using the vector_similarity() C function I created with ChatGPT Code Interpreter in https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/. A vector.dylib build for macOS is available here: https://static.simonwillison.net/static/2024/vector.dylib
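For anyone without the compiled extension, a rough pure-Python stand-in can be registered on a sqlite3 connection to try the same query shape. This is only a sketch under assumptions: it assumes the embedding column holds a BLOB of little-endian float32 values and that vector_similarity() returns cosine similarity. The `packages` table, the `emb` column and the `encode()`/`decode()` helpers are hypothetical names with toy data, not part of the plugin.

```python
import math
import sqlite3
import struct


def decode(blob):
    """Unpack a BLOB of little-endian float32 values into a tuple of floats."""
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


def encode(values):
    """Pack floats into a little-endian float32 BLOB (assumed storage format)."""
    return struct.pack("<" + "f" * len(values), *values)


def vector_similarity(blob_a, blob_b):
    """Cosine similarity of two float32 BLOBs, or None if it cannot be computed."""
    if blob_a is None or blob_b is None:
        return None
    if len(blob_a) != len(blob_b) or len(blob_a) % 4 != 0:
        return None
    a, b = decode(blob_a), decode(blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)


conn = sqlite3.connect(":memory:")
# Register the Python function under the same SQL name as the C extension.
conn.create_function("vector_similarity", 2, vector_similarity)

# Toy two-dimensional vectors, purely illustrative.
conn.execute("create table packages (name text, summary text, emb blob)")
conn.executemany(
    "insert into packages values (?, ?, ?)",
    [
        ("a", "first", encode([1.0, 0.0])),
        ("b", "close to a", encode([0.9, 0.1])),
        ("c", "orthogonal to a", encode([0.0, 1.0])),
    ],
)

# Same shape as the related-content query above, with :name bound from Python.
rows = conn.execute(
    """
    select name, summary,
      vector_similarity(emb, (select emb from packages where name = :name)) as score
    from packages
    where name != :name
    order by score desc
    """,
    {"name": "a"},
).fetchall()
for row in rows:
    print(row)
```

A Python callback like this will be far slower than the C extension on a table the size of pypi_packages, so it is mainly useful for checking query logic.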

I ran Datasette like this:

datasette content.db \
  -p 8045 --root --secret 1 \
  -s plugins.datasette-embeddings.api_key $OPENAI_API_KEY \
  --load-extension vector.dylib

And did the embedding operation like this:

[Screenshot: CleanShot 2024-03-24 at 22 09 01@2x]

simonw commented 7 months ago

Also experimented with this query, to see if I could use CTE tricks to add sort by rank to the existing table pages:

with _filtered as (
  select
    name,
    summary,
    classifiers,
    description,
    author,
    author_email,
    description_content_type,
    home_page,
    keywords,
    license,
    maintainer,
    maintainer_email,
    package_url,
    platform,
    project_url,
    project_urls,
    release_url,
    requires_dist,
    requires_python,
    version,
    yanked,
    yanked_reason,
    dynamic,
    provides_extra,
    emb_text_embedding_3_small_512
  from
    pypi_packages
)
select
  vector_similarity(
    _filtered.emb_text_embedding_3_small_512,
    (
      select
        emb_text_embedding_3_small_512
      from
        _filtered
      where
        name = 'airtable-export'
    )
  ) as _score,
  _filtered.*,
  hex(_filtered.emb_text_embedding_3_small_512)
from
  _filtered
order by
  _score desc
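The CTE above lists every column by hand. One possible way to generalize it to other tables is to build the column list from the table schema. The sketch below is an assumption about how that might look, not how the plugin does it: `ranked_query()`, `quote()` and the `demo` table are hypothetical, and it relies on SQLite's `pragma_table_info()` table-valued function (SQLite 3.16 or later). It only builds the SQL; running the result still needs a vector_similarity() function to be loaded.

```python
import sqlite3


def quote(identifier):
    """Quote an SQLite identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def ranked_query(conn, table, emb_column, key_column):
    """Build a CTE query that scores every row of `table` against the row
    whose key_column equals the :key parameter."""
    columns = [
        row[0]
        for row in conn.execute("select name from pragma_table_info(?)", (table,))
    ]
    if emb_column not in columns or key_column not in columns:
        raise ValueError("unknown column")
    column_list = ",\n    ".join(quote(c) for c in columns)
    emb = quote(emb_column)
    return (
        "with _filtered as (\n"
        f"  select\n    {column_list}\n  from {quote(table)}\n"
        ")\n"
        "select\n"
        "  vector_similarity(\n"
        f"    _filtered.{emb},\n"
        f"    (select {emb} from _filtered where {quote(key_column)} = :key)\n"
        "  ) as _score,\n"
        "  _filtered.*\n"
        "from _filtered\n"
        "order by _score desc"
    )


conn = sqlite3.connect(":memory:")
# Hypothetical table used only to demonstrate the generated SQL.
conn.execute("create table demo (name text, summary text, emb blob)")
sql = ranked_query(conn, "demo", "emb", "name")
print(sql)
```

The key value is left as a named parameter (:key) rather than interpolated into the SQL, so it can be bound safely at execution time.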