datasette / datasette-query-assistant

Query databases and tables with AI assistance
Apache License 2.0
14 stars 2 forks source link

Retry errors up to three times #3

Closed simonw closed 4 months ago

simonw commented 5 months ago

Idea here is to use explain select ... to detect any errors in the returned SQL (without the more expensive job of actually executing it), then round-trip those back to the model as replies until it gets it right or stop after 3 goes.

simonw commented 5 months ago

Challenge: finding a prompt that triggers an error in the first place.

This one worked:

find outliers in the pypi_versions table

Against the https://datasette.io/content.db database.

CleanShot 2024-03-27 at 20 47 23@2x

simonw commented 5 months ago

Hah, it looped a bit and then returned this (after I decoded it):

http://127.0.0.1:8008/content?sql=Apologies, I should not have included the explanatory text in the SQL query. Here is the SQL query to find outliers in the pypi_versions table:

-- Find outliers in the pypi_versions table
-- Outliers are rows with id lengths that are significantly different from the majority

SELECT 
  *
FROM pypi_versions
WHERE ABS(LENGTH(id) - (
  SELECT 
    LENGTH(id)
  FROM (
    SELECT LENGTH(id), ROW_NUMBER() OVER (ORDER BY LENGTH(id)) AS rn
    FROM pypi_versions
  )
  WHERE rn = (SELECT COUNT(*) FROM pypi_versions) / 2 + 1
)) > 1.5 * (
  SELECT 
    MEDIAN(ABS(LENGTH(id) - (
      SELECT 
        LENGTH(id)
      FROM (
        SELECT LENGTH(id), ROW_NUMBER() OVER (ORDER BY LENGTH(id)) AS rn
        FROM pypi_versions
      )
      WHERE rn = (SELECT COUNT(*) FROM pypi_versions) / 2 + 1
    )))
  FROM pypi_versions
);

This query first calculates the median id length by finding the row in the middle of the sorted list of id lengths. It then calculates the median absolute deviation from the median, and selects any rows where the id length is more than 1.5 times the median absolute deviation from the median. This is a robust way to identify outliers in the pypi_versions table. -- Gave up after 3 attempts

simonw commented 5 months ago

Oh I think it's breaking because it tries to do explain -- comment goes here which Datasette rejects as not being a valid query.

simonw commented 5 months ago

I switched to Sonnet - Haiku kept returning multiple queries no matter how much I told it not to, Opus was too slow.

Very likely I could switch back to Haiku if I add a few more illustrative examples, including how to recover from an error.

simonw commented 5 months ago

I did get an interesting query out of this running my "show outliners" demo:

SELECT 
  'repos' AS table_name,
  'size' AS column_name,
  min(size) AS min_value,
  max(size) AS max_value,
  avg(size) AS avg_value,
  sum(case when size > (select avg(size) from repos)*10 then 1 else 0 end) AS num_outliers
FROM repos
UNION ALL
SELECT 
  'repos' AS table_name,
  'stargazers_count' AS column_name, 
  min(stargazers_count) AS min_value,
  max(stargazers_count) AS max_value,
  avg(stargazers_count) AS avg_value,
  sum(case when stargazers_count > (select avg(stargazers_count) from repos)*10 then 1 else 0 end) AS num_outliers 
FROM repos;

CleanShot 2024-03-27 at 22 16 35@2x