datasette / datasette-query-assistant

Query databases and tables with AI assistance
Apache License 2.0

Send extra information to accompany table schema #2

Open simonw opened 5 months ago

simonw commented 5 months ago

For each table:

And for each column in the table:

I should cache this, as it may be expensive to calculate, especially for wide tables.

YAML is a good format here.

simonw commented 5 months ago

For caching, I can stick this all in a table in _internal (because I don't want to accidentally leak details of tables that another user should not be able to see).
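A minimal sketch of what that cache table might look like. The table and column names here are illustrative assumptions, not the plugin's actual schema; the idea is a row per (database, table) pair plus a schema hash so stale entries can be detected when a table's DDL changes:

```python
import sqlite3

# Hypothetical cache table sketch - names are illustrative, not the
# plugin's real schema. Keyed by (database_name, table_name), storing
# the generated column details blob plus a hash of the table's schema
# so cached details can be invalidated when the table changes.
CREATE_CACHE_SQL = """
create table if not exists query_assistant_table_details (
    database_name text,
    table_name text,
    schema_hash text,   -- hash of the table's CREATE TABLE statement
    details text,       -- cached column details (e.g. YAML)
    updated_at text,
    primary key (database_name, table_name)
)
"""

def ensure_cache_table(conn: sqlite3.Connection) -> None:
    conn.execute(CREATE_CACHE_SQL)

if __name__ == "__main__":
    # :memory: stands in for Datasette's _internal database here
    conn = sqlite3.connect(":memory:")
    ensure_cache_table(conn)
```

Keeping this in `_internal` (rather than the user-visible databases) matches the permission concern above: cached details never show up as a queryable table for users who shouldn't see them.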

On server startup I'll calculate details for tables that are not yet calculated.

I'll occasionally check to see if there are new tables, including when the user attempts to use this feature.

The page that lets you select tables will always list all tables, rather than just the ones with cached details.

The POST handler will refresh details on any tables that need them. I'll use a time limit such that it doesn't spend too long on this, sending just the schema for tables that I fail to retrieve details for.
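That time-limited refresh could be sketched roughly like this. Everything here is hypothetical scaffolding: `calculate_details` stands in for the expensive per-table analysis, and a `None` result means "budget ran out, send just the schema for this table":

```python
import time

def refresh_details(conn, table_names, calculate_details, budget_seconds=2.0):
    """Refresh details for tables, stopping once the time budget is spent.

    Returns a dict mapping table name to details, or None for tables
    skipped because the budget ran out - callers fall back to sending
    just the schema for those. `calculate_details` is a hypothetical
    callable standing in for the per-table analysis.
    """
    deadline = time.monotonic() + budget_seconds
    results = {}
    for name in table_names:
        if time.monotonic() >= deadline:
            results[name] = None  # schema-only fallback
            continue
        results[name] = calculate_details(conn, name)
    return results
```

Tables processed before the deadline get full details; anything after it degrades gracefully instead of blocking the POST request.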

Gathering the common values for each column is what's likely to take the most time - that, and counting the not-null values in each column.

Or... I could look at the most common values in just the first 1,000 rows of a table. That way it will run fast enough that I never need to worry about time limits. The same goes if I drop the not-null counts - much more interesting is whether there was at least one null in the first 1,000 rows.

Maybe I can compose a single SQL query that CTEs the first 1,000 rows and then uses them to gather all the details in one go?

simonw commented 5 months ago

Tried this:

WITH rows AS (SELECT * FROM "my_cool_table" LIMIT 1000)
SELECT 'OACODE' AS column_name, 
       json_group_array(distinct OACODE) AS top_5_values
  FROM (SELECT OACODE FROM rows WHERE OACODE IS NOT NULL ORDER BY OACODE DESC LIMIT 5)
UNION ALL 
SELECT 'COUNTY_CODE' AS column_name, 
       json_group_array(distinct COUNTY_CODE) AS top_5_values
  FROM (SELECT COUNTY_CODE FROM rows WHERE COUNTY_CODE IS NOT NULL ORDER BY COUNTY_CODE DESC LIMIT 5)
UNION ALL 
SELECT 'WARD_CODE' AS column_name, 
       json_group_array(distinct WARD_CODE) AS top_5_values
  FROM (SELECT WARD_CODE FROM rows WHERE WARD_CODE IS NOT NULL ORDER BY WARD_CODE DESC LIMIT 5)
UNION ALL 
SELECT 'OA_SEQ_NUMBER' AS column_name, 
       json_group_array(distinct OA_SEQ_NUMBER) AS top_5_values
  FROM (SELECT OA_SEQ_NUMBER FROM rows WHERE OA_SEQ_NUMBER IS NOT NULL ORDER BY OA_SEQ_NUMBER DESC LIMIT 5)

Seems to work. Need to poke it a bit more.

simonw commented 5 months ago

Cobbled this together:

import json
import sqlite3
import yaml
from collections import OrderedDict

def generate_query(conn, table_name, num_examples=4, consider_top=1000):
    # Get the list of columns from the table
    cursor = conn.cursor()
    cursor.execute(f"pragma table_info({table_name})")
    columns = [row[1] for row in cursor.fetchall()]

    # Construct the SQL query with a union all branch for each column
    query_parts = []
    for column in columns:
        query_part = f"""
            select '{column}' as column_name, 
                   json_group_array("{column}") as examples
              from (select "{column}", count(*) as cnt
              from _rows where "{column}" is not null
              group by "{column}" order by cnt desc limit {num_examples})
        """
        query_parts.append(query_part)

    query = f'with _rows as (select * from "{table_name}" limit {consider_top}) ' + " union all ".join(query_parts)
    return query

def maybe_truncate(item, length=32):
    if not isinstance(item, (int, float)) and len(str(item)) > length:
        item = str(item)[:length] + "..."
    return item

class literal(str):
    # Marker class: strings wrapped in literal() get dumped in folded (>) style
    pass

def literal_presenter(dumper, data):
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=">")

yaml.add_representer(literal, literal_presenter)

def represent_ordereddict(dumper, data):
    value = []
    for item_key, item_value in data.items():
        node_key = dumper.represent_data(item_key)
        node_value = dumper.represent_data(item_value)
        value.append((node_key, node_value))
    return yaml.nodes.MappingNode(u"tag:yaml.org,2002:map", value)

yaml.add_representer(OrderedDict, represent_ordereddict)

# conn is an open sqlite3 connection to the database being inspected
table_name = 'repos'
sql = generate_query(conn, table_name)

columns = []
for column_name, json_examples in conn.execute(sql):
    examples = [maybe_truncate(example) for example in json.loads(json_examples)]
    columns.append(OrderedDict({
        "name": column_name,
        "examples": examples
    }))

print(yaml.dump(columns))

Output (truncated) looks like this:

- name: network_count
  examples:
  - 0
  - 1
  - 2
  - 4
- name: subscribers_count
  examples:
  - 1
  - 2
  - 3
  - 4
- name: readme
  examples:
  - '# yaml-to-sqlite

    [![PyPI](https...'
  - '# twitter-to-sqlite

    [![PyPI](ht...'
  - '# tableau-to-sqlite

    [![PyPI](ht...'
  - '# swarm-to-sqlite

    [![PyPI](http...'
- name: readme_html
  examples:
  - <div id="readme" class="md" data...
  - <div id="readme" class="md" data...
  - <div id="readme" class="md" data...
  - <div id="readme" class="md" data...
- name: allow_forking
  examples:
  - 1
- name: visibility
  examples:
  - public
- name: is_template
  examples:
  - 0
- name: template_repository
  examples:
  - '{"id": 400878073, "node_id": "MD...'
  - '{"id": 400878073, "node_id": "MD...'
  - '{"id": 401177473, "node_id": "MD...'
  - '{"id": 400878073, "node_id": "MD...'
- name: web_commit_signoff_required
  examples:
  - 0
- name: has_discussions
  examples:
  - 0
  - 1

simonw commented 5 months ago

Tried this against a table with 45 columns:

import time
start = time.time()

sql = generate_query(conn, table_name)

columns = []
for column_name, json_examples in conn.execute(sql):
    examples = [maybe_truncate(example) for example in json.loads(json_examples)]
    columns.append(OrderedDict({
        "name": column_name,
        "examples": examples
    }))

end = time.time()
print(end - start)
print()

0.016857147216796875 - that's probably fast enough.

simonw commented 5 months ago

Here's a better idea: instead of sending all of that information invisibly (which may not be right for users who don't want any of their data beyond the schema sent to an AI model outside of their control), make this a visible part of the interface, so the user gets to see exactly what is about to be submitted.

Then the example values can be visibly shown, and users can decide if they want to send them or not - or if they want to anonymize them more first.

simonw commented 5 months ago

Maybe even provide a control for "add more examples".

simonw commented 4 months ago

Alternative idea for sending example rows - send one or two whole example rows, but try to find rows that have the fewest null or blank columns possible.

Here's a query to generate a query that finds the rows with the fewest blank columns:

SELECT 
    'SELECT
  ' || GROUP_CONCAT('(CASE WHEN ' || name || ' IS NOT NULL AND ' || name || ' != '''' THEN 1 ELSE 0 END)', ' +
  ') 
    || ' AS non_empty_count, *
  FROM ' || :table || ' ORDER BY non_empty_count DESC limit 3' AS query
FROM pragma_table_info(:table);
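The two-step flow (generate the query, then run it) could be driven from Python along these lines. This is a sketch against an in-memory example table - the `releases` table and its columns here are made up for the demo, and `least_blank_rows` is a hypothetical helper name:

```python
import sqlite3

# The query-generating query from above, verbatim
META_SQL = """
SELECT 
    'SELECT
  ' || GROUP_CONCAT('(CASE WHEN ' || name || ' IS NOT NULL AND ' || name || ' != '''' THEN 1 ELSE 0 END)', ' +
  ') 
    || ' AS non_empty_count, *
  FROM ' || :table || ' ORDER BY non_empty_count DESC limit 3' AS query
FROM pragma_table_info(:table);
"""

def least_blank_rows(conn, table):
    # Step 1: generate the per-table query from pragma_table_info()
    (generated_query,) = conn.execute(META_SQL, {"table": table}).fetchone()
    # Step 2: run the generated query - first result column is
    # non_empty_count, followed by the row's own columns
    return conn.execute(generated_query).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("create table releases (id integer, name text, body text)")
    conn.executemany(
        "insert into releases values (?, ?, ?)",
        [(1, "v1", "notes"), (2, "", None), (3, "v3", "")],
    )
    for row in least_blank_rows(conn, "releases"):
        print(row)  # fullest rows first
```

Note the table name is interpolated into the generated SQL as a string, so this should only ever be run with trusted table names.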

https://datasette.io/content?sql=SELECT+%0D%0A++++%27SELECT%0D%0A++%27+%7C%7C+GROUP_CONCAT%28%27%28CASE+WHEN+%27+%7C%7C+name+%7C%7C+%27+IS+NOT+NULL+AND+%27+%7C%7C+name+%7C%7C+%27+%21%3D+%27%27%27%27+THEN+1+ELSE+0+END%29%27%2C+%27+%2B%0D%0A++%27%29+%0D%0A++++%7C%7C+%27+AS+non_empty_count%2C+*%0D%0A++FROM+%27+%7C%7C+%3Atable+%7C%7C+%27+ORDER+BY+non_empty_count+DESC+limit+3%27+AS+query%0D%0AFROM+pragma_table_info%28%3Atable%29%3B%0D%0A&table=releases

And here's an example of running one of those generated queries:

https://datasette.io/content?sql=SELECT%0D%0A++%28CASE+WHEN+html_url+IS+NOT+NULL+AND+html_url+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+id+IS+NOT+NULL+AND+id+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+author+IS+NOT+NULL+AND+author+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+node_id+IS+NOT+NULL+AND+node_id+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+tag_name+IS+NOT+NULL+AND+tag_name+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+target_commitish+IS+NOT+NULL+AND+target_commitish+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+name+IS+NOT+NULL+AND+name+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+draft+IS+NOT+NULL+AND+draft+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+prerelease+IS+NOT+NULL+AND+prerelease+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+created_at+IS+NOT+NULL+AND+created_at+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+published_at+IS+NOT+NULL+AND+published_at+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+body+IS+NOT+NULL+AND+body+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+repo+IS+NOT+NULL+AND+repo+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+reactions+IS+NOT+NULL+AND+reactions+%21%3D+%27%27+THEN+1+ELSE+0+END%29+%2B%0D%0A++%28CASE+WHEN+mentions_count+IS+NOT+NULL+AND+mentions_count+%21%3D+%27%27+THEN+1+ELSE+0+END%29+AS+non_empty_count%2C+*%0D%0A++FROM+releases+ORDER+BY+non_empty_count+DESC+limit+3

(Screenshot, 2024-04-21: results of running the generated query against the releases table)

simonw commented 4 months ago

Results from that probably still need a bit of truncation before turning them into an example.

Would need to use evals to decide if passing one or two example rows in full gives better results than passing a set of example values for each column, as in https://github.com/datasette/datasette-query-assistant/issues/2#issuecomment-2021892973