cldellow / datasette-ui-extras

Add editing UI and other power-user features to Datasette.
Apache License 2.0
12 stars 1 forks source link

track statistics about columns in `_dux_column_stats` #53

Closed cldellow closed 1 year ago

cldellow commented 1 year ago

I'd like non-programmers to be able to use datasette-ui-extras successfully to build edit UIs. A challenge with SQLite is that its untyped/flexibly typed.

We'd like to recognize:

Proposal:

  1. We'll collect stats on every column. We'll exclude virtual tables and tables that start with _.
CREATE TABLE _dux_column_stats(
  table text not null,
  column text not null,
  type text not null,
  nullable boolean not null,
  min any,
  max any,
  computed_at text not null default (datetime()),
  limit integer, -- Was this based on SELECT *, or SELECT * LIMIT N ?
  distinct_limit int not null, -- How many distinct examples were we willing to capture?
  distincts text not null, -- eg [{ value: 123.1, count: 123}]
  json_each_distincts text not null, -- same shape as distincts, but the contents of JSON arrays
  nulls integer not null, -- the output of COUNT(*) FILTER (WHERE TYPEOF(column) == 'null')
  integers integer not null, -- as above, but integer
  reals integer not null, -- as above, but real
  texts integer not null, -- as above, but text
  blobs integer not null, -- as above, but blob
  primary key (table, column)
);

That ought to be enough for us to determine the serialization format of a column, eg whether it's seconds since the epoch or ISO timestamp with T or with space.

  1. On startup, we'll ensure the table exists for every attached writable database. If the schema isn't exactly what we expect, we'll drop and recreate it. This is super opinionated! Maybe we'll have an opt-out knob later, but even then, it should be on by default so that it's easy to use.

  2. We'll be able to fetch stats on demand. If there's no entry in the table, we'll do a minimal scan based on WITH small AS (SELECT "column" FROM table LIMIT 1000) SELECT ...

  3. In the absence of stats, we'll assume that BOOLEAN is a checkbox (0/1), DATE is DATE() and DATETIME is DATETIME().

  4. ~Whenever we do a minimal scan, we'll also queue a full scan to happen in a separate thread. Basically, things should trend towards being accurate.~ To start, we'll just pick a high enough N that this generally works for common end-user scenarios.

  5. This table should be configured as a hidden table so it does not appear for the end-user.