I'd like non-programmers to be able to use datasette-ui-extras successfully to build edit UIs. A challenge with SQLite is that its untyped/flexibly typed.
We'd like to recognize:
Dates like 2021-12-03 (eg date())
Date/times like 2021-12-03 01:02:03 (eg datetime())
Date/times with UTC flag like 2021-01-04T21:26:30Z
Date/times with more precision and no timezone: 2020-05-01T16:10:12.469751
Date/times like 2022-11-11T21:44:21+00:00
Seconds since the epoch like 1575765142 (question: is this a date or a date time? If it's a date, what TZ is it in?)
Array of strings (these should also be summarized as fixed sets)
Array of ints
Fixed sets smaller than 100 unique items
Store popularity info, and whether we observed more than 100 items
The threshold of 100 should be configurable
Proposal:
We'll collect stats on every column. We'll exclude virtual tables and tables that start with _.
CREATE TABLE _dux_column_stats(
table text not null,
column text not null,
type text not null,
nullable boolean not null,
min any,
max any,
computed_at text not null default (datetime()),
limit integer, -- Was this based on SELECT *, or SELECT * LIMIT N ?
distinct_limit int not null, -- How many distinct examples were we willing to capture?
distincts text not null, -- eg [{ value: 123.1, count: 123}]
json_each_distincts text not null, -- same shape as distincts, but the contents of JSON arrays
nulls integer not null, -- the output of COUNT(*) FILTER (WHERE TYPEOF(column) == 'null')
integers integer not null, -- as above, but integer
reals integer not null, -- as above, but real
texts integer not null, -- as above, but text
blobs integer not null, -- as above, but blob
primary key (table, column)
);
That ought to be enough for us to determine the serialization format of a column, eg whether it's seconds since the epoch or ISO timestamp with T or with space.
On startup, we'll ensure the table exists for every attached writable database. If the schema isn't exactly what we expect, we'll drop and recreate it. This is super opinionated! Maybe we'll have an opt-out knob later, but even then, it should be on by default so that it's easy to use.
We'll be able to fetch stats on demand. If there's no entry in the table, we'll do a minimal scan based on WITH small AS (SELECT "column" FROM table LIMIT 1000) SELECT ...
In the absence of stats, we'll assume that BOOLEAN is a checkbox (0/1), DATE is DATE() and DATETIME is DATETIME().
~Whenever we do a minimal scan, we'll also queue a full scan to happen in a separate thread. Basically, things should trend towards being accurate.~ To start, we'll just pick a high enough N that this generally works for common end-user scenarios.
This table should be configured as a hidden table so it does not appear for the end-user.
I'd like non-programmers to be able to use
datasette-ui-extras
successfully to build edit UIs. A challenge with SQLite is that its untyped/flexibly typed.We'd like to recognize:
2021-12-03
(egdate()
)2021-12-03 01:02:03
(egdatetime()
)2021-01-04T21:26:30Z
2020-05-01T16:10:12.469751
2022-11-11T21:44:21+00:00
1575765142
(question: is this a date or a date time? If it's a date, what TZ is it in?)Proposal:
_
.That ought to be enough for us to determine the serialization format of a column, eg whether it's seconds since the epoch or ISO timestamp with T or with space.
On startup, we'll ensure the table exists for every attached writable database. If the schema isn't exactly what we expect, we'll drop and recreate it. This is super opinionated! Maybe we'll have an opt-out knob later, but even then, it should be on by default so that it's easy to use.
We'll be able to fetch stats on demand. If there's no entry in the table, we'll do a minimal scan based on
WITH small AS (SELECT "column" FROM table LIMIT 1000) SELECT ...
In the absence of stats, we'll assume that
BOOLEAN
is a checkbox (0/1),DATE
isDATE()
andDATETIME
isDATETIME()
.~Whenever we do a minimal scan, we'll also queue a full scan to happen in a separate thread. Basically, things should trend towards being accurate.~ To start, we'll just pick a high enough N that this generally works for common end-user scenarios.
This table should be configured as a hidden table so it does not appear for the end-user.