cldellow commented 1 year ago

Proposal

We'll monkey-patch the existing ColumnFacet, DateFacet, ArrayFacet.

A pragmatic way to get started:

[x] ui: move facets to sidebar, see #9
[x] ui: when rendering in the HTML context, don't compute facets
[x] ui: add some JS that fetches facets via the JSON API and updates the sidebar
- ideally, we only compute facets? I think we can pass nocount, nosuggest to disable a bunch of stuff; maybe we can also pass pagesize=0? see code
  - passing _size=0&_nocount=1 is a start, although the size gets bumped up to 1 (to discover pagination? special case of 0!)
- ideally we make 1 call per facet, so we can begin rendering as soon as any data is available
- it'd be nice if we could reuse _facet_results.html -- but not necessary for an MVP
[x] ui: rewrite toggle_url - strip .json, _nocount, _size
[x] ui: support facet truncation
[x] ~ui: make facets OR (eg you can pick multiple options for each, via a checkbox)~
[x] ui: format numbers with thousands separator
[x] ui: show spinner when loading facets (...need to make facets slower to compute :)
[x] backend: make the list of __dux_facets driven by metadata and qs params
[x] ui: insert facets in order, even if responses come back out of order
[x] ~ui: add explicit column options to facet by each of the facet types~
[x] ~backend: compute facet suggestions once per table; re-use them regardless of WHERE clause, OFFSET, LIMIT, etc~
- if we could trace lineage of columns through arbitrary queries (see https://github.com/simonw/datasette/issues/1293), this could be re-used for deciding how to facet columns in such queries
- this might be as simple as tracking COUNT(*), COUNT(column), COUNT(DISTINCT column), MIN(column), MAX(column), SUM(column), TOTAL(column), AVG(column). Or it could be as complex as tracking the actual sets of distinct values (needed for #3, #14). Probably start with the simple summary statistics.
[x] backend: normalize WHERE clause to remove rowid filter when computing facet results
- might be able to do this in the UI actually, by stripping out the _next parameter
[x] ~backend: cache facet results (insight: maybe TTL or willingness to stale read should be a function of how long it took to compute the result)~
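
A rough sketch of the URL juggling from the checklist: building the lazy facet request from the page URL, then cleaning a `toggle_url` so it points back at the HTML page. The function names and the `FACET_ONLY_PARAMS` set are made up for illustration; the real thing would live in the plugin's JS, this is just Python to pin down the logic.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these are the only params we add to, or strip from, the facet request.
FACET_ONLY_PARAMS = {"_size", "_nocount", "_next"}


def facet_request_url(page_url: str) -> str:
    """Turn a table page URL into a JSON URL that mostly just computes facets."""
    parts = urlsplit(page_url)
    # Drop _next so pagination does not change the facet counts.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FACET_ONLY_PARAMS
    ]
    query += [("_size", "0"), ("_nocount", "1")]
    path = parts.path if parts.path.endswith(".json") else parts.path + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), parts.fragment))


def clean_toggle_url(toggle_url: str) -> str:
    """Strip .json, _nocount and _size so the link targets the HTML page again."""
    parts = urlsplit(toggle_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in {"_size", "_nocount"}
    ]
    path = parts.path[: -len(".json")] if parts.path.endswith(".json") else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), parts.fragment))
```

For example, `facet_request_url("/db/plants?country=CA&_next=100")` gives `/db/plants.json?country=CA&_size=0&_nocount=1`.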

Background

Facets are really cool. They're complex and present many tradeoffs. There's no one right set of choices that can satisfy everyone. I think Datasette's current approach with them is fairly conservative, which is a sensible choice for Datasette, the platform. For my own tastes, I'd like them to behave a bit differently. And since this is a plug-in, YOLO, let's take a wildly different approach.

What I like:

Plugin interface: register_facet_classes and filters_from_request permit a wide array of shenanigans.
Plumbing: there's a ton of machinery to make facets work, to serialize parameters, to render it. I think most of this can be re-used. In fact, the exposure of facets via the JSON API will be key to enabling lazy-loading of facets.

What I'd like to adjust:

Facets being opt-in: I'd rather see all the facets, and maybe hide them if needed.
Add some extra facets: faceting by month or by year might be nice; faceting by buckets for numeric columns
Be able to sort facets either by frequency or by label: humans likely expect some things to be sorted by label. You might also do a blend - sort by frequency for the top 3 items, then sort by label.
Facets being above-the-fold, in-line in content: I'd rather it be in a sidebar
Eager loading blocks render of the main data: I'd be OK with them loading in after-the-fact as an Ajax call -- if they were in a sidebar, this wouldn't cause reflow of the main table.
Facets results include their filters - if you have a Country filter with 5 options, and you select 1 of them, the other 4 go away. I'd like the country selection to limit the results shown by other facets (eg the list of states), but it should still let you add other countries.
Pagination affecting facets: eg compare page 1 and page 2. IMO, the facet counts should be the same.
Facets are slow: 3 facets on global-power-plants takes ~2 seconds to render. If you enable tracing, you can see that it's the facet queries that are slow. I think there are a few culprits:
- doing a query for each facet is slow. SQLite's VM is very naive, stepping over each row is very expensive. Scanning the table N times for N columns, 1 at a time is much slower than scanning the table once for N columns.
- the facet suggestion queries are slow. In some cases, I think this is because they're buggy -- they're meant to look at the first 100 rows, but are actually doing a full table scan in the common negative case because the LIMIT clause applies after the WHERE clause, see DateFacet
- stepping back - deciding if a column is a candidate for a facet is information that never or rarely changes. Doing it on every page view seems very inefficient, especially if it blocks render.
Facet queries aren't cached - For my use case, a common scenario is landing on the main table page. This is probably 80% of views. If we cached only this, and ignored the long tail of filter permutations, we'd get a big boost. Another good reason to cache it: the main table page is unfiltered, so it's also the largest set of data, and slowest for which to compute facets
Facets aren't guaranteed to make progress - the FARA table has a different set of pages time out each time I refresh it. :( It'd be nice if the progress that was made contributed towards future refreshes, or if a background thread could make progress on it so a future refresh found it.
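
To make the "LIMIT applies after WHERE" point concrete, here's a small, self-contained sketch. It is not Datasette's actual suggestion query: the `plants` table and the `looks_like_date` function are stand-ins, and the function exists only to count how many rows SQLite feeds it.

```python
import sqlite3

calls = 0


def looks_like_date(value):
    """Hypothetical stand-in for a date check; counts how many rows it sees."""
    global calls
    calls += 1
    return 1 if isinstance(value, str) and len(value) >= 10 and value[4] == "-" else 0


conn = sqlite3.connect(":memory:")
conn.create_function("looks_like_date", 1, looks_like_date)
conn.execute("CREATE TABLE plants (name TEXT)")
# None of these values look like dates: the "common negative case".
conn.executemany("INSERT INTO plants VALUES (?)", [(f"plant {i}",) for i in range(1000)])


def count_calls(sql):
    """Run a query to completion and report how many rows were checked."""
    global calls
    calls = 0
    conn.execute(sql).fetchall()
    return calls


# LIMIT caps the number of *matching* rows, so with no matches every row is checked.
full_scan = count_calls(
    "SELECT name FROM plants WHERE looks_like_date(name) LIMIT 100"
)

# Limiting in a subquery first means only the first 100 rows are ever checked.
first_100 = count_calls(
    "SELECT name FROM (SELECT name FROM plants LIMIT 100) WHERE looks_like_date(name)"
)

print(full_scan, first_100)
```

The second form is what "look at the first 100 rows" would need to look like if that is really the intent.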

That's a big list! They don't all depend on each other. This ticket is primarily to explore the performance side of things.
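
On the performance side, the "scan once for N columns" idea could look something like the sketch below: one query that gathers the simple summary statistics for several columns at once. The helper names are mine, and whether this is actually faster than N separate queries is still something to measure, not a given.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Quote a table or column name for use in SQL."""
    return '"' + name.replace('"', '""') + '"'


def summary_stats_sql(table, columns):
    """Build a single SELECT that computes stats for every column in one pass."""
    parts = ["COUNT(*)"]
    for column in columns:
        quoted = quote_identifier(column)
        parts += [
            f"COUNT({quoted})",
            f"COUNT(DISTINCT {quoted})",
            f"MIN({quoted})",
            f"MAX({quoted})",
        ]
    return f"SELECT {', '.join(parts)} FROM {quote_identifier(table)}"


def summary_stats(conn, table, columns):
    """Run the query and unpack the single result row into a dict per column."""
    row = conn.execute(summary_stats_sql(table, columns)).fetchone()
    stats = {"*": {"count": row[0]}}
    for i, column in enumerate(columns):
        count, distinct, lowest, highest = row[1 + 4 * i : 5 + 4 * i]
        stats[column] = {
            "count": count,
            "distinct": distinct,
            "min": lowest,
            "max": highest,
        }
    return stats
```

A column whose `distinct` count is small relative to `count` is a plausible facet candidate; a column where `distinct` equals `count` probably isn't.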

cldellow commented 1 year ago

The facet_results for DateFacet are a good reference on how to make a facet.

I think you could suppress facet computation outside of a specific AJAX call that we use for lazy loading, so that's promising.

I suspect we could just have suggest always be a no-op, and have facet_results always return values?

cldellow commented 1 year ago

A wrinkle: SQLite will spill to disk when doing counts. A naive Python approach wouldn't, which limits the size of database this technique could work with.

Additionally, some rough benchmarking shows that the Python VM is really slow for computing statistics, even if you avoid function calls and things that look-safe-but-secretly-are-implemented-via-exceptions, like dict.get(key, default-value). So, even on a dataset that fit in memory, the runtime perf will be pretty underwhelming.

That makes me think the right approach here is to pivot a bit.

keep the heavy lifting in SQLite
be willing to cache the output of previous facet computations (in memory? in a temporary sqlite file that gets discarded? in the same db as the table we're faceting?)
normalize those queries to improve the cache hit rate (eg strip rowid > :p0 clauses)
potentially: remember when a facet can never apply. e.g. we don't need to repeatedly scan the rowid column to learn that it still doesn't contain a JSON array of tags
be able to ajax the results in, to avoid blocking render
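
A minimal sketch of the caching piece, under a few assumptions: the key is normalized by dropping pagination params (rather than rewriting the `rowid > :p0` clause in SQL), and an entry stays fresh for a multiple of however long it took to compute. `FacetCache`, `cache_key` and the multiplier are hypothetical names and numbers, and this version is in-memory only.

```python
import time

# Assumption: these params affect which page is shown, not what the facets are.
PAGINATION_PARAMS = {"_next", "_size"}


def cache_key(database, table, params):
    """Normalize query-string pairs so page 1 and page 2 share a cache entry."""
    kept = sorted((k, v) for k, v in params if k not in PAGINATION_PARAMS)
    return (database, table, tuple(kept))


class FacetCache:
    """In-memory cache where slow-to-compute results are kept for longer."""

    def __init__(self, ttl_multiplier=100.0, clock=time.monotonic):
        self._entries = {}  # key -> (expires_at, value)
        self._ttl_multiplier = ttl_multiplier
        self._clock = clock

    def get_or_compute(self, key, compute):
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]
        started = self._clock()
        value = compute()
        elapsed = self._clock() - started
        # A 2 second computation with the default multiplier is cached for 200 seconds.
        expires_at = self._clock() + elapsed * self._ttl_multiplier
        self._entries[key] = (expires_at, value)
        return value
```

Nothing here evicts old entries or survives a restart, which fits the "this is just an optimization" framing but would need bounding before real use.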

cldellow commented 1 year ago

I think I like the idea of storing the cached values in a temporary DB that goes away when Datasette is restarted. That might keep me more honest -- this is just an optimization, nothing of value should be stored in it, and nothing should assume its existence is guaranteed.

https://www.sqlite.org/inmemorydb.html says that opening an empty string will get you a file-backed temp db that goes away when the connection closes. I like the automatic deletion...but I think it's tied to a connection, so the db would be bound to a single thread.

I'd rather have it in WAL mode and be able to hit it from an arbitrary thread/process (eg: maybe we warm things up on startup).

This will mean that I'm excluding the possibility of running in "immutable" environments. That might be OK? I think many such environments offer a limited tmpfs.

Worst case scenario, we can open a shared in-memory db and use threads (vs processes) for the background worker.
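
For reference, the three storage options above look roughly like this with Python's `sqlite3` module. The `facet_cache` table, the directory prefix and the URI name are placeholders I made up; the connection strings themselves are standard SQLite.

```python
import os
import sqlite3
import tempfile

# Option 1: an empty filename gives a private temporary on-disk database that
# SQLite deletes when this connection closes. Only this connection can see it.
private_tmp = sqlite3.connect("")

# Option 2: a real file in a temp directory, in WAL mode, so other
# connections (threads or processes) can open the same path.
cache_dir = tempfile.mkdtemp(prefix="facet-cache-")
cache_path = os.path.join(cache_dir, "facets.db")
writer = sqlite3.connect(cache_path)
journal_mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE facet_cache (key TEXT PRIMARY KEY, value TEXT)")
writer.execute("INSERT INTO facet_cache VALUES ('k', 'v')")
writer.commit()
reader = sqlite3.connect(cache_path)
wal_rows = reader.execute("SELECT value FROM facet_cache").fetchall()

# Option 3: a shared in-memory database. Every connection in this process
# that opens the same URI sees the same data; other processes do not.
SHARED_URI = "file:facet_cache?mode=memory&cache=shared"
mem_a = sqlite3.connect(SHARED_URI, uri=True)
mem_b = sqlite3.connect(SHARED_URI, uri=True)
mem_a.execute("CREATE TABLE facet_cache (key TEXT PRIMARY KEY, value TEXT)")
mem_a.execute("INSERT INTO facet_cache VALUES ('k', 'v')")
mem_a.commit()
shared_rows = mem_b.execute("SELECT value FROM facet_cache").fetchall()
```

Option 2 doesn't clean up after itself, so the temp directory would have to be removed on shutdown (or simply left for the OS to reap).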

cldellow commented 1 year ago

Demo of facets: html, json

I wonder... could we compute facet_suggestions once at the top level and re-use that set forever? In some cases, you may have sufficiently filtered that it's no longer an interesting facet. But that's probably OK / preferable vs wasting a lot of CPU?

cldellow commented 1 year ago

A brutal idea: track max(rowid) for rowid tables, when it changes, invalidate facets.
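
Sketched out, that might be as small as the following. `VersionedFacets` and `table_version` are hypothetical names. It's brutal in the sense that it only notices inserts that raise the max rowid (or deleting the newest row); in-place updates and deletes elsewhere in the table slip past it.

```python
import sqlite3


def table_version(conn, table):
    """Use max(rowid) as a cheap stand-in for 'has this table changed?'."""
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT max(rowid) FROM {quoted}").fetchone()[0]


class VersionedFacets:
    """Keep facet results per table until the table's max(rowid) moves."""

    def __init__(self):
        self._seen = {}  # table -> (version, facets)

    def get(self, conn, table, compute):
        version = table_version(conn, table)
        cached = self._seen.get(table)
        if cached is not None and cached[0] == version:
            return cached[1]
        facets = compute()
        self._seen[table] = (version, facets)
        return facets
```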

cldellow commented 1 year ago

Where I'm at: I can move the facets to the sidebar (good), but they're still being rendered by the server (bad).

I can do the rendering client-side if I know which facets to fetch.

Facets can come from two places:

being hardcoded in metadata.json
being suggested by a facet

I can pluck out the ones hardcoded in metadata.json and pass them via https://docs.datasette.io/en/stable/plugin_hooks.html#extra-body-script-template-database-table-columns-view-name-request-datasette
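
Something like this could build the script body for that hook. It's a sketch: the helper names are mine, and the shape of the `__dux_facets` payload (a list of `{"type", "column"}` objects) is an assumption. It relies on table metadata listing facets either as plain column names or as single-key objects like `{"array": "tags"}`.

```python
import json


def facets_from_metadata(table_metadata):
    """Normalize a table's metadata "facets" list into type/column pairs."""
    facets = []
    for entry in table_metadata.get("facets", []):
        if isinstance(entry, str):
            # A bare string means a plain column facet.
            facets.append({"type": "column", "column": entry})
        else:
            # A dict such as {"array": "tags"} names the facet type explicitly.
            for facet_type, column in entry.items():
                facets.append({"type": facet_type, "column": column})
    return facets


def facet_bootstrap_script(table_metadata):
    """Return a JS snippet that hands the facet list to client-side code."""
    payload = json.dumps(facets_from_metadata(table_metadata))
    # Escape "<" so a column name cannot close the surrounding <script> tag.
    payload = payload.replace("<", "\\u003c")
    return f"window.__dux_facets = {payload};"
```

The string from `facet_bootstrap_script` is what an `extra_body_script` implementation would hand back for the table view, leaving the sidebar JS to loop over `window.__dux_facets` and fetch each facet.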

The ones suggested by a facet are harder, because we've lost the context. We could stash information on the request object?

Maybe for now we just hardcode some stuff so we can scaffold the rest of the system.

cldellow commented 1 year ago

I think faceting everything by default is too much of a change.

Instead, we'll do what DS does today and drive facets from metadata and qs params.

I still think we should just kill facet suggestion--the marginal utility of being more discoverable is not worth the perf and screen real estate hit on every request.

Instead, we'll add explicit facet options in the column dropdowns.

To start, just add an option for each facet indiscriminately. Later we can make it smart enough to only show valid choices.

cldellow commented 1 year ago

I think there isn't an official way to hook column actions yet: https://github.com/simonw/datasette/issues/983#issuecomment-752729035

We might be able to party in https://github.com/simonw/datasette/blob/main/datasette/static/table.js#L1

...although discovering which column we're connected to will be a pain. I think we'd have to parse the absolute positioning, then reverse engineer which column it is.

This also means mobile users won't be able to control facets. I'm OK with that.

cldellow commented 1 year ago

Let's scope this down to just getting desktop feature parity.

Caching and changing to ORs can come as their own issues.

cldellow / datasette-ui-extras

re-jig facets #21
