Provide dimension lookup in a star/snowflake schema style

monfera commented 4 years ago

On this ticket, we could discuss the need and feasibility for dimensional lookup, to aid i18n, responsive annotation text lengths and assisting efficient, ad-hoc data exploration.

(image: Wikipedia)

Currently, textual data for display as legends, categorical axis tick labels etc. are expected to be in the data array, ie. currently we rely on denormalization. This is a good approach with Cartesian charts, especially when elastic-charts can expect a newly built denormalized array on every operation. But these items might be motivation for allowing a dimensional lookup:

the user, or their clients or audience needs internationalization, and texts for dimension codes (eg. 'United Kingdom' for 'UK') are represented in several languates, uploaded into an index
the report/visualization generally has constraints for text sizes; many systems solve it by using progressively shorter text for table headers, pie/treemap labels, categorical tick labels (also depending on horizontal vs vertical orientation) etc., for example, 'United Kingdom of Great Britain and Northern Ireland' -> 'United Kingdom' -> UK - storing just 5 languages and 3 text lengths would require 15 new fields on the document, and the fields would need to have naming conventions to encode the language/text length etc. so it'd get tedious very fast, and any change (eg. to 'Swaziland' -> 'eSwatini') would require reindexing
There may be other types of lookup, eg. deriving the country from the city code, or deriving the region from the country code; if there's a maximum of one hop, it's a star schema, if it's more, it's a snowflake schema
While denormalization works in theory, it fails once we consider ad-hoc reporting, eg. the user has a sudden interest in slicing/dicing, aggregating, annotating etc. on something that was not designed to be in the document

The discussion is needed, because

such master data joins may eventually be needed or implemented for underlying ES query languages such as ESSQL, maybe lessening the need to do so
even if a SQL JOIN were provided, there'd be technical merit to avoid re-querying some large dataset just because the user wants to now switch to another language, or responsive resizing leads to cramped space and there's less room for the table headers, tick labels or other annotations

Describe a solution A star or snowflake schema for elastic-charts: besides accepting a facts table (documents/aggregates as now) it could accept an arbitrary, typically low number of dimension tables, while also providing the means for reaching into those dimension tables via (most importantly) equality checking of dimension values against document/aggregate values (star schema) or even allow multi-hop, eg. hierarchical lookup as in the case of eg. a product nomenclature or any other breakdown often needed for treemaps, sunburst charts, table etc. visualizations

While technically, denormalization works (with potentially large to infeasible runtime cost), it's at odds with our evolving visualization algebra, because it hides and obscures how facts (dimensions or granular aggregates) relate to dimensional data. But this very information is key for the user to navigate the chart design space.

Alternatives considered

Continuing as now: denormalized everything
ES schema is not denormalized, but ESSQL obtains a JOIN: elastic-charts wouldn't need to change, but basic things eg. responsive resizing would need that ES is re-queried
Handling dimensional lookup inside elastic-charts (this ticket is biased toward this option)
A comprehensive approach in which Kibana expressions and relational joins (including client-side joins where they're preferable, and server-side joins, once ES or ESSQL supports them) are considered at the Kibana level (ie. so it's equally available to Lens, Vega, 3rd party charts etc.
Combining either or both of the latter two with crossfiltering, which also benefits from the efficiencies of not having to denormalize data

There's no rush on it as option 1 works right now, but it's good to foresee this as it often comes up in tools but it's often hard to refactor for such a general approach later, because by that time a solution could've evolved or conserved, maybe overly specialized, one-off solutions (eg. "let's add a PR for text in multiple languages", then "let's add a PR for differing text lengths" etc.). Even if we don't do anything now, we can keep the approach in mind or refer to it in other discussions.

Checklist

[x] this request is checked against already exist requests
[ ] every related Kibana issue is listed under Kibana Cross Issues list
[ ] kibana cross issue tag is associated to the issue if any kibana cross issue is present

markov00 commented 4 years ago

I totally agree on that Robert, adding a star schema for lookups is a good starting point in my opinion. I don't know how this will be handled within Kibana and ES because I don't actually see a way to query your aggregated data and have some dimensions reported into a lookup table. But I can see some use cases like: formatting/editing in place a specific text field in a visualization.

monfera commented 4 years ago

Yes, in-place editing is a good example too, I didn't think of that. For elastic-charts the fact that Kibana can't (yet?) serve a set of arrays in a snowflake relation needs not be a constraint. In fact, we're already supporting snowflake-like inputs, eg. an explicit set of colors is joined inside elastic-charts with the also user-supported data for visualization. It's not a big jump (conceptually) to go from coloring—a visual channel dimension—to eg. the name of a dimension value in language X.

Also, specific parts of Kibana, eg. Canvas, may well supply multiple inputs, Canvas, via its expression language, has capable means to do so.

monfera commented 4 years ago

I think @dsmith001's comments here are relevant too as there are examples for reusable, shareable, themeable, modifiable dimensional mappings (colors, text elements etc). These exceed what's possible with a denormalized data structure where the original data and the related data are blended together.

monfera commented 4 years ago

Star schema, snowflake schema, facts, dimensions right from here in Andy Pavlo's lecture

elastic / elastic-charts

Provide dimension lookup in a star/snowflake schema style #464