Data View: cross-field metadata and their relationship to data visualization

monfera commented 3 years ago

Keywords: metadata, field, recommender, data view, shared visual attributes, datavis best practices

Examples in response to Vijay's request in moving from index patterns to Data View.

Cross-field metadata

Not all metadata neatly belong to a specific field or an entire index. Sometimes it's about the relationship between two or more fields within an index, or even across indices. Examples of metadata across fields, and their utility for visual exploration:

Fields whose contents relate to one another

Hierarchical relationship between fields

One field breaks down another. Examples:

Country - State/County - City
Product group / product / SKU

It's good to know if the subunit can even be used on its own. Eg. "Paris" can be "France/Île-de-France/Paris" or "US/Texas/Paris", so, on its own, it's ambiguous, unless the City field is a unique code.
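A minimal sketch of how such an ambiguity check might look, assuming documents are plain dicts and using hypothetical `country`, `state` and `city` field names (nothing here is an existing Data View API):

```python
from collections import defaultdict

def ambiguous_values(docs, child, parents):
    """Return child values that occur under more than one parent path."""
    paths = defaultdict(set)
    for doc in docs:
        paths[doc[child]].add(tuple(doc[p] for p in parents))
    # Keep only values reachable through several distinct parent paths
    return {value: sorted(p) for value, p in paths.items() if len(p) > 1}

# Illustrative sample documents (hypothetical field names)
docs = [
    {"country": "France", "state": "Île-de-France", "city": "Paris"},
    {"country": "US", "state": "Texas", "city": "Paris"},
    {"country": "US", "state": "Texas", "city": "Austin"},
]

result = ambiguous_values(docs, "city", ["country", "state"])
```

A non-empty result suggests the subunit field should only be offered together with its parent fields, eg. as a composite filter.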

Visualizations that work well across the hierarchy:

mosaic plot (partitioning)
treemap (partitioning)
sunburst (partitioning)
tree, dendrogram, partition tree
small multiples (eg. horizontal split: higher layer; color: lower layer)
various visualizations with drilldown or drill-through interactions to reach other places in the hierarchy, eg. showing US data, then descending to a specific state

Styling of hierarchical data might follow a primary breakdown, eg. one that is also projected to color, while the deeper nodes inherit that color (or fade out, as in a sunburst).

Multidimensional variables

Usually, there are several discrete (categorical or ordinal) variables associated with documents. They collectively represent slicing and dicing ability (explorability, drilldown, drill-through etc.). In a given chart, usually only one (very rarely, two) can utilize a color mapping.

Functional dependency: independent variables vs dependent variables

Knowledge or inference of which field(s) determine the value of other field(s).

Examples:

a country code plus a zip code fully determines a municipality
one field, or multiple fields combined, may act as a unique key (for the document level, or a given aggregation), called a candidate key in database terms (not just SQL!), or independent variable in statistics terms, or dimensions in data exploration

Often, exploratory interaction is about filtering or navigating in the realm of independent variables / dimensions, while the quantities and categories of dependent variables are aggregated (or in contrast, disaggregated) and visualized.
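Such a dependency could be inferred from the data when it isn't declared. A rough sketch, assuming rows are plain dicts; the field names and sample values are illustrative only:

```python
def determines(rows, determinants, dependent):
    """True if each combination of determinant values maps to one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in determinants)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True

# Illustrative rows (hypothetical field names)
rows = [
    {"country": "US", "zip": "75460", "municipality": "Paris"},
    {"country": "US", "zip": "78701", "municipality": "Austin"},
    {"country": "FR", "zip": "75001", "municipality": "Paris"},
]

zip_determines_municipality = determines(rows, ["country", "zip"], "municipality")
municipality_determines_country = determines(rows, ["municipality"], "country")
```

Note that a check like this can only refute a dependency on the data seen so far; it cannot prove that the dependency holds in general, so declared metadata remains more reliable than inference.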

Time and space dependency

Most metrics in an index may vary over time, and/or across spatial dimensions where available. It's useful to default to eg. a time series view or map view (recommender) and offer suitable visualization choices, eg. lines, if the time series is reasonably continuous.

Explanatory relationship

Key and text field pair:

one field is a code (eg. stable, standards or conventions following, unambiguous),
the other field is the full text for human consumption

Visualization and data exploration impact:

they should be treated as one unit by exploration interactions, eg. there should not be a separate filter dropdown for the code and the text (this still allows incremental search in either of the fields); they represent one variable eg. in a Cartesian or parallel coordinates chart
the text should show up in tooltip, legend, annotation
the text is possibly available in multiple languages and with multiple lengths, eg. to use whichever fits in a table column or as a categorical axis tick label
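One way to picture the code/text pair acting as a single unit: the filter operates on the code, and the text is only a display label chosen to fit the available space. A sketch under the assumption that label variants per code are available as a plain dict (the structure and function names are hypothetical):

```python
def pick_label(labels, max_chars):
    """Pick the longest label variant that fits, else the shortest one available."""
    fitting = [label for label in labels if len(label) <= max_chars]
    return max(fitting, key=len) if fitting else min(labels, key=len)

def filter_options(pairs, max_chars):
    """One filter option per code; the text only serves as the display label."""
    return [
        {"value": code, "label": pick_label(labels, max_chars)}
        for code, labels in sorted(pairs.items())
    ]

# Hypothetical code -> label variants of differing lengths
pairs = {
    "DE": ["Germany", "Federal Republic of Germany"],
    "FR": ["France", "French Republic"],
}

options = filter_options(pairs, 10)
```

The same `pick_label` idea would apply to table columns or categorical axis tick labels, with `max_chars` derived from the available width.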

Redundant metrics

Certain metrics may redundantly encode the same information (eg. same phenomenon, different unit) or may contain precomputed values (eg. elapsed time, MB, MB/s).

Physical data representation changes over time

For example, the user name of a given user changes; the name of a country changes; or an upstream logging system gets fixed. The new values may be in another field. A Data View may make the change disappear by abstracting over it. Benefits:

avoids the need to reindex a lot of data
visualization and report builders don't need to reintroduce custom logic repeatedly (DRY principle)
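As a sketch of what abstracting over such a change could mean, assume a logical field is declared once as an ordered list of physical fields, newest first. The names `LOGICAL_FIELDS`, `coalesce` and `view` are hypothetical, not part of any existing API:

```python
def coalesce(doc, candidates):
    """Return the first present, non-None value among candidate physical fields."""
    for name in candidates:
        value = doc.get(name)
        if value is not None:
            return value
    return None

# Hypothetical mapping: logical field -> physical fields, newest first
LOGICAL_FIELDS = {"user_name": ["user.name", "username"]}

def view(doc):
    """Project a physical document onto the logical fields of the view."""
    return {
        logical: coalesce(doc, physical)
        for logical, physical in LOGICAL_FIELDS.items()
    }

old_doc = {"username": "jdoe"}                      # written before the change
new_doc = {"user.name": "jdoe", "username": None}   # written after the change
```

Consumers of `view` see one `user_name` field regardless of when the document was indexed, so the fallback logic lives in a single place.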

Independence of metrics

If there's no established relationship among certain fields, they can be assumed independent of one another. This doesn't mean no correlation, and showing correlations is probably a good idea, eg. via scatterplot, SPLOM, parcoords.

Shared attributes

Here, multiple fields relate to one another through common properties. This can happen across fields within the same index, or among fields that are in disparate indices.

Shared nominal types (semantic domains)

While field types are present in Elasticsearch, they represent physical domains.

For example, a part to whole ratio may be represented

physically, as a float in the index
conceptually, as a real number between 0 and 1

A "megabytes transferred" metric may be represented

physically, as an integer in the index
conceptually, as an additive quantity, over which summing aggregations and summing visualizations (eg. partition charts) work

The physical type doesn't give much useful information about which transforms and visualizations are even legitimate. Nominal (semantic) types are required for

good data visualization defaults (eg. don't offer partition charts over non-additive metrics; don't allow logarithmic Y scale if the values can be zero or negative)
legitimate recommendations, within which the topmost ones are the most compatible ones, based on metadata
meaningful visual data transform builders, where compatible pieces fit together

Nominal typing may include these, and more:

allowed extent of data (positive numbers, or numbers within a specific range)
is it a continuous measure, ie. do the numbers represent a measurement, or are they just numbers that stand for some categorization? Eg. 0 means no error, 404 means page not found, etc. Or they may even be some kind of index number
discretized nature (eg. integers only, or increments of 0.2) or even, a limited set of allowed numbers or keywords
unit of measure: helps avoid adding an angle in degrees with an angle in radians; agg or report autoconversion may be possible
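To make this concrete, here is a small sketch of a semantic type descriptor and two of the defaults mentioned above. The `SemanticType` class and its attribute names are assumptions made for illustration; the hypothetical `price` type is not from the discussion:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SemanticType:
    name: str
    additive: bool = False            # does summing the values make sense?
    minimum: Optional[float] = None   # lowest allowed value, if known
    maximum: Optional[float] = None   # highest allowed value, if known
    unit: Optional[str] = None

def allows_partition_chart(t):
    """Partition charts (treemap, sunburst etc.) need an additive metric."""
    return t.additive

def allows_log_scale(t):
    """A log scale is only safe if values are known to be strictly positive."""
    return t.minimum is not None and t.minimum > 0

ratio = SemanticType("part_to_whole_ratio", minimum=0.0, maximum=1.0)
megabytes = SemanticType("megabytes_transferred", additive=True, minimum=0.0, unit="MB")
price = SemanticType("price", minimum=0.01)  # hypothetical strictly positive metric
```

A recommender could run such predicates over the fields of a Data View to filter out chart types before ranking the rest.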

Note: such typing information may eventually enable more compact representation in Elasticsearch.

Several fields that reference a shared semantic type are meaningfully related. Example: both buildings_index and roads_index have a field for occupied land area. They share a unit (eg. square meters) and they share the property of additivity. These two fields may even be linked to a common metadata descriptor (DRY principle in data modeling). Therefore, a report, visualization or data transform may safely add the land areas of buildings and roads to get the total land occupancy.

Even just the knowledge of a shared or convertible unit is useful for dataviz, because the fields can then be projected to a common vertical scale.
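A sketch of how unit metadata could make such a sum safe, assuming each value carries its unit and a conversion table to a base unit exists. The `TO_SQUARE_METERS` table and the sample figures are made up for illustration:

```python
# Conversion factors to the base unit (square meters); an assumed lookup table
TO_SQUARE_METERS = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}

def total_area_m2(measurements):
    """Sum (value, unit) pairs after converting each value to square meters."""
    return sum(value * TO_SQUARE_METERS[unit] for value, unit in measurements)

# Illustrative values, standing in for fields from buildings_index and roads_index
buildings = [(1200.0, "m2"), (0.5, "ha")]
roads = [(3.0, "ha")]

total = total_area_m2(buildings + roads)
```

An unknown unit raises a `KeyError` here instead of being silently added, which is the kind of guard that unit metadata makes possible.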

Shared visual attributes

Due to compatible nominal types,

it's possible to meaningfully union the domain, because their units are the same, or reconcilable
therefore it's sensible to map both to a common Y axis, or common color gradient

It's desirable that visual recommenders and defaults exploit common value=>aesthetic mapping when possible. Besides compatible nominal types, the default value=>aesthetic mapping can be associated with specific Data View fields, or even, across multiple Data Views.

Therefore, default mappings are first class entities which can be referenced by fields in Data Views (this still allows the implicit creation of mappings, if not shared among Data Views, for the user's convenience; can be made explicit and extracted when needed)
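The idea of a mapping as a first class, referenceable entity can be sketched as follows. The registry structure and the field names are hypothetical; the point is only that two fields reference one mapping by name and therefore share a scale built over the union of their domains:

```python
def union_domain(*domains):
    """Smallest (lo, hi) extent covering all given numeric domains."""
    return (min(lo for lo, _ in domains), max(hi for _, hi in domains))

def linear_scale(domain, out_range):
    """Return a function mapping the domain linearly onto out_range."""
    (d0, d1), (r0, r1) = domain, out_range
    return lambda v: r0 + (v - d0) / (d1 - d0) * (r1 - r0)

# Illustrative per-field extents, unioned into one shared domain
shared = union_domain((0, 400), (100, 1000))

# Hypothetical registry: mappings are named entities, referenced by fields
MAPPINGS = {"land_area_y": linear_scale(shared, (0, 300))}
FIELD_MAPPING = {"buildings.area": "land_area_y", "roads.area": "land_area_y"}

def project(field, value):
    """Project a field value to its visual position via the referenced mapping."""
    return MAPPINGS[FIELD_MAPPING[field]](value)
```

Because both fields resolve to the same mapping, equal values land on equal positions, which is what makes a common Y axis or color gradient readable.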

Multi-index Data Views

Sometimes data that relate to one another are not in the same index or index* group. Eg.

a field in the main index represents codes, while a small auxiliary index associates explanatory name with the code values
the indices describe relationships in the real world or in computer infrastructure, and specific types of entities are in their dedicated indices; one index field may reference a field in another (eg. there may be an index that associates road entities with building entities, based on which roads connect which buildings)

A future Data View may reference multiple index (or index*) entities, with metadata in the Data View describing the relationships among the indices and their fields (see cross-index fields)
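The code/name case amounts to a lookup join between a main index and a small auxiliary one. A minimal in-memory sketch, with documents as dicts and all field names hypothetical (it does not represent how Elasticsearch or Kibana would execute such a join):

```python
def enrich(main_docs, lookup_docs, key, lookup_key, lookup_value, target):
    """Attach the explanatory value from the auxiliary docs to each main doc."""
    names = {d[lookup_key]: d[lookup_value] for d in lookup_docs}
    # Codes without a match get None rather than being dropped
    return [{**d, target: names.get(d[key])} for d in main_docs]

# Illustrative main index documents and auxiliary code -> name documents
main = [
    {"status": 404, "bytes": 512},
    {"status": 200, "bytes": 2048},
    {"status": 418, "bytes": 0},
]
codes = [{"code": 200, "name": "OK"}, {"code": 404, "name": "Not Found"}]

enriched = enrich(main, codes, "status", "code", "name", "status_name")
```

The metadata a multi-index Data View would need is essentially the argument list of `enrich`: which field references which field in which other index.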

Derived information in Data Views

Eventually, a Data View should be able to represent an aggregation, filtering or other data transformation of its input (indices, or another, more granular Data View).

Even in this case, field level metadata is useful, per field and across fields, because the ultimate use in visual analytics is the same, and it requires various kinds of metadata.

So, Data Views may eventually become composable. Example: different parts of the organization may need

differing granularity
authorization to different slices of the data
different default value=>aesthetic mappings eg. color scales

Even if there's a single dashboard, or a set of dashboards that share a bunch of fields, it may be worth creating a common Data View for that, atop of a possibly preexisting Data View, so that theming and mappings can be shared:

(Vavaliya et al.: Online Performance Assessment System for Urban Water Supply and Sanitation Services in India)

A Data View that represents data transformation actually generates metadata. For example, a grouping aggregation will yield unique rows in terms of the values in fields that are part of the grouping dimensions.
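A sketch of a transform that emits metadata along with its output, assuming rows are dicts. The shape of the `metadata` dict is an assumption for illustration only:

```python
from collections import defaultdict

def group_sum(rows, dims, metric):
    """Sum a metric per combination of grouping dimensions.

    Returns the aggregated rows plus metadata describing what the
    transform guarantees about its output.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[metric]
    out_rows = [dict(zip(dims, key), **{metric: total}) for key, total in totals.items()]
    # The grouping dimensions are unique per output row, ie. a candidate key
    metadata = {"candidate_key": list(dims), "aggregated": {metric: "sum"}}
    return out_rows, metadata

# Illustrative input rows (hypothetical field names)
rows = [
    {"country": "US", "state": "TX", "mb": 10.0},
    {"country": "US", "state": "TX", "mb": 5.0},
    {"country": "FR", "state": "IDF", "mb": 7.0},
]

out_rows, metadata = group_sum(rows, ["country", "state"], "mb")
```

A downstream Data View or recommender could consume `metadata["candidate_key"]` directly instead of re-inferring the uniqueness from the data.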

elasticmachine commented 3 years ago

Pinging @elastic/datavis (Team:DataVis)

monfera commented 3 years ago

Field metadata drives some of the recommendations: https://data.humdata.org/dataviz-guide/dataviz-elements/#/data-visualization/bar-charts h/t @maartenzam

monfera commented 3 years ago

elasticmachine commented 1 year ago

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

markov00 commented 4 weeks ago

In order to provide better transparency of priorities, issues that will not be prioritized within the next 24 months are being closed.

Tracking request in Lens general improvements ice box https://github.com/elastic/kibana/issues/184648
