apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

[SIP-15] Transparent and Consistent Time Intervals #6360

Closed john-bodley closed 3 years ago

john-bodley commented 5 years ago

Motivation

Some of the core tenets of any BI tool are consistency and accuracy. However, we have seen instances where the time interval produces misleading or potentially incorrect results due to nuances in how Superset currently i) uses different logic depending on the connector, and ii) mixes datetime resolutions.

The time-interval conundrum

A datetime or timestamp (time for short) interval is defined by a start and end time and by whether each limit is inclusive ([, ]) or exclusive ((, )). This leads to four possible time interval definitions:

  1. [start, end]
  2. [start, end)
  3. (start, end)
  4. (start, end]

The underlying issue is that it is not apparent to the user in the UI, when they specify a time range, which definition Superset is invoking. Sadly the answer depends on the connector (SQLAlchemy or Druid REST API) and, in the case of SQLAlchemy engines, on the underlying structure of the datasource. This leads to several issues:

The latter is especially concerning for chart types which obfuscate the temporal component, or for time grains which use aggregation and thus potentially provide incorrect results, i.e., a user may think they are aggregating a week's worth of data (seven days) whereas in reality they are only aggregating six or eight days due to the inclusion/exclusion logic.
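To make the ambiguity concrete, here is a small Python sketch (using made-up midnight-aligned daily data) showing how the same nominal "week" of start/end limits yields 8, 7, 6, or 7 days of data depending on which of the four definitions is in force:

```python
from datetime import datetime, timedelta

# Hypothetical daily rows for 2018-01-01 through 2018-01-08 inclusive.
rows = [datetime(2018, 1, 1) + timedelta(days=i) for i in range(8)]

start = datetime(2018, 1, 1)
end = datetime(2018, 1, 8)  # the user asks for "a week of data"

definitions = {
    "[start, end]": [t for t in rows if start <= t <= end],
    "[start, end)": [t for t in rows if start <= t < end],
    "(start, end)": [t for t in rows if start < t < end],
    "(start, end]": [t for t in rows if start < t <= end],
}

for name, matched in definitions.items():
    print(name, len(matched))
# The four definitions match 8, 7, 6, and 7 days respectively --
# only [start, end) yields the seven days the user likely intended.
```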

Druid REST API

The Druid REST API uses the [start, end) definition (2) for time intervals. This is not explicitly mentioned in their documentation, though it is defined here.

SQLAlchemy

The SQLAlchemy engines use the [start, end] definition (1), i.e., the time filter limits are defined via >= and <= conditions respectively. Note however that, unbeknownst to the user, the filter may behave like (start, end] (4). Why? The reason is the potential mixing of dates and timestamps, combined with lexicographical ordering when evaluating clauses. Assume that the time column ds is defined as a date string; then a filter of the form,

WHERE
    ds >= '2018-01-01 00:00:00' AND
    ds <= '2018-01-02 00:00:00'

for a set of ds (datestamp) values of [2018-01-01, 2018-01-02] results in,

> SELECT
    '2018-01-01' >= '2018-01-01 00:00:00',
    '2018-01-01' <= '2018-01-02 00:00:00'
FALSE TRUE

and

> SELECT
    '2018-01-02' >= '2018-01-01 00:00:00',
    '2018-01-02' <= '2018-01-02 00:00:00'
TRUE TRUE

respectively. Due to the lexicographical order, the [start bound actually acts like (start, which is probably not what the user expected.
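Python's string ordering mirrors the lexicographical comparison a database applies to VARCHAR columns, so the surprising results above can be reproduced directly:

```python
# Plain string comparison, as a database evaluates a VARCHAR ds column.
# A shorter string that is a prefix of a longer one sorts BEFORE it,
# so the bare date falls below the start bound.
assert not ('2018-01-01' >= '2018-01-01 00:00:00')  # start date excluded!
assert '2018-01-02' >= '2018-01-01 00:00:00'        # end date included
assert '2018-01-02' <= '2018-01-02 00:00:00'
```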

Note this is especially problematic for relative time periods such as Last week (which is relative to today) if your time column is a datestamp, as in most cases the window would only contain six (rather than seven) days of data. Why? Because the [start, end] filter behaves like (start, end] and the data associated with the end date doesn't exist yet. Additionally, making the end limit inclusive is actually misleading, as the times are supposed to be relative to today, which implies exclusive of today.

Proposed Change

I propose there are two things we need to solve:

  1. Consistency: Ensure that all connectors and datasources use the same interval definition and that time columns are cast to a time type (timestamps are preferred to strings).
  2. Transparency: Explicitly call out in the UI what the interval definition is.

Consistency

Which of the four definitions makes the most sense? I propose that Druid's definition of [start, end) (2) makes the most sense, as it is guaranteed to capture the entire time interval regardless of the time resolution of the data, i.e., for SQL a 24-hour interval would be expressed as:

WHERE
    time >= TIMESTAMP '2018-01-01 00:00:00' AND
    time < TIMESTAMP '2018-01-02 00:00:00'

The reason not to opt for [start, end] (1) is that this 24-hour interval could potentially be expressed as:

WHERE
    time >= TIMESTAMP '2018-01-01 00:00:00' AND
    time <= TIMESTAMP '2018-01-01 23:59:59'

however this assumes that the finest granularity at which the time column is defined is seconds. In the case of milliseconds it wouldn't capture most of the last second in the 24-hour period. Also, the [start, end) definition ensures that adjacent time periods neither overlap nor leave gaps.
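A quick Python sketch of the tiling property: with half-open intervals, each interval's end coincides exactly with the next interval's start, and a boundary timestamp belongs to exactly one interval.

```python
from datetime import datetime, timedelta

# Three adjacent half-open day intervals:
# [Jan 1, Jan 2), [Jan 2, Jan 3), [Jan 3, Jan 4).
day = timedelta(days=1)
intervals = [
    (datetime(2018, 1, 1) + i * day, datetime(2018, 1, 1) + (i + 1) * day)
    for i in range(3)
]

# Each end coincides exactly with the next start: no gap, no overlap.
for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
    assert prev_end == next_start

# A midnight boundary timestamp falls into exactly one interval.
boundary = datetime(2018, 1, 2)
assert sum(start <= boundary < end for start, end in intervals) == 1
```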

Secondly, most relative time periods (Last day, Last week, etc.) are relative to today at 12:00:00 am (exceptions include things like Last 24 hours, which is relative to now). What's important is that these are implicitly exclusive of the reference time, i.e., we are looking at a previous period, and hence why end) really makes the most sense.
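As an illustration, a hypothetical last_n_days helper shows how an exclusive end naturally captures exactly N complete days while omitting the (partial) reference day:

```python
from datetime import date, timedelta

def last_n_days(n, today):
    """Hypothetical helper: 'Last N days' relative to midnight today,
    expressed as a half-open [start, end) interval."""
    end = today  # exclusive -- today's partial data is omitted
    return end - timedelta(days=n), end

start, end = last_n_days(7, date(2018, 11, 10))
covered = [start + timedelta(days=i) for i in range((end - start).days)]
assert len(covered) == 7                   # exactly seven complete days
assert date(2018, 11, 10) not in covered   # the reference day is excluded
```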

Finally, how do we address the lexicographical issue caused by mixing date and timestamp strings? Given that we explicitly define the time as a time (rather than a date), we should enforce that all time columns are cast/converted to a timestamp, i.e., in the case of Presto either,

WHERE
    DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
    DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'

(preferred) or

WHERE
    CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) >= '2018-01-01 00:00:00' AND
    CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) < '2018-01-02 00:00:00'

work.
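As a runnable stand-in for the Presto example (SQLite's built-in datetime() playing the role of DATE_PARSE here, purely for illustration), the following shows that raw string comparison matches the wrong day entirely, while converting ds to a timestamp first restores the intended [start, end) semantics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ds TEXT)")  # date strings, as in the ds example
conn.executemany("INSERT INTO t VALUES (?)",
                 [("2018-01-01",), ("2018-01-02",)])

bounds = ("2018-01-01 00:00:00", "2018-01-02 00:00:00")

# Raw lexicographical comparison: matches only '2018-01-02' -- the wrong day.
raw = [r[0] for r in conn.execute(
    "SELECT ds FROM t WHERE ds >= ? AND ds < ?", bounds)]

# Converting ds to a timestamp first (datetime() standing in for Presto's
# DATE_PARSE) yields the intended half-open behaviour.
fixed = [r[0] for r in conn.execute(
    "SELECT ds FROM t WHERE datetime(ds) >= ? AND datetime(ds) < ?", bounds)]

print(raw)    # ['2018-01-02']
print(fixed)  # ['2018-01-01']
```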

Transparency

I sense we need to improve the UI to explicitly call out:

  1. The interval definition.
  2. The relative time.

The interval definition

I sense a tooltip would suffice here, simply mentioning that the start is inclusive and the end is exclusive.

The relative time

Both defaults and custom time periods are relative to some time. The custom time mentions "Relative to today", however it isn't clear whether today means a time (now, 12:00:00 am, etc.) or a date. Furthermore the defaults make no mention of what the reference time is.

[screenshots: the existing time range controls, 2018-11-09]

In most instances it's today at 12:00:00 am, and there seems to be merit in explicitly calling this out. Additionally there may be merit in having an asterisk (or similar) when a relative time period is chosen, i.e., Last 7 days *, to help ground the reference.

Note the mocks below are not correct when referring to relative periods where the time unit is less than a day, i.e., for any quantity of seconds, minutes, or hours (say 48 hours) the reference time is now, and thus the text should update according to the unit selected.

[mocks: proposed time range controls, 2018-11-09]

Concerns

I sense there are potentially three major concerns:

  1. Migrations.
  2. Performance.
  3. Dates vs. datetimes (timestamps).

Migrations

If you asked producers of charts which of the four time interval definitions was being adhered to, you would get the full gamut of responses, i.e., it's not evident to them exactly what the current logic is, and thus it's not evident to us how we would migrate time intervals which used an absolute time for either the start or end. I sense the only solution here is to accept that this is a breaking change which, though challenging, provides more transparency and consistency in the future. An institution would probably want to inform their customers of such a change via a PSA or similar.

Performance

At Airbnb our primary SQLAlchemy engine is Presto, where the underlying tables are partitioned by datestamp (denoted as ds). One initial concern I had was that enforcing the time column to represent a timestamp, via a combination of Presto's date and time functions, would require a full table scan, i.e., the query planner would not be able to deduce which partitions to use, which would not be performant.

Running an EXPLAIN on the following query,

SELECT
    COUNT(1)
FROM
    <table>
WHERE
    DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-01-01 00:00:00' AND
    DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-01-02 00:00:00'

results in a query plan consisting of a filter for only the 2018-01-01 ds partition,

ScanFilterProject[table = <table>, originalConstraint = (("date_parse"("ds", '%Y-%m-%d') >= "$literal$timestamp"(1514764800000)) AND ("date_parse"("ds", '%Y-%m-%d') < "$literal$t
    Cost: {rows: ?, bytes: ?}/{rows: ?, bytes: ?}/{rows: ?, bytes: ?}                                                                                                                                     
    LAYOUT: <cluster>                                                                                                                                                                                        
    ds := HiveColumnHandle{clientId=<cluster>, name=ds, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY, comment=Optional.empty}                                                               
        :: [[2018-01-01]]

which means the Presto engine can correctly deduce which partitions to scan. Note I'm unsure if this holds true for all engines.

Dates vs. Datetimes (Timestamps)

Is there merit in differentiating between dates (2018-01-01) and datetimes (2018-01-01 00:00:00)? Dates are discrete whereas timestamps are continuous, and thus the perception of the interval may differ. Additionally, when we think about date intervals we normally think of [start, end] rather than [start, end). For example, Q1 is defined as 1 January – 31 March, which is inclusive of both the start and end date, i.e., [01/01, 03/31] rather than [01/01, 04/01).

One could argue that Druid is correct in using the [start, end) definition as it deals with timestamps, whereas SQLAlchemy datasources, which are probably weighted towards dates, are correct in using the [start, end] definition (excluding the issue with lexicographical ordering). There may be merit in adding explicit support for both dates and datetimes (Tableau supports both), which would require additional UI changes.

New or Changed Public Interfaces

Added clarity to the time range widget.

New dependencies

None.

Migration Plan and Compatibility

There are no planned migrations, however this would be a breaking change.

Rejected Alternatives

None.

to: @betodealmeida @fabianmenges @graceguo-supercat @jeffreythewang @kristw @michellethomas @mistercrunch @timifasubaa @williaster

Fingerzam commented 5 years ago

Here's one more example where the current behavior can be surprising and undesired:

We built an aggregation pipeline where one day's data is aggregated to date_trunc('day', timestamp). Since that date_trunc results in a timestamp where the time part is 00:00:00, this results in getting an extra day's results in a [start, end] time range compared to querying the original data with a date range. Changing to [start, end) would fix this as well.
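This scenario can be reproduced with a small Python sketch (using hypothetical midnight-aligned daily aggregates):

```python
from datetime import datetime

# Daily aggregates keyed by date_trunc('day', ts): the time part is
# always 00:00:00, so each aggregate sits exactly on a midnight boundary.
aggregates = [datetime(2018, 1, day) for day in range(1, 9)]

start, end = datetime(2018, 1, 1), datetime(2018, 1, 8)

inclusive = [t for t in aggregates if start <= t <= end]  # [start, end]
half_open = [t for t in aggregates if start <= t < end]   # [start, end)

assert len(inclusive) == 8  # the midnight-aligned end row sneaks in
assert len(half_open) == 7  # the intended seven days
```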

mistercrunch commented 5 years ago

+1 on relative expressions showing what they evaluate to instantly in the control

About the inclusive <= right bound, I also believe it should be exclusive. One way to do proper change management on this would be to:

That way:

john-bodley commented 5 years ago

@mistercrunch I don't believe that suggestion would work as it wouldn't be apparent to the user which of the time intervals were being enforced and thus could potentially add more confusion. I suspect we need to do something similar to the dashboard redesign rollout and banner the charts appropriately.

mistercrunch commented 5 years ago

@john-bodley I think it could be made very apparent to the user, perhaps in the control's label itself. For instance the label could show the < or <= on the label, and make it pop with bold/red, or even show an <Alert bsStyle="danger">, when deviating from the default.

john-bodley commented 5 years ago

I was hoping to get feedback on this SIP as I feel this is a fairly egregious error and there is merit in addressing it sooner rather than later. Additionally at Airbnb we're in the process of migrating away from the Druid REST API in favor of Druid SQL and hence would like to remedy this problem prior to the migration (to ensure consistency).

Note @mistercrunch I feel that the suggestion you provided still doesn't completely alleviate the problem, as it doesn't address the fact that we're mixing dates with timestamps and using lexicographical ordering, which means that for SQLAlchemy datasources the time period is a mix of (1) and (4) depending on the underlying temporal data types. Additionally, I wonder if introducing a mechanism for choosing between < and <= would add more confusion.

This may be a bitter pill to swallow, but given the complexity of the problem (unknown user intent, column datatypes, etc.) the only feasible solution may be to make a (somewhat severe) breaking change in an upcoming release which:

  1. Updates the SQLAlchemy model to use < rather than <=.
  2. Updates the database engine specs to use timestamps for temporal fields.
  3. Provides a banner on every chart informing users of the change (which can only be dismissed by an owner).
  4. Updates the release notes to clearly inform individuals of the breaking change.

Note it would then be up to each institution to send out a PSA to their users informing them of the breaking change and inform chart owners how to modify their charts if necessary.

john-bodley commented 5 years ago

Regarding next steps I was planning on polling a number of users (mostly producers) at Airbnb via a questionnaire to gauge their understanding on how they think dates/times work with the hope that this can help inform us regarding how to remedy the situation.

vylc commented 5 years ago

An immediate need we have is not only clarity on what "Last week" is choosing (we have a tooltip that shows what that is), but being able to dynamically choose the last completed period. Thoughts on at least getting something like this out there as a v1:

[screenshot: proposed control for selecting the last completed period, 2019-04-19]

john-bodley commented 5 years ago

Here are some findings from a survey given to a group of data scientists at Airbnb. The results are based on 46 responses.

Q1. Imagine you had data for the entire month of January (encoded as dates, e.g. 2019-01-01). Which of the following intervals makes the most sense?

Q2. Imagine you had data for the entire month of January (encoded as times, e.g. 2019-01-01 01:23:45.678). Which of the following intervals makes the most sense?

Q3. Are you aware of any discrepancies with how Superset filters dates or times for SQL databases (Hive, Presto, etc.)?

Q4. Could you please outline which discrepancies you are aware of?

Q5. Are you aware of any discrepancies with how Superset filters dates or times for Druid datasources?

Q6. Could you please outline which discrepancies you are aware of?

Q7. Currently Superset handles all temporal data as time. Should Superset support both date and time types?

john-bodley commented 5 years ago

TL;DR

After researching other BI tools and surveying content creators we propose to address SIP-15 by making all temporal intervals be [start, end).

Temporal intervals

Whilst researching other BI tools, it was somewhat difficult to determine how they handle intervals due to the lack of documentation. To the best of my knowledge, Looker uses [start, end) (see the {time} to {time} item) whereas Tableau uses [start, end]. The SQL BETWEEN operator also works as [start, end].

People generally think of dates (discrete) as having an inclusive start/end, i.e., if one were to read a sign saying "Sale ends Saturday", one would expect the sale to run through close of business on Saturday as opposed to Friday. Times (continuous), however, are different and are often thought of as having an exclusive end, i.e., a meeting scheduled from 9 – 10 am really is [9:00, 10:00).

Given that we foresee time becoming more prevalent, I believe it would be a mistake to make date intervals [start, end] and time intervals [start, end), as this would be quite confusing to the user; standardizing on one format is vital. I believe that [start, end) is the best format because:

For more detail see this post. Returning to inclusive dates, these can actually be problematic around leap years, i.e., [02-01, 02-28] may not capture the entire month of February (there is a gap), whereas [02-01, 03-01) is guaranteed to fully encapsulate the month.
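The leap-year point is easy to verify in Python (dates_in is an illustrative helper):

```python
from datetime import date, timedelta

def dates_in(start, end_exclusive):
    """All dates in the half-open interval [start, end_exclusive)."""
    return [start + timedelta(days=i)
            for i in range((end_exclusive - start).days)]

feb_2020 = dates_in(date(2020, 2, 1), date(2020, 3, 1))   # [02-01, 03-01)
closed = [d for d in feb_2020 if d <= date(2020, 2, 28)]  # [02-01, 02-28]

assert len(feb_2020) == 29  # 2020 is a leap year
assert len(closed) == 28    # the closed interval silently drops Feb 29
```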

Note this decision does not agree with the majority of the survey respondents; however, it is worth noting that there was support for [start, end) intervals when considering both dates and timestamps.

We suggest the following changes:

Implementation

Changing the SQL connector from [start, end] (which, again, often behaves like (start, end]) to [start, end) is non-trivial, i.e., a simple migration is not feasible given it's impossible to determine whether the producer knew about the discrepancies (and adjusted their time boundaries accordingly). Also we don't want to explicitly modify the chart parameters based on what we perceive may be correct. Per the survey results, only 21.7% of respondents were aware of the issue. Rather than performing a migration to address the issue, we propose the following (which aims to minimize impact by providing a rollout period):

† This results in additional code logic, but the complexity seems worthwhile given the impact of the change and the benefit of being able to see a visual diff of the change. Note this change will mostly impact charts where the time isn't explicitly shown (bar, pie, etc.).

Given this mostly impacts instances where the granularity represents a rollup, there may be merit in having a tooltip/alert on the chart indicating that a period contains, say, 6 or 8 days of data as opposed to 7 (assuming one was aggregating at the weekly level).

cc: @betodealmeida @graceguo-supercat @michellethomas @mistercrunch @soboko @sylvia-tomiyama @vylc @xtinec