dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow

[Feature] Support TimeZone-aware time dimension types #733

Open tlento opened 1 year ago

tlento commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

MetricFlow currently uses timezone-agnostic types for all date/time operations, but forcing people to coerce to UTC and then back-convert on render is cumbersome at best and impossible at worst. We have real-world examples of people running into date boundary issues where they have customer data stored in the customer-local timezone, and they want to compute daily customer-specific metrics with the boundaries set for that customer's local day (rather than UTC or whatever).

For situations like that it's natural to store the date/time information in local time with timezone annotations intact, as this makes common query types more natural and also allows for things like audits against local time values.

Currently, the only workaround is to normalize the timestamp in the warehouse (via the underlying dbt model) ahead of time, or else to use the expr field to do the conversion on the fly when the semantic model is constructed. This is somewhat limiting, as it does not allow for reuse of the same measure against different time zones.
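For illustration, here's a minimal sketch of the expr-style workaround, assuming a Snowflake/Redshift-flavored warehouse (CONVERT_TIMEZONE) and a hypothetical events model with a created_at column stored in UTC; the same expression could live in the underlying dbt model instead:

```sql
-- Hypothetical pre-normalization of a UTC-stored timestamp to one fixed
-- local time zone, either in the dbt model or in a dimension's `expr`.
-- Note this bakes in a single zone, so the measure can't be reused
-- against a different time zone without another column or expression.
SELECT
  CONVERT_TIMEZONE('UTC', 'America/New_York', created_at) AS created_at_local
FROM events
```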

This issue will only become more common when we add hourly granularities, so it's something we'll need to address.

Note: the use case described here requires us to aggregate against local time. If we instead aggregate against UTC, the daily aggregations will be offset relative to each user's local time, so any strategy that does everything in UTC proper and then offsets the result set in its entirety won't work: different rows need to be offset independently. Whether we support this kind of approach in MetricFlow itself or via some other mechanism involving dbt date/time plugins that would let MetricFlow do simpler aggregations is an open question.
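To make the per-row requirement concrete, here's a hedged sketch (hypothetical events table and customer_tz column, Snowflake-style SQL) of an aggregation where every row is shifted by its own customer's time zone before the daily bucket is computed, which is exactly what a single global offset on the result set cannot reproduce:

```sql
-- Each row converts UTC -> that customer's own zone before truncation,
-- so two events at the same UTC instant can land in different local days.
SELECT
  customer_id,
  DATE_TRUNC('day', CONVERT_TIMEZONE('UTC', customer_tz, event_ts)) AS local_day,
  SUM(amount) AS daily_amount
FROM events
GROUP BY 1, 2
```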

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

gneius commented 1 year ago

Thanks for logging an issue to keep track of the progress. 👍

gneius commented 1 year ago

Just to add a bit of context on some of our use cases related to timezones.

For a distributed organization spanning multiple geo locations, business metrics data would still be stored in a data warehouse as a single source of truth, and that data is likely normalized to a single timezone such as GMT.

However, when individual teams consume the data from the data warehouse in their own geos, they localize their queries to their own timezones, such as EST or PST. As a result, when sending a metrics query with a time granularity (for example, day), each geo would use its own timezone to create the daily buckets.

Hypothetically speaking, if we were able to send a query with Dimension('metric_time').grain('DAY').timezone('EST') and a query with Dimension('metric_time').grain('DAY').timezone('PST'), the two would produce different day boundaries for each daily bucket, and the final query results would differ as well.
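Not actual MetricFlow output, but as a rough sketch, the EST variant might compile to something like the following (assuming UTC-stored data, a Snowflake-style CONVERT_TIMEZONE, and hypothetical fact_orders/revenue names); the PST variant would only swap the target zone:

```sql
-- IANA zone names ('America/New_York') are generally safer than
-- abbreviations like 'EST', which don't track daylight saving time.
SELECT
  DATE_TRUNC('day', CONVERT_TIMEZONE('UTC', 'America/New_York', metric_time)) AS metric_time__day,
  SUM(revenue) AS revenue
FROM fact_orders
GROUP BY 1
```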

If this use case were handled directly in the data warehouse and the semantic model, we would have to create multiple time dimensions, one for each timezone that needs to be handled. That would make the semantic model quite complicated.

Hopefully this makes sense.

tlento commented 11 months ago

Adding some timezone type notes here from investigations into other issues.

Most engines support equivalents of both a timestamp with time zone and a timestamp without time zone type. Here's the breakdown of types:

| Engine | ntz type | tz type | `timestamp` behavior | `datetime` behavior |
| --- | --- | --- | --- | --- |
| BigQuery | DATETIME | TIMESTAMP | Timezone-aware, stores in UTC | ntz |
| Databricks | None (TIMESTAMP_NTZ is in limited-support public preview) | TIMESTAMP | Timezone-aware, stores in UTC | Not supported |
| DuckDB | TIMESTAMP | TIMESTAMPTZ | ntz | ntz |
| Postgres | timestamp without time zone | timestamp with time zone | ntz | Not supported, though it might work as ntz |
| Redshift | TIMESTAMP | TIMESTAMPTZ | ntz | Not supported, except it seems to work as ntz |
| Snowflake | TIMESTAMP_NTZ | TIMESTAMP_TZ | Session-configurable, defaults to ntz | ntz |

Additional notes:

  1. Postgres defaults timestamp to timestamp without time zone.
  2. Redshift appears to support datetime, although it is not documented. Postgres might, too; it used to.
  3. Snowflake is a special, well, snowflake in that it allows session-level overrides for the behavior of timestamp (see the example below).
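For reference, the Snowflake override in note 3 is a session parameter that controls which concrete type the bare TIMESTAMP alias resolves to:

```sql
-- Remap the TIMESTAMP alias for this session (the default is TIMESTAMP_NTZ);
-- TIMESTAMP_LTZ and TIMESTAMP_TZ are the other accepted values.
ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_NTZ';
```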

As a general rule, converting from ntz to tz types results in a change in the value, because some form of time zone adjustment may be implicitly applied: Snowflake uses the session's local time zone, while BigQuery and Databricks assume UTC.

Similarly, converting from tz -> ntz effectively truncates the time zone. Most engines also provide a mechanism to convert from tz -> ntz with an adjustment, but the mechanism varies (see the sketch below).
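To make both directions concrete, here's a BigQuery-flavored sketch (values and zones are illustrative):

```sql
-- ntz -> tz: the DATETIME is assumed to be in UTC, so the resulting
-- instant may differ from what the writer intended.
SELECT CAST(DATETIME '2024-01-01 12:00:00' AS TIMESTAMP) AS assumed_utc;

-- tz -> ntz with an explicit adjustment: DATETIME(ts, zone) renders the
-- instant in the requested zone and drops the zone annotation.
SELECT DATETIME(TIMESTAMP '2024-01-01 12:00:00+00', 'America/New_York') AS local_wall_clock;
-- -> 2024-01-01 07:00:00
```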

tlento commented 11 months ago

Based on the above:

  1. We should stick with ntz types internally wherever possible, as casting into them appears to always be safe with respect to preserving the underlying timezone-free display value, which should be what people typically expect.
  2. We expect all data to be stored in the same time zone by default. Any exceptions have to be explicitly handled in the semantic model config and addressed at query time.
  3. For timezone support we'll need a mechanism for people to request the timezone conversion, which generally requires specifying the target timezone; the origin timezone appears to be irrelevant here, because engines make their own default assumptions about the time zones in storage (typically UTC). We can probably make that conversion on the initial cast to the ntz type and carry it through (see the sketch after this list), but filter predicates should be treated with care.
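As a minimal sketch of point 3 (assuming Snowflake and hypothetical column/table names), the requested conversion is applied once, on the initial cast into the ntz working type, and everything downstream operates on plain ntz values:

```sql
-- Snowflake's three-argument CONVERT_TIMEZONE already returns
-- TIMESTAMP_NTZ, so the outer cast just makes the working type explicit.
SELECT
  CAST(
    CONVERT_TIMEZONE('UTC', 'America/New_York', order_ts) AS TIMESTAMP_NTZ
  ) AS metric_time
FROM fact_orders
```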

Note: this assumes all engines support the relevant conversion semantics in the manner we require, which needs further investigation. Snowflake and Postgres appear to have particularly funky tz -> ntz conversion semantics, although it's not clear how much of this is a documentation issue.

joellabes commented 10 months ago

@dbeatty10 has spent more time thinking about timezone interoperability than anyone I've ever met - he might be a helpful person to talk to about different platforms' oddities!