CartoDB / carto-vl

CARTO VL: a Javascript library to create vector-based visualizations
BSD 3-Clause "New" or "Revised" License
129 stars 26 forks

Temporal aggregations #551

Closed jgoizueta closed 6 years ago

jgoizueta commented 6 years ago

This originated in #130. In that ticket, @davidmanzanares said:

Since timestamps can have very high resolutions, but most of the time we don't need those, we should have a mechanism for aggregations. After implementing the naive version (without aggregations, setting the timestamp in dimensions), we could add new aggregation functions that only work with timestamps, for example: HOUR(), DAY(), WEEK(), MIN(), SEC(), MSEC()

We need 3 things here:

We should also consider a similar mechanism for quantization (in buckets, ranges, ...) of non-time properties to be sent as aggregation dimensions in the backend.

jgoizueta commented 6 years ago

Dimension Granularity Proposal

The idea is to allow defining the granularity (level of detail) of columns used as dimensions of the aggregation, i.e. columns which we don't want to summarize (by placing them inside cluster aggregation functions such as clusterAvg()).

For example, given a timestamp column with millisecond resolutions, we may want to obtain it as years, months, weeks, days, hours, etc. to use for animation.

In Torque the granularity is managed by specifying a number of buckets, but when using time values it is much more convenient to employ the common units of time.

We'll need to expand the aggregation API in the tiler to support this and define a way to use this feature in Carto VL.

Client Side (CARTO VL API)

I propose to implement expressions to define the granularity of a column, e.g. timeInHours($t), in a similar way to clusterAvg and the other cluster aggregations:

I'm not considering any alternative, such as variable/property declarations to define the granularity, because similar alternatives were discussed and dismissed for cluster aggregation functions.

The use of a timeInHours($t) expression introduces a new column t_hours defined as a SQL expression (which returns an epoch in hours) which would also be included in the GROUP BY of the aggregation.

We could have similar functions for numeric columns e.g. to classify by break points or number of buckets.

If the instantiation reports that data won't be aggregated and any of these expressions is used, it should fail (as for cluster aggregation functions).

Naming

We should define a consistent terminology for these expressions which makes its intent as clear as possible.

Some options I can think of:

Backend (Maps API)

The current dimensions parameter of aggregation would be expanded to accept a parameterized form of the discretization options, e.g.

{
  aggregation: {
    dimensions: {
      t_in_hours: {
        column: 't',
        granularity: 'hours'
      }
    }
  }
}

This would generate this kind of SQL wrapping:

SELECT date_part('epoch', date_trunc('hour', t))/3600 AS t_in_hours, ...
  FROM ...
  GROUP BY 1, ...

We should decide which format is more convenient for the resulting values:

The first format could be used directly by CARTO VL; the others would require some conversion on the client part.
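For the first format (epoch in hours), the conversion back to a date on the client is trivial; a minimal sketch (the function name is illustrative, not part of any API):

```javascript
// Convert an "epoch in hours" value (as produced by the SQL wrapping above)
// back to a JavaScript Date.
function hoursEpochToDate(tInHours) {
    return new Date(tInHours * 3600 * 1000); // hours → milliseconds
}

// e.g. 422189 corresponds to 2018-03-01T05:00:00Z
```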

jgoizueta commented 6 years ago

A couple of ideas from @davidmanzanares:

jgoizueta commented 6 years ago

We could have some auxiliary expressions like:

davidmanzanares commented 6 years ago

Yes, I think this is the way to go.

Regarding the API, I like clusterHours($date), clusterYears($date), or maybe the singular forms (like in Julia: https://docs.julialang.org/en/stable/stdlib/Dates/index.html ).

Regarding the ability to define the number of buckets, maybe we could specify the minimum duration / granularity instead of the number of buckets, for example: clusterSeconds($date, 4), meaning aggregate $date in time intervals of 4 seconds. Floating point values could be used too: clusterSeconds($date, 0.1) aggregates at tenths of a second.

Regarding the type of the returned values, I would stick to floating point values in seconds from the minimum value:

decodedDate = property.minimumDate + decodedValue

The reason for not using strings is space, and the reason for not using integers is the ability to use aggregations like clusterSeconds($date, 0.1). Also, using floating point allows us to always use seconds without problems, since in the case of huge values like decades the exponent is going to save us from overflow while keeping good precision (relative to those magnitudes).
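A minimal sketch of this encoding (names are illustrative, not the actual CARTO VL internals): values are transferred as float seconds relative to the minimum date of the property.

```javascript
// Encode a date as (possibly fractional) seconds since the property's minimum date.
function encodeTime(date, minimumDate) {
    return (date.getTime() - minimumDate.getTime()) / 1000;
}

// Decode: decodedDate = property.minimumDate + decodedValue (in seconds).
function decodeTime(value, minimumDate) {
    return new Date(minimumDate.getTime() + value * 1000);
}
```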

jgoizueta commented 6 years ago

I like the simple names (not the singular forms, because to me it seems that hour(t) extracts the hour of the time rather than giving the time in hours). But I thought that, just like with the clusterX functions, we should use a common prefix to make it evident these are special functions with similar behaviour.

And I really like the hours($t, 4) idea, and also agree with giving all values in seconds from the min.

jgoizueta commented 6 years ago

After talking with @davidmanzanares : in some cases it could be interesting to be able to obtain the hour of the day, day of the week or month of the year (as opposed to the time in hours, days or months). So we could have expressions for that. A possibility would be to use singular forms (hour, day, month) for it, but it would probably be too confusing.

jgoizueta commented 6 years ago

After conversation with @rochoa & @davidmanzanares

The cases of aggregation by hours of the day, hours of the week, days of the week, months of the year and days of the year are very common, and we should focus on providing a good solution for them.

In all these cases we have a unit of granularity and another unit which acts as a cycle (we compute the time in the first unit modulo the second unit).
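As a sketch of the modulo idea (UTC only; the real implementation would need time zone handling):

```javascript
// Hour of day: time in hours, modulo the day cycle (24 hours).
function hourOfDay(epochSeconds) {
    return Math.floor(epochSeconds / 3600) % 24;
}

// Day of week: time in days, modulo the week cycle (7 days).
// 1970-01-01 was a Thursday, so 0 = Thursday in this simplistic numbering.
function dayOfWeek(epochSeconds) {
    return Math.floor(epochSeconds / 86400) % 7;
}

// e.g. 1519880400 is 2018-03-01T05:00:00Z, a Thursday
```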

So, we'd like to support three different kinds of cases, and we could do it with expressions with three arguments (two of them optional):

jgoizueta commented 6 years ago

The SQL expressions will get quite complex when the second argument N is not 1 and irregular units are involved (minutes, hours, days and weeks are regular in that they're always a fixed number of seconds, but months and years are not).

When the third argument (cycle) is present I think it can be more meaningful to return the time in the corresponding units (e.g. months 1 to 12 for clusterMonths($t, 1, 'year')), and it can also simplify the expressions for irregular units.

jgoizueta commented 6 years ago

After experimenting with a branch of the tiler which has an API for time dimensions and returns the values as ordinal values in the requested units, rather than epoch numbers (e.g. for days of the week it returns values 0 to 6; for days of the month, 1 to 31, etc.), I've detected some issues that we should discuss:

The current backend metadata includes only values (min, max, ...) for the un-aggregated query. Including data about aggregated columns is not straightforward because the values depend on the zoom level, but information about the classified dimensions is not a problem.

Now we map date columns (as epoch numbers) to values in the range [0, 1]. This is not valid for classified time dimensions if we use the units described above. I see two options to handle them:

What do you think @davidmanzanares ?

Should we reconsider using epoch numbers for all time classifications? That would simplify mapping the values internally, but we would need to add complexity to provide tools to represent the values in a natural form (epochs are not great for days of the week).

Then, if we go on with the current implementation (ordinals), is the categories approach a valid option? Otherwise I'll add backend metadata options to request stats about aggregation dimensions.

jgoizueta commented 6 years ago

Regarding the last comment, and after talking with @davidmanzanares it seems simpler to use the metadata we have now and derive the dimension limits from it (for cyclics, use the theoretical limits).

That requires us to replicate some time logic in the client, but I guess it's not too bad: https://github.com/CartoDB/carto-vl/blob/99a362941c4e58007b50d97300ea0b95c7e42f78/src/client/WindshaftWorker.js#L27-L75

jgoizueta commented 6 years ago

Currently pending decisions:

After meeting with @rochoa & @davidmanzanares:

Background

Metadata

Currently metadata has entries for the base columns (before aggregation); then, if a base column is used in aggregations or dimensions, those are listed in an array. I have added some support to Metadata to alleviate this, but it's still quite confusing; maybe we should refactor Metadata (once again).

Dimension internal representation

Dates (as epoch numbers) are mapped to the 0,1 interval in property textures to increase the numeric precision, as often the magnitudes are too large for the range of variation.

That seems unnecessary for all cyclic dimensions, and for non-cyclic dimensions it's probably only relevant for the smallest units (seconds, minutes).

For seconds, 2018 epoch values require 31 bits (25 bits for minutes).

To map the values, the min/max stats are needed, which are currently available only for the base date column (and might not be precise for generic MVT sources).

We can compute the dimension stats from them (for cyclic units only the theoretical range), but taking the TZ into account correctly would require a time library like moment-timezone.js or Luxon. This would increase the size of the package and add complexity and dependencies.

But we might not need accurate stats just for reducing the magnitude of the values; the differences due to an incorrect TZ (+/-12 hours at most) would be small when the magnitude is large, which is the case for which mapping is relevant. So, we could choose the mapping based on the units, then use the (approximate) min, max computed from the base stats, ignoring the TZ.

Dimension metadata

Global aggregation expressions (GlobalMin, GlobalMax, etc.) are computed for properties using the metadata stats. Those expressions are useful as linear defaults. For example animation by default applies linear to a property which in turn will take its min and max, and this makes animations for dates work nicely by default.

Dimension expressions, like cluster aggregation expressions, are not accepted by global aggregation functions, but they're really wrapping actual properties of the data, so it would make sense to accept them as properties. Then we'd need dimension metadata stats.

We have two options to support it: compute it in the client from the base columns stats, or add it to the backend.

Client stats:

To avoid dealing with TZs in the client we could use a single TZ (per layer), and have the backend provide base date columns (and its stats) as epochs in the TZ, so dim limits can be computed accurately. But this subverts the notion of UNIX epoch and introduces discontinuities in the values.

Server stats:

This is a cleaner way of not dealing with TZs in the client. Date values are UTC epochs, but all dim grouped values use the requested TZs.

TZ libraries

Two alternatives currently exist that seem actively maintained:

jgoizueta commented 6 years ago

Example: how to return a (non-cyclic) month.

Month ordinal number (counting from some origin)

My initial implementation is a month counter, starting with 1 = 1970-01:

date_part('month', $t) + 12*(date_part('year', $t)-date_part('year', to_timestamp(0.0)))

(time zones are taken into account because $t is ultimately substituted by timezone(${tz}, to_timestamp(date_part('epoch', the_time))))

There we could have an optional origin (month 1) definable by the user instead of the fixed one:

1 + date_part('month', $t) - date_part('month', $start) + 12*(date_part('year', $t) - date_part('year', $start))

Month ISO text

Easy: to_char($t, 'YYYY-MM')

This can easily be converted to a sequential month number in the client with MM + 12*(YYYY - y0). And also to a UNIX epoch with

new Date(`${YYYY}-${MM}-01 00:00:00+00`).getTime();

Computing the end of the time range (i.e. the start of the next month) is also feasible:

new Date(`${nextY}-${nextM}-01 00:00:00+00`).getTime();

With

function nextMonth(y, m) {
    m += 1;
    if (m > 12) {
        y += 1;
        m = 1;
    }
    return [y, m];
}

Time range

This seems to work (I've tried some corner cases involving February):

date_part('epoch', date_trunc('month', t)) AS start,
date_part('epoch', date_trunc('month', t + INTERVAL '1 month')) AS end

But, if we want to be able to manipulate it in the client (e.g. to show it as YYYY-MM), the problem is that the values are UTC epochs, so day divisions don't necessarily correspond to the requested TZ.

This is also more complicated to handle, both in the backend, where a single dimension generates two columns, and in the client, where we must handle both as part of a unified property.

jgoizueta commented 6 years ago

Taking into account the previous comment, I now think that even if being able to access the start and end of a time period is very interesting, we can achieve that in the client without the complication of supporting the ranges in the backend. So we could use one of the first two formats (months since X or YYYY-MM strings) and compute the ranges in the client.

jgoizueta commented 6 years ago

I'm learning some JavaScript: here's a better way to compute the month start and end epoch given Y, M:

{ start: Date.UTC(Y, M-1, 1), end: Date.UTC(Y, M, 1) }

Note that it counts months from 0 and that if the month exceeds 11 the year is adjusted.
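Wrapped as a helper, with the overflow behaviour made explicit (a sketch; the name monthRange is illustrative):

```javascript
// Start/end epochs (in ms) for month M (1-based) of year Y.
// Date.UTC counts months from 0, and an out-of-range month adjusts the year,
// so Date.UTC(Y, M, 1) with M = 12 rolls over to January of Y + 1.
function monthRange(Y, M) {
    return { start: Date.UTC(Y, M - 1, 1), end: Date.UTC(Y, M, 1) };
}
```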

jgoizueta commented 6 years ago

We could have auxiliary expressions to change the representation of (grouped) time, e.g. if we stick to the current ordinal numbers from the backend:

const N = clusterMonth($t) // => 579 (for $t = 2018-03-01T05:00:00+00)
MonthsSince(N, '2018-01') // => 2
MonthISO(N) // => '2018-03'
MonthStart(N) // => 1519862400
MonthEnd(N)   // => 1522540800
MonthRange(N)  // [1519862400, 1522540800]
MonthStartMs(N) // => 1519862400000
MonthEndMs(N)   // => 1522540800000
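A client-side sketch of how some of these helpers could be computed from the ordinal N (assuming the 1-based month counter with 1 = 1970-01 described above; all names are illustrative):

```javascript
// Recover (year, month) from the 1-based ordinal month counter (1 = 1970-01).
function monthFromOrdinal(N) {
    const y = 1970 + Math.floor((N - 1) / 12);
    const m = ((N - 1) % 12) + 1; // 1..12
    return [y, m];
}

function monthISO(N) {
    const [y, m] = monthFromOrdinal(N);
    return `${y}-${String(m).padStart(2, '0')}`;
}

function monthStart(N) {
    const [y, m] = monthFromOrdinal(N);
    return Date.UTC(y, m - 1, 1) / 1000; // UNIX epoch in seconds
}

function monthEnd(N) {
    const [y, m] = monthFromOrdinal(N);
    return Date.UTC(y, m, 1) / 1000; // Date.UTC rolls month 12 into the next year
}
```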

... Or we could have directly different cluster expressions for each representation:

clusterMonthsSince($t, '2018-01') // 2
clusterMonthISO($t)  // '2018-03'
clusterMonthRange($t) // [1519862400, 1522540800]

But then we'd have to support GlobalMin etc for each of those, I guess

For the first case, we could make expressions smarter, so that they get the units from the clusterX expression:

since(clusterMonth($t), '2018-01') // => 2
iso(clusterMonth($t)) // => '2018-03'
range(clusterMonth($t)) // => [1519862400, 1522540800]

iso(clusterDay($t)) // => '2018-03-01'
// etc.
jgoizueta commented 6 years ago

Update on the PR #991

It requires this tiler branch

Several families of expressions have been implemented:

Cyclic expressions

These expressions take numeric values (but could be considered categories, as they have only a small number of distinct values): DayOfWeek (1:7), DayOfMonth (1:31), DayOfYear (1:366), HourOfDay (0:23), MinuteOfHour (0:59), QuarterOfYear (1:4).

Each one has an argument for the time property and an optional parameter for the timezone:

DayOfWeek($date, 'Europe/Madrid')

The time zone can be a fixed offset or an IANA name (handling daylight saving time changes).

GlobalMax, GlobalMin, linear behave as expected with these expressions. Note that the limits are the actual limits in the dataset/query, not necessarily the theoretical range.

Serial expressions

These expressions can give the time in seconds, minutes, hours, days, months, years, weeks, quarters, trimesters, semesters, decades, centuries or millennia (phew!).

There are several forms for these expressions. The most complete gives serial numeric values, counting the specified periods of time since a starting time (by default it's 0001-01-01T00:00:00 so that year numbers make sense).

For example: this counts periods of 5 months (in Madrid time) starting in January 2017:

clusterMonth($date, 'Europe/Madrid', 5, '2017-01')

In the result, the number 1 corresponds to the period 2017-01 to 2017-05, 2 to 2017-06 to 2017-10, etc.
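A sketch of this period numbering, computed client-side (UTC only, ignoring the timezone argument for simplicity; the names are illustrative):

```javascript
// Months elapsed between a date (UTC) and the start month (startY, startM).
function monthsSince(date, startY, startM) {
    return (date.getUTCFullYear() - startY) * 12 + (date.getUTCMonth() + 1 - startM);
}

// 1-based serial period number for clusterMonth($date, tz, size, start),
// e.g. size = 5 and start = 2017-01 as in the example above.
function periodNumber(date, size, startY, startM) {
    return Math.floor(monthsSince(date, startY, startM) / size) + 1;
}
```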

GlobalMin, GlobalMax and linear can also be used as expected.

ISO date format

Dates can also be handled as ISO format categories with these functions:

ClusterMonthIso($date, 'Europe/Madrid')

The results would be categories such as '2017-01'.

These have only an optional timezone parameter; no start or count parameter is possible like for the numeric functions.

GlobalMin and GlobalMax are usable here.

Apart from the usual YYYY-MM-DDTHH:MM:SS format, some special syntax has been introduced; see here

Start/End epochs

Finally, there are also expressions that give the start and ending times (internally as UNIX epochs, but interpreted in the specified timezone):

clusterMonthStart($date, 'Europe/Madrid');
clusterMonthEnd($date, 'Europe/Madrid');

Both the start and end expressions can be used with the same date property (for the other expressions here, only one expression can be used with any date property).

Like the ISO functions they support only the optional timezone parameter.

GlobalMin, GlobalMax and linear can be used as expected.

These are useful to combine two animations to control precisely when features appear and disappear, for example:

@monthStart: clusterMonthStart($date, 'Europe/Madrid')
@monthEnd: clusterMonthEnd($date, 'Europe/Madrid')
color: ramp(linear(@monthStart), SUNSET)
@animStart: animation(linear(@monthStart), 5, fade(0.1, 10))
@animEnd: animation(linear(@monthEnd), 5, fade(10, 0.1))
filter: @animStart and @animEnd

Questions to decide:

As discussed with @rochoa & @davidmanzanares, the complete form of the serial expressions (as numerical values, e.g. clusterMonth) might be too complex to explain and use, and we could leave them out and keep only the others.

To avoid so many different expressions, we could, instead of having clusterSecond($date), clusterMinute($date), etc. have a single expression of each kind with an argument for the time units: clusterTime($date, 'second'), clusterTime($date, 'minute').

Summary

davidmanzanares commented 6 years ago

clusterTime($date, 'minute') sounds :checkered_flag: to me

davidmanzanares commented 6 years ago

Regarding start/end I would try to make it work seamlessly with https://github.com/CartoDB/carto-vl/issues/1012

Regarding ClusterMonthIso, is ISO a common name for this?

Would it be possible to use clusterTime for everything? Does that make sense? I still need to think about it, but it would be awesome to have consistency between every date-related expression

jgoizueta commented 6 years ago

Since we're dropping the numeric variants like ClusterMonth, we could use this naming:

Then for the cyclic functions, which return only numbers, we could maintain multiple expressions or use a single expression: (and we could reuse the same)

Or we could use clusterTime for everything, and then have additional expressions for the start and end: timeStart(clusterTime('month', $date)), timeEnd(clusterTime('month', $date))

jgoizueta commented 6 years ago

Now it is possible to use the start, end and iso expressions simultaneously on the same base column. The first two are useful for animations; the latter could be used to get simpler values through interactivity or viewportFeatures, or for styling.

For example:

@mstart: clusterMonthStart($date, 'Europe/Madrid')
@mend: clusterMonthEnd($date, 'Europe/Madrid')
@month: clusterMonthIso($date, 'Europe/Madrid')
@selectedMonths: buckets(@month, ['2018-04', '2018-06', '2018-07', '2018-09'])
color: ramp(linear(@mstart), SUNSET)
width: ramp(@selectedMonths, [50, 40, 30, 10, 20])
filter: animation(linear(@mstart), 5, fade(0.1,10)) and animation(linear(@mend), 5, fade(10,0.1))

Implementation note: a single property is transferred from the backend (the iso category); then it is decoded as three properties into the dataframe. Only the used variants are decoded.

jgoizueta commented 6 years ago

Update: the numeric expressions have been removed and the rest have been unified into:

So we have

And any single property like $date can be used with only a single unit, like 'dayOfMonth' or 'day'. But if it is used with a non-cyclic unit like 'day', then it can be used with up to all three expressions clusterTime, clusterTimeStart & clusterTimeEnd simultaneously. (I hope there's a clearer way to explain this 😅 )

For example:

jgoizueta commented 6 years ago

OK, we now have three variants of this in three different branches to choose from. Each of these derives from the preceding one, and in general each one allows some further simplification of the API and the implementation (the ugly parts of it).

PR #991 (551-time-dimensions)

We have three expressions:

This is very flexible since the user chooses the expressions and where to apply them.

Usage:

Pro/Cons

PR #1055 (timerange-3prop)

This replaces the categorical use of clusterTime together with clusterTimeStart and clusterTimeEnd by a new expression clusterTimeRange (working name) that returns a new type, TimeRange.

This can be a simpler concept than having time categories, start and end dates as separate expressions. But it requires each expression that wants to take advantage of this to be adapted to use it.

So we could end up with two different expressions with different types: one numeric for day of the week, etc., and the other of the time range kind for year, month, ...

Usage:

Pro/Cons

PR #1056 (timerange)

This is a modification of the previous one that uses only two properties (start, end) in the dataframe and computes the ISO category when needed for user presentation (interactivity, viewportFeatures). So the ISO category is not available in the Dataframe nor, consequently, in shaders.

Usage:

Pro/Cons

Precision problem:

Categories are computed from the start/end data in the dataframe (stored as 32-bit floats), so we can lose second accuracy, and then category restoration fails (being a second off makes it impossible to recognize a period of time). This has been patched by changing the DF value mapping to just a translation (offset) to eliminate some of the round-off error when mapping into float32 values and back. It has worked in the tests, but there will probably still be cases where it fails(*). A possible solution is to drop support for second-level resolution and make minutes the smallest usable unit of time (so we could round DF unmapped values to the nearest minute).

(*) Millisecond precision is lost for time spans of about 12 years, and we have a problem at the second level at about 80 years. So it seems the simple hack can be used instead of reducing precision to minutes. In any case the total time span will determine the maximum achievable precision, and we should either document this limit or detect attempts to use higher precision than possible.
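The float32 issue and the offset hack can be demonstrated in plain JS with Math.fround (which rounds a number to the nearest 32-bit float):

```javascript
// A large UNIX epoch does not survive a round-trip through float32:
// near 1.5e9 the float32 spacing is 128, so seconds are lost.
const start = 1519862400;      // 2018-03-01T00:00:00Z (the dataset minimum, say)
const t = start + 61;          // 61 seconds later

const direct = Math.fround(t);              // rounds away the 61 seconds
const offset = Math.fround(t - start) + start; // translate, store, translate back
```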

davidmanzanares commented 6 years ago

Con: some ugly code in windshaft source (need MSN to decide what properties variants are used) and in decoding/encoding (now in metadata specializations) Con: API complexity, inconsistent concept of start/end TZ.

Could you explain this a little bit more? Are these problems unique to the first approach?

I have mixed feelings here. On one hand, the timerange approach seemed more user-friendly to me. On the other hand, I'm worried about too much complexity and things like "Filter could be adapted for ranges" that are not simple nor trivial.

jgoizueta commented 6 years ago

Con: some ugly code in windshaft source (need MSN to decide what properties variants are used) and in decoding/encoding (now in metadata specializations)

That refers mostly to this and could be removed in the two TimeRange branches (since we always need the same 2 or 3 properties for each time dimension).

Ah!, and the decoding/encoding hassle is here. That can simplify a lot, especially in the last branch (timerange with 2 properties), and I have a semi-cooked refactor to put that in a separate codec class with proper naming (decode/encode is used very confusingly here).

Con: API complexity, inconsistent concept of start/end TZ.

Refers to the fact that start/end dates are dates in the TZ requested by the user, but in plain JS you only have the concept of either UTC or the local TZ. In the TimeRange approaches, since we have this specific abstraction, we can somewhat hide this inconsistency (e.g. by allowing access to the start, end values as epoch numbers; and our text representation is TZ-neutral).

jgoizueta commented 6 years ago

After meeting with @davidmanzanares & @rochoa we've decided to go for the last PR: timerange with only two properties. We can fail meaningfully when timerange is not supported by an expression and document the limitations.

Then we can gradually add support for all interesting cases like filters or aggregations later.

The use case of using time categories with buckets etc. (i.e. in the GPU) seems not too important, so we can live without it. (I think that if at some moment the need really arises we could reintroduce a new categorical expression like the clusterTimeIso that we originally had).

jgoizueta commented 6 years ago

Update: we now have a single PR, #991 with the final API design.

clusterTime

There's now a single expression, clusterTime(property, units, timezone), where timezone is optional. There are two different kinds of units: recurring cyclic units like month of the year or day of the week, and serial calendar periods like years, months, days, ...

The type of units determines the type of the expression result: for cyclic units the type is number, e.g. 1-12 for month of the year, 1-7 for day of the week, etc.

For serial, the type is TimeRange which encapsulates a period of time (a year, a month, a minute, ...)

A TimeRange has these properties:

Cyclic (recurring) units as numeric values

Time range units and their text representation

The format is based on ISO 8601:

Usage

For the numeric case, the clusterTime expression can be used as any other numeric value inside linear, animation, etc. (filter doesn't support it yet)

For the time ranges, linear and animation have special support for this type. Animation behaves in a special way with a time range: it will hold for the duration of each time range, and fade in/out will be applied before and after the periods. (the behaviour can be disabled by wrapping the time range in a linear expression, which will be based on the start date of the time ranges)

viewportFeatures and interactivity are also compatible with clusterTime

Notes

Like clusterAvg etc., using clusterTime precludes usage of the property directly. What's more, unlike cluster aggregation functions, once you use clusterTime on a property, that property can only appear inside clusterTime with the same arguments

Pending stuff

jgoizueta commented 6 years ago

I'll leave this short introduction of time dimensions here, which could be recycled for some doc, guide or post:


When backend aggregations are in use (for Windshaft sources), data is clustered spatially, and the cluster aggregation expressions give you the values of data properties summarised over the aggregation clusters (for example clusterMax($temperature), clusterAvg($price), clusterSum($population)).

If a property is used out of any cluster aggregation expression, then it becomes a dimension of the data: it's used for clustering along with the spatial grid (just like torque cubes).

But usually that means that the aggregation is not as effective anymore, because the data clusters will be split by every single distinct value of the property, potentially dividing into a myriad of small clusters, often containing just a single feature.

The clusterTime expression allows you to use a time property as an aggregation dimension (rather than as an aggregated value as with the cluster aggregation expressions like clusterSum, etc.) in a controlled manner; that is, specifying how to discretise it so that we don't have too many unique values.

So clusterTime is a kind of classification or bucketing method. It lets you choose how a time property is grouped into units of time (say months, days, minutes, ...) and then have each cluster of data be split by each unique value of this unit that occurs. For example clusterTime($date, 'month'), clusterTime($arrival_time, 'dayOfWeek').

To use aggregations effectively, each property to be used should be wrapped in either a cluster aggregation expression (clusterSum, clusterAvg, clusterMin, ...) or in a clusterTime expression. In the first case we obtain an aggregate of the property value over each cluster. In the second, the classified property becomes an aggregation dimension.
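For instance, a hypothetical viz combining both kinds of expressions (the property names are illustrative):

```
width: sqrt(clusterSum($population) / 50000)
color: ramp(linear(clusterTime($date, 'month')), SUNSET)
```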

jgoizueta commented 6 years ago

Some desirable changes after talking with @davidmanzanares:

jgoizueta commented 6 years ago

There's some documentation now for TimeCluster, TimeRangeExpr & TimeRange in the code, and some internal documentation here:

davidmanzanares commented 6 years ago

Addressed by https://github.com/CartoDB/carto-vl/pull/991