CartoDB / carto-vl

CARTO VL: a Javascript library to create vector-based visualizations

BSD 3-Clause "New" or "Revised" License


Temporal aggregations #551

Closed jgoizueta closed 6 years ago

jgoizueta commented 6 years ago

This originated in #130. In that ticket, @davidmanzanares said:

Since timestamps can have very high resolutions, but most of the time we don't need those, we should have a mechanism for aggregations. After implementing the naive version (without aggregations, setting the timestamp in dimensions), we could add new aggregation functions that only work with timestamps, for example: HOUR(), DAY(), WEEK(), MIN(), SEC(), MSEC().

We need 3 things here:

[ ] Proper dimensions support in the Maps API for time columns
[ ] A way for CARTO VL to define the granularity of time properties used in visualization (and/or a way to discretize time in animations into a number of buckets as the original torque did).
[ ] Document and add error control to common date/time operations that are unsupported (see https://github.com/CartoDB/carto-vl/issues/523)

We should also consider a similar mechanism for quantization (in buckets, ranges, ...) of non-time properties to be sent as aggregation dimensions in the backend.

jgoizueta commented 6 years ago

Dimension Granularity Proposal

The idea is to allow defining the granularity (level of detail) of columns used as dimensions of the aggregation, i.e. columns which we don't want to summarize (place inside cluster aggregation functions such as clusterAvg()).

For example, given a timestamp column with millisecond resolutions, we may want to obtain it as years, months, weeks, days, hours, etc. to use for animation.

Torque manages the granularity by specifying a number of buckets, but when using time values it is much more convenient to employ the common units of time.

We'll need to expand the aggregation API in the tiler to support this and define a way to use this feature in Carto VL.

Client Side (CARTO VL API)

I propose to implement expressions to define the granularity of a column, e.g. timeInHours($t), in a similar way to clusterAvg and the other cluster aggregations:

Its argument must be a property
Using it prevents the property from being used in raw form ($t) or inside other granularity or cluster aggregation expressions

I'm not considering any alternative, such as variable/property declarations to define the granularity, because similar alternatives were discussed and dismissed for cluster aggregation functions.

The use of a timeInHours($t) expression introduces a new column t_hours defined as a SQL expression (which returns an epoch in hours) which would also be included in the GROUP BY of the aggregation.

We could have similar functions for numeric columns e.g. to classify by break points or number of buckets.

If the instantiation reports that data won't be aggregated and any of these expressions is used, it should fail (as for cluster aggr functions).

Naming

We should define a consistent terminology for these expressions which makes its intent as clear as possible.

Some options I can think of:

timeInX (X = Years/Months/Weeks/Days/Hours/...); numberInBuckets
aggregatedX; aggregatedBuckets
grainX; grainBuckets
inX; inBuckets
asX; asBuckets

Backend (Maps API)

The current dimensions parameter of aggregation would be expanded to accept a parameterized form of the discretization options, e.g.

{
  aggregation: {
    dimensions: {
      t_in_hours: {
        column: 't',
        granularity: 'hours'
      }
    }
  }
}

This would generate this kind of SQL wrapping:

SELECT date_part('epoch', date_trunc('hour', t))/3600 AS t_in_hours, ...
  FROM ...
  GROUP BY 1, ...

We should decide which format is more convenient for the resulting values:

serial (epoch) numbers of the specified units (days since ..., etc)
ISO strings 'YYYY', 'YYYY-MM', 'YYYY-Www', 'YYYY-MM-DD', ...
ISO-like integers YYYY YYYYMM YYYYWW YYYYMMDD YYYYMMDDHH ...

The first format could be used directly by CARTO VL; the others would require some conversion on the client part.
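To make the three candidate formats concrete, here is a minimal sketch of what each would look like for a day granularity. The function name `dayFormats` is hypothetical (it is not part of the tiler or CARTO VL), and it works in UTC only.

```javascript
// Hypothetical helper: the three candidate encodings for a 'day' dimension (UTC only).
function dayFormats(epochSeconds) {
    const d = new Date(epochSeconds * 1000);
    const y = d.getUTCFullYear();
    const m = d.getUTCMonth() + 1; // getUTCMonth() is 0-based
    const day = d.getUTCDate();
    const pad = n => String(n).padStart(2, '0');
    return {
        // 1. serial number: whole days since the UNIX epoch
        serial: Math.floor(epochSeconds / 86400),
        // 2. ISO string 'YYYY-MM-DD'
        iso: `${y}-${pad(m)}-${pad(day)}`,
        // 3. ISO-like integer YYYYMMDD
        isoInteger: y * 10000 + m * 100 + day
    };
}

// 1519880400 is 2018-03-01T05:00:00Z
console.log(dayFormats(1519880400));
// { serial: 17591, iso: '2018-03-01', isoInteger: 20180301 }
```

Only the serial form keeps equal numeric distances between consecutive days; the ISO-like integer jumps at every month and year boundary, which is one reason it would need conversion on the client.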

jgoizueta commented 6 years ago

A couple of ideas from @davidmanzanares:

By default animate($t, ...) would detect that $t is a date, then use its minimum, maximum and the duration of the animation to pick some default unit of time and would apply the corresponding granularity expression
In addition to units of time we should be able to group by durations (i.e. by a time window of some duration, like 20 minute buckets)

jgoizueta commented 6 years ago

We could have some auxiliary expressions like:

timeUnitsFor($t, desiredNumberOfBuckets=100) to find a good unit of time for a property to approximate the desired number of buckets
timeDurationFor($t, desiredNumberOfBuckets=100) to find the window duration to group a property into the desired number of buckets
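A rough sketch of how these two helpers could pick their result. This is only an illustration under assumptions: it takes the property's minimum and maximum (as epoch seconds) instead of the property itself, the unit list is made up for the example, and months and years use average lengths.

```javascript
// Hypothetical sketch: approximate unit lengths in seconds.
// Months and years use average Gregorian lengths, which is enough to choose a unit.
const UNIT_SECONDS = [
    ['second', 1],
    ['minute', 60],
    ['hour', 3600],
    ['day', 86400],
    ['week', 604800],
    ['month', 2629746],
    ['year', 31556952]
];

// Returns the unit whose bucket count is closest to the desired one.
function timeUnitsFor(minEpoch, maxEpoch, desiredNumberOfBuckets = 100) {
    const span = maxEpoch - minEpoch;
    let best = UNIT_SECONDS[0][0];
    let bestError = Infinity;
    for (const [unit, seconds] of UNIT_SECONDS) {
        const buckets = Math.max(1, span / seconds);
        // compare on a log scale so 50 and 200 buckets are equally far from 100
        const error = Math.abs(Math.log(buckets / desiredNumberOfBuckets));
        if (error < bestError) {
            best = unit;
            bestError = error;
        }
    }
    return best;
}

// Returns the window duration (in seconds) that yields the desired number of buckets.
function timeDurationFor(minEpoch, maxEpoch, desiredNumberOfBuckets = 100) {
    return (maxEpoch - minEpoch) / desiredNumberOfBuckets;
}

console.log(timeUnitsFor(0, 345600));    // 4 days of data -> 'hour' (96 buckets)
console.log(timeUnitsFor(0, 31556952));  // one year of data -> 'week' (about 52 buckets)
console.log(timeDurationFor(0, 6000));   // 60
```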

davidmanzanares commented 6 years ago

Yes, I think this is the way to go.

Regarding API I like clusterHours($date), clusterYears($date) or maybe in singular (like in Julia: https://docs.julialang.org/en/stable/stdlib/Dates/index.html ).

Regarding the ability to define the number of buckets, maybe we could specify the minimum duration / granularity instead of the number of buckets, for example: clusterSeconds($date, 4) meaning aggregate $date in time intervals of 4 seconds. Floating point values could be used too: clusterSeconds($date, 0.1) would aggregate at tenths of a second.

Regarding the type of the returned values, I would stick to floating point values in seconds from the minimum value:

decodedDate = property.minimumDate + decodedValue

The reason for not using strings is space and the reason for not using integers is the ability to use aggs like clusterSeconds($date, 0.1). Also, using floating point allows us to always use seconds without problems, since when we are dealing with huge values like decades the exponent is going to save us from overflow while keeping a good precision (relative to those magnitudes).

jgoizueta commented 6 years ago

I like the simple names (not in singular, because to me it seems that hour(t) extracts the hour of the time rather than giving the time in hours). But I thought that, just like in the clusterX functions we should use a common prefix to make evident these are special functions with similar behaviour.

And I really like the hours($t, 4) idea, and also agree with giving all values in seconds from the min.

jgoizueta commented 6 years ago

After talking with @davidmanzanares : in some cases it could be interesting to be able to obtain the hour of day, day of week or month of the year (as opposed to the time in hours, days or months). So we could have expressions for that. A possibility would be to use singular (hour, day, month) for it, but would probably be too confusing.

jgoizueta commented 6 years ago

After conversation with @rochoa & @davidmanzanares

The case of aggregations by the hours of the day, hours of the week, days of the week, months of the year and day of the year are very common and we should focus on providing a good solution for them.

In all these cases we have a unit of granularity and another unit which acts as a cycle (we compute the time in the first units modulo the second units).

So, we'd like to support three different kinds of cases, and we could do it with expressions with three arguments (two of them optional):

Time quantised in given units of time, e.g. hours (but always given as seconds): clusterHours($t) would return the data per hour as number of seconds since the minimum $t in increments of 3600.
Time quantised in N units of time (also given as seconds): clusterHours($t, N) would be like the previous one, but in increments of N * 3600 (e.g. 14400 for N = 4).
Time quantised in N units of time repeating in cycles of another unit: clusterHours($t, 1, 'day') would return the data per hour of day (0-23), again in seconds (so 0 - 82800 in 3600 steps).
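The three cases can be sketched for regular units as follows. `clusterRegular` and its signature are hypothetical, chosen only to illustrate the arithmetic; everything is in UTC epoch seconds and irregular units (months, years) are deliberately left out.

```javascript
// Hypothetical sketch of the three cases, for regular units only.
const SECONDS_PER = { minute: 60, hour: 3600, day: 86400, week: 604800 };

// t and minT are epoch seconds; n is the number of units per bucket;
// cycle is an optional unit name that the result wraps around.
function clusterRegular(unit, t, minT, n = 1, cycle = null) {
    const step = n * SECONDS_PER[unit];
    if (cycle) {
        // Position inside the cycle, counted from the epoch.
        // (A 'week' cycle would start on Thursday, as 1970-01-01 was a Thursday.)
        const inCycle = t % SECONDS_PER[cycle];
        return Math.floor(inCycle / step) * step;
    }
    // Snap both values to bucket boundaries, then measure from the minimum's bucket.
    const snap = x => Math.floor(x / step) * step;
    return snap(t) - snap(minT);
}

const minT = 1519862400;            // 2018-03-01T00:00:00Z
const t = minT + 5 * 3600 + 120;    // 05:02:00 on the same day

console.log(clusterRegular('hour', t, minT));            // 18000 (5 hours)
console.log(clusterRegular('hour', t, minT, 4));         // 14400 (second 4-hour bucket)
console.log(clusterRegular('hour', t, minT, 1, 'day'));  // 18000 (hour 5 of the day)
```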

jgoizueta commented 6 years ago

The SQL expressions will get quite complex when the second argument N is not 1 and irregular units are involved (minutes, hours, days and weeks are regular in that they're always a fixed number of seconds, but months and years are not).

When the third argument (cycle) is present I think it can be more meaningful to return the time in the corresponding units (e.g. months 1 to 12 for clusterMonths($t, 1, 'year')), and it can also simplify the expressions for irregular units.

jgoizueta commented 6 years ago

After experimenting with a branch of the tiler which has an API for time dimensions and returns the values as ordinal values in the requested units rather than epoch numbers (e.g. for days of the week it returns values 0 to 6; for days of the month 1 to 31, etc.), I've detected some issues that we should discuss:

The current backend metadata includes only values (min, max, ...) for the un-aggregated query. Including data about aggregated columns is not straightforward because the values depend on the zoom level, but information about the classified dimensions is not a problem.

Now we map date columns (as epoch numbers) to values in the range [0, 1]. This is not valid for classified time dimensions if we use the units described above. I see two options to handle them:

Treat them as categories (that would be fine for cyclic classifications like day of the week; for non-cyclic ones we could have many distinct values). We could do without metadata for this option by mapping values as they are discovered, as done for text categories
Map them to [0, 1]; we need the max and min values for this:
- We could add this information to the backend API
- ~~We could compute it in the client from the base column limits~~ (seems a bad idea: it would work for the cyclic cases and we'd need to reproduce all the date classifications in the client).

What do you think @davidmanzanares ?

Should we reconsider using epoch numbers for all time classifications? That would simplify mapping the values internally, but we would need to add complexity to provide tools to represent the values in a natural form (epochs are not great for days of the week).

Then, if we go on with the current implementation (ordinals), is the categories approach a valid option? Otherwise I'll add backend metadata options to request stats about aggregation dimensions.

jgoizueta commented 6 years ago

Regarding the last comment, and after talking with @davidmanzanares it seems simpler to use the metadata we have now and derive the dimension limits from it (for cyclics, use the theoretical limits).

That requires us to replicate some time logic in the client, but I guess it's not too bad: https://github.com/CartoDB/carto-vl/blob/99a362941c4e58007b50d97300ea0b95c7e42f78/src/client/WindshaftWorker.js#L27-L75
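A minimal sketch of that derivation, assuming the base column stats are epoch seconds and ignoring time zones. The unit names and the cyclic ranges in the table are assumptions for the example, not the tiler's actual definitions, and only regular serial units are handled.

```javascript
// Hypothetical sketch: limits of a time dimension derived from the base column stats.
// Theoretical ranges for (some) cyclic units; assumed values for illustration.
const CYCLIC_LIMITS = {
    dayOfMonth: [1, 31],
    monthOfYear: [1, 12],
    hourOfDay: [0, 23]
};
const REGULAR_SECONDS = { minute: 60, hour: 3600, day: 86400 };
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function dimensionLimits(unit, baseMin, baseMax) {
    if (has(CYCLIC_LIMITS, unit)) {
        // cyclic units: theoretical limits, independent of the data
        return CYCLIC_LIMITS[unit];
    }
    if (has(REGULAR_SECONDS, unit)) {
        // serial units: classify the base min and max the same way the backend would
        const s = REGULAR_SECONDS[unit];
        return [Math.floor(baseMin / s), Math.floor(baseMax / s)];
    }
    throw new Error(`Unsupported unit: ${unit}`);
}

// base column from 2018-01-01T00:00:00Z to 2018-03-01T05:00:00Z
console.log(dimensionLimits('day', 1514764800, 1519880400));       // [ 17532, 17591 ]
console.log(dimensionLimits('hourOfDay', 1514764800, 1519880400)); // [ 0, 23 ]
```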

jgoizueta commented 6 years ago

Currently pending decisions:

Change metadata to have derived columns information directly?
Deal with TZs in the client?
- NO:
- Use approx stats for mapping values, then do not support accurate global aggregations?
- ... or compute accurate stats at backend?
Single TZ per source to simplify expression parameters?

After meeting with @rochoa & @davidmanzanares:

We could have an initial version not supporting timezones in the client, and add that feature later. That would simplify the implementation for 1.0 and we could take decisions about backend metadata later.
The current format of non-cyclic dimensions, ordinal unit numbers from the epoch start (e.g. number of months since...) seems not ideal and confusing for users. (it could be better if the backend informs of what's the initial period, and even better if we count from the min of the data; and the user could have a parameter to define the origin). We have discussed a couple of alternatives:
- Use ranges: The backend returns values as epoch numbers, but it returns the start and end for each one. For example, for a dimension in months, each row/feature will have a property with the epoch (second) of the start of the month and another one with the end. We could use this for better animations (otherwise if we combine two animations and fade-in/fade-out values we cannot exactly animate each unit correctly).
- Use categories: The backend returns categories. Each category could be the epoch number where it starts, or, more conveniently, it could be an ISO string like 2018-03 for a month. We could introduce new types for this. This would be a special kind of category in the sense that they also have a numeric value, i.e. a specific distance between two values: for categories A and B there's no notion of the separation between them, but the distance from 2018-03 to 2018-05 is double that from 2018-02 to 2018-03.

Background

Metadata

Currently metadata has entries for the base columns (before aggregation), then if a base column is used in aggregations or dimensions those are listed in an array. I have added some support to Metadata to alleviate this, but it's still quite confusing; maybe we should refactor Metadata (once again).

Dimension internal representation

Dates (as epoch numbers) are mapped to the 0,1 interval in property textures to increase the numeric precision, as often the magnitudes are too large for the range of variation.

That seems unnecessary for all cyclic dimensions, and for non cyclic that's probably only relevant for the smallest units (seconds, minutes).

For seconds, 2018 values require 31 bits, 25 bits for minutes.
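These bit counts can be checked with a couple of lines (`bitsNeeded` is just a helper for this check):

```javascript
// Number of bits needed to represent a positive integer magnitude.
function bitsNeeded(n) {
    return Math.floor(Math.log2(n)) + 1;
}

const seconds2018 = Date.UTC(2018, 0, 1) / 1000; // 1514764800
const minutes2018 = seconds2018 / 60;            // 25246080

console.log(bitsNeeded(seconds2018)); // 31
console.log(bitsNeeded(minutes2018)); // 25
```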

To map the values, the max, min stats are needed, which are available currently only for the base date column (and might not be precise for generic MVT sources).

We can compute the dimension stats from them (for cyclic only the theoretical range), but taking into account the TZ correctly would require a time library like moment-timezone or Luxon. This would increase the size of the packages, add complexity and dependencies.

But we might not need accurate stats just for reducing the magnitude of the values; the differences due to incorrect TZ (+/-12 hours at most) would be small when the magnitude is large, which is the case for which mapping is relevant. So, we could choose to use mapping based on the units, then use the (approx) computed min, max from base stats ignoring the TZ.

Dimension metadata

Global aggregation expressions (GlobalMin, GlobalMax, etc.) are computed for properties using the metadata stats. Those expressions are useful as linear defaults. For example animation by default applies linear to a property which in turn will take its min and max, and this makes animations for dates work nicely by default.

Dimension expressions, like cluster aggregation expressions, are not accepted by global aggregation functions, but they're really wrapping actual properties of the data, so it would make sense to accept them as properties. Then we'd need dimension metadata stats.

We have two options to support it: compute it in the client from the base columns stats, or add it to the backend.

Client stats:

We'd need a TZ-supporting library
Currently metadata stored per base column, maybe derived columns should be present too.

To avoid dealing with TZs in the client we could use a single TZ (per layer), and have the backend provide base date columns (and its stats) as epochs in the TZ, so dim limits can be computed accurately. But this subverts the notion of UNIX epoch and introduces discontinuities in the values.

Server stats:

To add dim stats, the only problem is the interaction between different modules in the tiler... the agg queries module could expose methods to perform the required operations.
But then users might expect to have aggregation columns stats too, and they are zoom-dependent, so not as straightforward

This is a cleaner way of not dealing with TZs in the client. Date values are UTC epochs, but all dim grouped values use the requested TZs.

TZ libraries

Two alternatives exist currently that seem actively maintained

moment + moment-timezone Uses TZ files, the size overhead is probably significant (200K? I haven't checked it)
Luxon Newer library, doesn't need TZ files; it relies on native APIs, so support varies
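To give an idea of what relying on native APIs means in practice, here is a sketch that classifies an instant into a month in a given IANA time zone using only the standard `Intl.DateTimeFormat` API, with no library. The helper name `monthIsoInZone` is hypothetical, and results depend on the time zone data shipped with the runtime.

```javascript
// Sketch: 'YYYY-MM' of an instant in a given IANA time zone, using only the native Intl API.
function monthIsoInZone(epochMs, timeZone) {
    const parts = new Intl.DateTimeFormat('en-US', {
        timeZone,
        year: 'numeric',
        month: '2-digit'
    }).formatToParts(new Date(epochMs));
    const get = type => parts.find(p => p.type === type).value;
    return `${get('year')}-${get('month')}`;
}

const instant = Date.UTC(2018, 2, 31, 22, 30); // 2018-03-31T22:30:00Z

console.log(monthIsoInZone(instant, 'UTC'));           // '2018-03'
console.log(monthIsoInZone(instant, 'Europe/Madrid')); // '2018-04' (00:30 local time on April 1st)
```

The same instant falls in different months depending on the zone, which is exactly the kind of difference the client would have to reproduce if it computed dimension stats itself.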

jgoizueta commented 6 years ago

Example: how to return a (non-cyclic) month.

Month ordinal number (counting from some origin)

My initial implementation is a month counter, starting with 1 = 1970-01:

date_part('month', $t) + 12*(date_part('year', $t)-date_part('year', to_timestamp(0.0)))

(time zones are taken into account because $t is substituted ultimately by timezone(${tz}, to_timestamp(date_part('epoch', the_time))))

There we could have an optional origin (month 1) definable by the user instead of the fixed one:

1 + date_part('month', $t) - date_part('month', $start) + 12*(date_part('year', $t) - date_part('year', $start))
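For comparison, the same counter could be sketched in JavaScript as below. This is an illustration only: the name `monthOrdinal` is made up, and unlike the SQL version it works in UTC and ignores the requested time zone.

```javascript
// Sketch of the month counter in JavaScript (UTC only).
// origin is { year, month } with a 1-based month; by default 1 = 1970-01.
function monthOrdinal(epochSeconds, origin = { year: 1970, month: 1 }) {
    const d = new Date(epochSeconds * 1000);
    const year = d.getUTCFullYear();
    const month = d.getUTCMonth() + 1; // getUTCMonth() is 0-based
    return 1 + (month - origin.month) + 12 * (year - origin.year);
}

// 1519880400 is 2018-03-01T05:00:00Z
console.log(monthOrdinal(1519880400));                            // 579
console.log(monthOrdinal(1519880400, { year: 2018, month: 1 }));  // 3
```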

Month ISO text

Easy: to_char($t, 'YYYY-MM')

This can easily be converted to a sequential month number in the client with MM + 12*(YYYY-y0). And also to a UNIX epoch with

new Date(`${YYYY}-${MM}-01 00:00:00+00`).getTime();

Computing the end of the time range (i.e. the start of the next month) is also feasible:

new Date(`${nextY}-${nextM}-01 00:00:00+00`).getTime();

With

function nextMonth(y, m) {
    m += 1;
    if (m > 12) {
         y += 1;
         m = 1;
    }
    return [y, m];
}

Time range

This seems to work: (I've tried some corner cases involving February)

date_part('epoch', date_trunc('month', t)) AS start,
date_part('epoch', date_trunc('month', t + INTERVAL '1 month')) AS end

But, if we want to be able to manipulate it in the client (e.g. to show it as YYYY-MM), the problem is that the values are UTC epochs, so day divisions don't necessarily correspond to the requested TZ.

This is also more complicated to handle, both in backend where a single dimension generates two columns and in the client, where we must handle both as part of a unified property.

jgoizueta commented 6 years ago

Taking into account the previous comment, I now think that even if being able to access the start and end of a time period is very interesting, we can achieve that in the client without the complication of supporting the ranges in the backend. So we could use one of the first two formats (months since X or YYYY-MM strings) and compute the ranges in the client.

jgoizueta commented 6 years ago

I'm learning some JavaScript: here's a better way to compute the month start and end epoch given Y, M:

{ start: Date.UTC(Y, M-1, 1), end: Date.UTC(Y, M, 1) }

Note that it counts months from 0 and that if the month exceeds 11 the year is adjusted.
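Wrapped in a small function (the name `monthRange` is only for this example), with the December rollover made explicit:

```javascript
// Month range as epoch milliseconds; month is 1-based (1 = January).
function monthRange(year, month) {
    return {
        start: Date.UTC(year, month - 1, 1),
        end: Date.UTC(year, month, 1) // month index 12 rolls over to January of year + 1
    };
}

console.log(monthRange(2018, 3));
// { start: 1519862400000, end: 1522540800000 }
console.log(monthRange(2018, 12).end === Date.UTC(2019, 0, 1)); // true
```

Note that `Date.UTC` returns milliseconds, so the values need to be divided by 1000 to compare them with the epoch seconds used elsewhere in this thread.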

jgoizueta commented 6 years ago

We could have auxiliary expressions to change the representation of (grouped) time, e.g. if we stick to the current ordinal numbers from the backend:

const N = clusterMonth($t) // => 579 (for $t = 2018-03-01T05:00:00+00)
MonthsSince(N, '2018-01') // => 2
MonthISO(N) // => '2018-03'
MonthStart(N) // => 1519862400
MonthEnd(N)   // => 1522540800
MonthRange(N)  // [1519862400, 1522540800]
MonthStartMs(N) // => 1519862400000
MonthEndMs(N)   // => 1522540800000

... Or we could have directly different cluster expressions for each representation:

clusterMonthsSince($t, '2018-01') // 2
clusterMonthISO($t)  // '2018-03'
clusterMonthRange($t) // [1519862400, 1522540800]

But then we'd have to support GlobalMin etc for each of those, I guess

For the first case, we could make expressions smarter, so that they get the units from the clusterX expression:

since(clusterMonth($t), '2018-01') // => 2
iso(clusterMonth($t)) // => '2018-03'
range(clusterMonth($t)) // => [1519862400, 1522540800]

iso(clusterDay($t)) // => '2018-03-01'
// etc.
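The conversions behind such expressions are simple for months. A sketch in plain JavaScript, assuming the ordinal counts months with 1 = 1970-01 and everything is in UTC (the helper names mirror the ones above but are hypothetical, not CARTO VL expressions):

```javascript
// Hypothetical helpers; n is a month ordinal where 1 = 1970-01 (n >= 1).
function monthParts(n) {
    const zeroBased = n - 1;
    return { year: 1970 + Math.floor(zeroBased / 12), month: (zeroBased % 12) + 1 };
}

function monthISO(n) {
    const { year, month } = monthParts(n);
    return `${year}-${String(month).padStart(2, '0')}`;
}

// [start, end] of the month as epoch seconds
function monthRangeSeconds(n) {
    const { year, month } = monthParts(n);
    return [Date.UTC(year, month - 1, 1) / 1000, Date.UTC(year, month, 1) / 1000];
}

// Number of months elapsed since the month given as 'YYYY-MM'
function monthsSince(n, iso) {
    const [y, m] = iso.split('-').map(Number);
    return n - (m + 12 * (y - 1970));
}

console.log(monthISO(579));               // '2018-03'
console.log(monthRangeSeconds(579));      // [ 1519862400, 1522540800 ]
console.log(monthsSince(579, '2018-01')); // 2
```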

jgoizueta commented 6 years ago

Update on the PR #991

It requires this tiler branch

Several families of expressions have been implemented:

Cyclic expressions

These expressions take numeric values (but could be considered categories, as they have only a small number of distinct values): DayOfWeek (1:7), DayOfMonth (1:31), DayOfYear (1:366), HourOfDay (0:23), MinuteOfHour (0:59), QuarterOfYear (1:4).

Each one has an argument for the time property and an optional parameter for the timezone:

DayOfWeek($date, 'Europe/Madrid')

Time zone can be a fixed offset or an IANA name (handling daylight saving time changes).

GlobalMax, GlobalMin, linear behave as expected with these expressions. Note that the limits are the actual limits in the dataset/query, not necessarily the theoretical range.

Serial expressions

These expressions can give the time in seconds, minutes, hours, days, months, years, weeks, quarters, trimesters, semesters, decades, centuries or millennia (phew!).

There are several forms for these expressions. The most complete gives serial numeric values, counting the specified periods of time since a starting time (by default it's 0001-01-01T00:00:00 so that year numbers make sense).

For example: this counts periods of 5 months (in Madrid time) starting in January 2017:

clusterMonth($date, 'Europe/Madrid', 5, '2017-01')

In the result, the number 1 corresponds to the period 2017-01 to 2017-05, 2 to 2017-06 to 2017-10, etc.
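The numbering of these periods can be sketched as follows. `monthPeriod` is a hypothetical helper that takes calendar year and month directly, so it leaves out the time zone conversion that the real expression would apply first.

```javascript
// Hypothetical sketch: serial number of the N-month period that contains a given month.
// Months are 1-based; (startYear, startMonth) is the first month of period 1.
function monthPeriod(year, month, count, startYear, startMonth) {
    const monthsFromStart = (month - startMonth) + 12 * (year - startYear);
    return Math.floor(monthsFromStart / count) + 1;
}

console.log(monthPeriod(2017, 1, 5, 2017, 1));  // 1
console.log(monthPeriod(2017, 5, 5, 2017, 1));  // 1
console.log(monthPeriod(2017, 6, 5, 2017, 1));  // 2
console.log(monthPeriod(2017, 11, 5, 2017, 1)); // 3
```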

GlobalMin, GlobalMax and linear can also be used as expected.

ISO date format

Dates can also be handled as ISO format categories with these functions:

ClusterMonthIso($date, 'Europe/Madrid')

The results would be categories such as '2017-01'.

These have only an optional timezone parameter; no start or count can be specified, unlike for the numeric functions.

GlobalMin and GlobalMax are usable here.

Apart from the usual YYYY-MM-DDTHH:MM:SS format some special syntax has been introduced, see here

Start/End epochs

Finally, there are also expressions that give the start and ending times (internally as UNIX epochs, but interpreted in the specified timezone):

clusterMonthStart($date, 'Europe/Madrid');
clusterMonthEnd($date, 'Europe/Madrid');

Both the start and end expressions can be used with the same date property (for the other expressions here only one expression can be used with any given date property).

Like the ISO functions they support only the optional timezone parameter.

GlobalMin, GlobalMax and linear can be used as expected.

These are useful to combine two animations to control precisely when features appear and disappear, for example:

@monthStart: clusterMonthStart($date, 'Europe/Madrid')
@monthEnd: clusterMonthEnd($date, 'Europe/Madrid')
color: ramp(linear(@monthStart), SUNSET)
@animStart: animation(linear(@monthStart), 5, fade(0.1, 10))
@animEnd: animation(linear(@monthEnd), 5, fade(10, 0.1))
filter: @animStart and @animEnd

Questions to decide:

As discussed with @rochoa & @davidmanzanares, the complete form of the serial expressions (as numerical values, e.g. clusterMonth) might be too complex to explain and use, so we could leave them out and keep only the others.

To avoid so many different expressions, we could, instead of having clusterSecond($date), clusterMinute($date), etc. have a single expression of each kind with an argument for the time units: clusterTime($date, 'second'), clusterTime($date, 'minute').

Summary

Cyclic: clusterXOfY($date) as number, timezone optional
Serial: clusterX($date) X=year/month/day/... as number, count, start and timezone optional
Serial ISO: clusterXIso($date) as category (YYYY-MM-DD...) timezone optional
Serial Start/End clusterXStart($date), clusterXEnd($date) as date, timezone optional

davidmanzanares commented 6 years ago

clusterTime($date, 'minute') sounds :checkered_flag: to me

davidmanzanares commented 6 years ago

Regarding start/end I would try to make it work seamlessly with https://github.com/CartoDB/carto-vl/issues/1012

Regarding ClusterMonthIso, is ISO a common name for this?

Would it be possible to use clusterTime for everything? does that make sense? I still need to think, but it would be awesome to have consistency between every date related expression

jgoizueta commented 6 years ago

Since we're dropping the numeric variants like ClusterMonth, we could use this naming:

clusterTime('month', $date, timezone) => '2018-01' for text results (formerly Iso functions)
clusterTimeStart('month', $date, timezone) and clusterTimeEnd('month', $date, timezone) => Date() for the start and end dates

Then for the cyclic functions, which return only numbers, we could maintain multiple expressions or use a single expression: (and we could reuse the same)

clusterTime('dayOfWeek', $date, timezone)

Or we could use clusterTime for everything, and then have additional expressions for the start and end: timeStart(clusterTime('month', $date)), timeEnd(clusterTime('month', $date))

jgoizueta commented 6 years ago

Now it is possible to use simultaneously on the same base column the start, end and the iso expressions. The first two are useful for animations, the other could be used to get simpler values through interactivity or viewportFeatures, or for styling.

For example:

@mstart: clusterMonthStart($date, 'Europe/Madrid')
@mend: clusterMonthEnd($date, 'Europe/Madrid')
@month: clusterMonthIso($date, 'Europe/Madrid')
@selectedMonths: buckets(@month, ['2018-04', '2018-06', '2018-07', '2018-09'])
color: ramp(linear(@mstart), SUNSET)
width: ramp(@selectedMonths, [50, 40, 30, 10, 20])
filter: animation(linear(@mstart), 5, fade(0.1,10)) and animation(linear(@mend), 5, fade(10,0.1))

Implementation note: a single property is transferred from the backend (the iso category); then it is decoded as three properties into the dataframe. Only the used variants are decoded.

jgoizueta commented 6 years ago

Update: the numeric expressions have been removed and the rest have been unified into:

clusterTime(property, timeUnits, timezone) for example:
- clusterTime($date, 'month') // UTC by default // this returns a category ('2018-01'...)
- clusterTime($date, 'month', 'Europe/Madrid')
- clusterTime($date, 'dayOfWeek') // for cyclic units the result is a number, e.g. 1-7 here
clusterTimeStart(property, timeUnits, timezone) returns a Date and does not accept cyclic units
- clusterTimeStart($date, 'month') // UTC by default
- clusterTimeStart($date, 'month', 'Europe/Madrid')
clusterTimeEnd(property, timeUnits, timezone) returns a Date and does not accept cyclic units
- clusterTimeEnd($date, 'month') // UTC by default
- clusterTimeEnd($date, 'month', 'Europe/Madrid')

So we have

cyclic times as numbers, e.g. clusterTime($date, 'dayOfMonth') -> 1 to 31
times as categories, e. g. clusterTime($date, 'day') -> '2018-01-30'
times as start or end dates, e.g. clusterTimeStart($date, 'day'), clusterTimeEnd($date, 'day') -> 2018-01-30T00:00:00.000Z, 2018-01-31T00:00:00.000Z

And for any single property like $date, it can be used only with a single unit like 'dayOfMonth' or 'day'. But if it is used with a noncyclic unit like 'day', then it can be used with up to all the three expressions clusterTime, clusterTimeStart & clusterTimeEnd simultaneously. (I hope there's a clearer way to explain this 😅 )

For example:

using clusterTime($date, 'dayOfMonth') and clusterTime($date, 'day') 🚫
using clusterTime($date, 'day') and clusterTime($date, 'month') 🚫
using clusterTime($date, 'day') and clusterTimeStart($date, 'month') 🚫
using clusterTimeStart($date, 'day') and clusterTimeEnd($date, 'month') 🚫
using clusterTime($date, 'day') and clusterTimeStart($date, 'day') ✅
using clusterTimeStart($date, 'day') and clusterTimeEnd($date, 'day') ✅
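Maybe the rule is easier to state as code. The following is a hypothetical check, not the actual CARTO VL validation; it assumes the cyclic unit names follow the camelCase pattern of 'dayOfWeek' and 'dayOfMonth' used above.

```javascript
// Hypothetical check of which expressions can share one date property.
// Cyclic unit names other than dayOfWeek/dayOfMonth are assumed for the example.
const CYCLIC_UNITS = new Set([
    'dayOfWeek', 'dayOfMonth', 'dayOfYear', 'hourOfDay', 'minuteOfHour', 'quarterOfYear'
]);
const TIME_EXPRESSIONS = ['clusterTime', 'clusterTimeStart', 'clusterTimeEnd'];

// uses: array of { expr, unit } applied to the same property
function canCombine(uses) {
    const units = new Set(uses.map(u => u.unit));
    if (units.size > 1) {
        return false; // only a single unit per property
    }
    const [unit] = units;
    if (CYCLIC_UNITS.has(unit)) {
        // cyclic units are only accepted by clusterTime
        return uses.every(u => u.expr === 'clusterTime');
    }
    // non-cyclic unit: any mix of the three expressions is fine
    return uses.every(u => TIME_EXPRESSIONS.includes(u.expr));
}

console.log(canCombine([
    { expr: 'clusterTime', unit: 'dayOfMonth' },
    { expr: 'clusterTime', unit: 'day' }
])); // false
console.log(canCombine([
    { expr: 'clusterTime', unit: 'day' },
    { expr: 'clusterTimeStart', unit: 'day' }
])); // true
```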

jgoizueta commented 6 years ago

OK, we now have three variants of this in three different branches to choose from. Each of these derives from the preceding one and in general each one allows some further simplification of the API and the implementation (the ugly parts of it).

PR #991 (`551-time-dimensions`)

We have three expressions:

clusterTime which takes two different types: category with normal (serial) units of time, and numeric for cyclic units (month of the year, day of the week...).
clusterTimeStart and clusterTimeEnd are of date type (internally as epoch numbers).

This is very flexible since the user chooses the expressions and where to apply them.

Usage:

Linear NP works as expected (for numeric/date) with proper default limits.
Animation This can do animations right away and the HOLD trick can be used with a pair of start/end expressions.
Categories (use in GPU): can use buckets to use categories with ramps, etc.
Interactivity NP
viewportFeatures NP
Filter NP, can be done on the category or with date ranges.

Pro/Cons

Pro: only the properties needed for the expressions used get to the Dataframe
Con: some ugly code in windshaft source (need MSN to decide what properties variants are used) and in decoding/encoding (now in metadata specializations)
Con: API complexity, inconsistent concept of start/end TZ.

PR #1055 (`timerange-3prop`)

This replaces the categorical use of clusterTime together with clusterTimeStart and clusterTimeEnd by a new expression, clusterTimeRange (working name), that returns a new type, TimeRange.

This can be a simpler concept than having time categories, start and end dates as separate expressions. But each expression that is to take advantage of this needs to be adapted to use it.

So we could end up with two different expressions with different types one numeric for day of the week, etc, and the other of time range kind for year, month, ...

Usage:

Linear OK (has been adapted)
Animation Can work now with linear, could work directly on time ranges if we adapt animation (and dispense with the need of the HOLD trick).
Categories : We could adapt buckets, etc.
Interactivity NP (already adapted)
viewportFeatures NP (already adapted)
Filter could be adapted both for category selection (clusterTR($date) = '2017-03') and date ranges (clusterTR($date) > time(...)).

Pro/Cons

Con: three properties (ISO, start, end) for each timerange expression are always inserted in the dataframe.
Pro: API seems simpler, less tricky concepts
Con: Needs expression to be adapted to be timerange-aware.

PR #1056 (`timerange`)

This is a modification of the previous one that uses only two properties (start, end) in the dataframe, and computes the ISO category when needed for user presentation (interactivity, viewportFeatures). So the ISO category is not available in the Dataframe nor consequently in shaders.

Usage:

Linear OK (has been adapted)
Animation Can work now with linear, could work directly on time ranges if we adapt animation (and dispense with the need of the HOLD trick).
Categories : Not possible to use in the GPU, so cannot be used for buckets etc.
Interactivity NP (already adapted, with categories computed on the fly)
viewportFeatures NP (already adapted, same)
Filter could be adapted for ranges, but not for category selection.

Pro/Cons

Pro/Con: two properties (start, end) for each timerange expression are always inserted in the dataframe, so not as good as first option, but better than the second.
Pro: Simple API, less tricky concepts, simplified implementation in some areas.
Con: Needs expression to be adapted to be timerange-aware.
Con: numeric accuracy needed for category restoration (see below)

Precision problem:

Categories are computed from the start and end data in the dataframe (stored as 32-bit floats), so we can lose second accuracy and then category restoration fails (being a second off makes it impossible to recognize a period of time). This has been patched by changing the DF value mapping to just a translation (offset) to eliminate some of the roundoff error when mapping into float32 values and back. It has worked in the tests but there will probably still be cases where it fails(*). A possible solution is to drop support for second-level resolution, and make minutes the smallest unit of time usable (so we could round DF unmapped values to the nearest minute).

(*) Millisecond precision is lost for time spans of about 12 years; and we have a problem at the second level at about 80 years. So it seems the simple hack can be used instead of reducing precision to minutes. In any case the total time span will determine the maximum achievable precision, and we should either document this limit or detect attempts to use higher than possible precision.

davidmanzanares commented 6 years ago

Con: some ugly code in windshaft source (need MSN to decide what properties variants are used) and in decoding/encoding (now in metadata specializations)
Con: API complexity, inconsistent concept of start/end TZ.

Could you explain this a little bit more? Are these problems unique to the first approach?

I have mixed feelings here. On one hand, the timerange approach seemed more user-friendly to me. On the other hand, I'm worried about too much complexity and things like "Filter could be adapted for ranges" that are not simple nor trivial.

jgoizueta commented 6 years ago

Con: some ugly code in windshaft source (need MSN to decide what properties variants are used) and in decoding/encoding (now in metadata specializations)

That refers mostly to this and could be removed in the two TimeRange branches (since we always need the same 2 or 3 properties for each time dimension).

Ah, and the decoding/encoding hassle is here. That can be simplified a lot, especially in the last branch (timerange with 2 properties), and I have a semi-cooked refactor to put that in a separate codec class with proper naming (decode/encode is used very confusingly here).

Con: API complexity, inconsistent concept of start/end TZ.

Refers to the fact that start/end dates are dates in the TZ requested by the user, but in plain JS you only have the concept of either UTC or local TZ. In the TimeRange approaches, since we have this specific abstraction, we can more or less hide this inconsistency (e.g. by allowing access to the start and end values as epoch numbers; and our text representation is TZ-neutral).

jgoizueta commented 6 years ago

After meeting with @davidmanzanares & @rochoa we've decided to go for the last PR: timerange with only two properties. We can fail meaningfully when timerange is not supported by an expression and document the limitations.

Then we can gradually add support for all interesting cases like filters or aggregations later.

The use case of using time categories with buckets etc. (i.e. in the GPU) seems not too important, so we can live without it. (I think that if at some moment the need really arises we could reintroduce a new categorical expression like the clusterTimeIso that we originally had).

jgoizueta commented 6 years ago

Update: we now have a single PR, #991, with the final API design.

`clusterTime`

There's now a single expression, clusterTime(property, units, timezone), where timezone is optional. There are two different kinds of units: recurring cyclic units, like month of the year or day of the week, and serial calendar periods, like years, months, days, ...

The type of units determines the type of the expression result: for cyclic the type is number: e.g. 1-12 for month of the year, 1-7 for day of the week, etc.

For serial, the type is TimeRange which encapsulates a period of time (a year, a month, a minute, ...)

A TimeRange has these properties:

text: an ISO 8601-based representation of the period, e.g. '2018-04' for a month, '2018-04-01T03' for an hour.
startDate, endDate: the start and end of the period; note that the period is defined as startDate <= t < endDate, so the end date is just after the end of the period. For example, the start of April 2018 is 2018-04-01T00:00:00 and the end is 2018-05-01T00:00:00.
startValue, endValue: the same instants but as millisecond epochs (since 1970-01-01 in the requested time zone). Note that startDate/endDate are relative to the requested time zone, like startValue/endValue (UTC by default), so the time zone that the date objects themselves portray is meaningless.
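As a rough illustration of these properties, here is a minimal sketch that builds a plain object shaped like a TimeRange for a calendar month. `monthTimeRange` is a hypothetical helper, not the library's class; it assumes UTC as the requested time zone and four-digit years.

```javascript
// Hypothetical helper: a plain object shaped like the TimeRange described above,
// for a calendar month in UTC. `month` is 1-12.
function monthTimeRange(year, month) {
    const startValue = Date.UTC(year, month - 1, 1);
    // Date.UTC normalizes month overflow, so December rolls over to next January.
    const endValue = Date.UTC(year, month, 1);
    return {
        text: `${year}-${String(month).padStart(2, '0')}`,
        startValue,
        endValue,
        startDate: new Date(startValue),
        endDate: new Date(endValue),
        // Half-open membership test: startValue <= t < endValue
        contains(t) {
            return t >= startValue && t < endValue;
        }
    };
}

const april = monthTimeRange(2018, 4);
console.log(april.text);                    // '2018-04'
console.log(april.endDate.toISOString());   // '2018-05-01T00:00:00.000Z'
```

Because the interval is half-open, the end instant of one period is the start instant of the next, so consecutive periods neither overlap nor leave gaps.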

Cyclic (recurring) units as numeric values

semesterOfYear (1:2)
trimesterOfYear (1:3)
quarterOfYear (1:4)
monthOfYear (1:12)
weekOfYear (1:53)
dayOfWeek (1:7)
dayOfMonth (1:31)
dayOfYear (1:366)
hourOfDay (0:23)
minuteOfHour (0:59)
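A few of these values can be derived from a timestamp as in the following sketch. `cyclicUnits` is a hypothetical helper working in UTC; it omits the week-based units (weekOfYear, dayOfWeek) because their numbering conventions are not spelled out here.

```javascript
// Hypothetical helper: derives some cyclic units from an epoch in milliseconds (UTC).
function cyclicUnits(epochMs) {
    const d = new Date(epochMs);
    const month = d.getUTCMonth() + 1; // getUTCMonth() is 0-based
    return {
        semesterOfYear: Math.ceil(month / 6),  // 1:2
        trimesterOfYear: Math.ceil(month / 4), // 1:3
        quarterOfYear: Math.ceil(month / 3),   // 1:4
        monthOfYear: month,                    // 1:12
        dayOfMonth: d.getUTCDate(),            // 1:31
        hourOfDay: d.getUTCHours(),            // 0:23
        minuteOfHour: d.getUTCMinutes()        // 0:59
    };
}

console.log(cyclicUnits(Date.UTC(2018, 7, 15, 13, 45)));
```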

Time range units and their text representation

The format is based on ISO 8601:

year 'YYYY'
month 'YYYY-MM'
day 'YYYY-MM-DD'
hour 'YYYY-MM-DDThh'
minute 'YYYY-MM-DDThh:mm'
second 'YYYY-MM-DDThh:mm:ss'
week represented as '2018-W03' (third week of 2018, i.e. 2018-01-15 to 2018-01-21)
quarter (three months) as '2018-Q2' (second quarter of 2018, i.e. 2018-04 to 2018-06)
semester (six months) as '2018S2' (second semester of 2018, i.e. 2018-07 to 2018-12)
trimester (four months) as '2018t2' (second trimester of 2018, i.e. 2018-05 to 2018-08)
decade as 'D201' (decade 201, i.e. 2010 to 2019)
century as 'C21' (21st century, i.e. 2001 to 2100)
millennium as 'M3' (3rd millennium: 2001 to 3000)
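The text formats for the simpler units can be sketched as below. `timeRangeText` is a hypothetical formatter working in UTC, not the library's implementation; it covers only a subset of the units, leaving out weeks (which need ISO week numbering) and the longer periods.

```javascript
const pad = (n, width = 2) => String(n).padStart(width, '0');

// Hypothetical formatter for a subset of the serial units, working in UTC.
function timeRangeText(epochMs, units) {
    const d = new Date(epochMs);
    const month = d.getUTCMonth() + 1;
    const Y = pad(d.getUTCFullYear(), 4);
    const M = pad(month);
    const D = pad(d.getUTCDate());
    const h = pad(d.getUTCHours());
    const m = pad(d.getUTCMinutes());
    const s = pad(d.getUTCSeconds());
    switch (units) {
        case 'year': return Y;
        case 'month': return `${Y}-${M}`;
        case 'day': return `${Y}-${M}-${D}`;
        case 'hour': return `${Y}-${M}-${D}T${h}`;
        case 'minute': return `${Y}-${M}-${D}T${h}:${m}`;
        case 'second': return `${Y}-${M}-${D}T${h}:${m}:${s}`;
        case 'quarter': return `${Y}-Q${Math.ceil(month / 3)}`;
        default: throw new Error(`Unsupported units: ${units}`);
    }
}

const t = Date.UTC(2018, 3, 1, 3, 5, 9);
console.log(timeRangeText(t, 'hour'));    // '2018-04-01T03'
console.log(timeRangeText(t, 'quarter')); // '2018-Q2'
```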

Usage

For the numeric case, the clusterTime expression can be used as any other numeric value inside linear, animation, etc. Note that filter doesn't support it yet.

For the time ranges, linear and animation have special support for this type. Animation behaves in a special way with a time range: it will hold for the duration of each time range, and fade in/out will be applied before and after the periods. (This behaviour can be disabled by wrapping the time range in a linear expression, which will be based on the start date of the time ranges.)

viewportFeatures and interactivity are also compatible with clusterTime.
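The combinations above could be written roughly as follows. The expressions are kept as plain strings so the snippet runs without the library; the property names `$date` and `$arrival_time`, the duration of 10 and the fade values are illustrative assumptions, and the exact signatures of `linear`, `animation` and `fade` should be checked against the CARTO VL reference.

```javascript
// Expression strings only; with CARTO VL loaded they would be used inside a viz definition.
// Property names, duration and fade values are assumptions chosen for illustration.

// Numeric (cyclic) case: the result is a number from 1 to 7, mapped to 0..1 by linear.
const numericInput = "linear(clusterTime($arrival_time, 'dayOfWeek'), 1, 7)";

// Time range case: animation holds for the duration of each month.
const timeRangeAnimation = "animation(clusterTime($date, 'month'), 10, fade(0.5, 0.5))";

// Wrapping in linear disables the hold behaviour; the animation follows start dates.
const startDateAnimation = "animation(linear(clusterTime($date, 'month')), 10, fade(0.5, 0.5))";

console.log([numericInput, timeRangeAnimation, startDateAnimation].join('\n'));
```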

Notes

Like clusterAvg etc., using clusterTime precludes usage of the property directly. What's more, unlike cluster aggregation functions, once you use clusterTime on a property, that property can only appear inside clusterTime with the same arguments.

Pending stuff

[x] fix broken tests
[x] Resolve conflicts with master. (Null/NaN support should be moved into the new property Codecs)
[x] Review and complete integration of codecs with GeoJSON, MVT sources
[x] Clean up remaining unnecessary code (review codec operations, windshaft/metadata)
[x] Metadata reorganization: some code in the base class belongs to Windshaft specialization
[x] Tests
[x] Docs

jgoizueta commented 6 years ago

I'll leave this short introduction of time dimensions here, which could be recycled for some doc, guide or post:

When backend aggregations are in use (for Windshaft sources), data is clustered spatially, and the cluster aggregation expressions give you the values of data properties summarised over the aggregation clusters (for example clusterMax($temperature), clusterAvg($price), clusterSum($population)).

If a property is used out of any cluster aggregation expression, then it becomes a dimension of the data: it's used for clustering along with the spatial grid (just like torque cubes).

But usually that means that the aggregation is not so effective anymore, because the data cluster will be split by every single distinct value of the property, potentially dividing into a myriad of small clusters, often containing just a single feature.

The clusterTime expression allows you to use a time property as an aggregation dimension (rather than as an aggregated value as with the cluster aggregation expressions like clusterSum, etc.) in a controlled manner; that is, specifying how to discretise it so that we don't have too many unique values.

So clusterTime is a kind of classification, or grouping into buckets, method. It lets you choose how a time property is grouped into units of time (say months, days, minutes, ...) and then have each cluster of data split by each unique value of this unit that occurs. For example clusterTime($date, 'month'), clusterTime($arrival_time, 'dayOfWeek').

To use aggregations effectively, each property to be used should be wrapped in either a cluster aggregation expression (clusterSum, clusterAvg, clusterMin, ...) or in a clusterTime expression. In the first case we obtain an aggregate of the property value over each cluster. In the second, the classified property becomes an aggregation dimension.
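The difference between a raw property and a discretised one as a dimension can be shown with a small in-memory simulation. This is only a sketch: the real aggregation happens in the backend, and the `cell` field, the sample features and the `aggregate` helper are hypothetical stand-ins for the spatial grid and the tiler.

```javascript
// Hypothetical features; `cell` stands in for the spatial grid cell of each feature.
const features = [
    { cell: 'A', date: '2018-04-03T10:00:00Z', price: 10 },
    { cell: 'A', date: '2018-04-20T18:30:00Z', price: 30 },
    { cell: 'A', date: '2018-05-02T09:15:00Z', price: 50 },
    { cell: 'B', date: '2018-04-11T12:00:00Z', price: 20 }
];

// Groups rows by the key returned by `dimension` and averages `price` per group,
// playing the role of clusterAvg($price).
function aggregate(rows, dimension) {
    const groups = new Map();
    for (const row of rows) {
        const key = dimension(row);
        const g = groups.get(key) || { count: 0, sum: 0 };
        g.count += 1;
        g.sum += row.price;
        groups.set(key, g);
    }
    return [...groups].map(([key, g]) => ({ key, count: g.count, avgPrice: g.sum / g.count }));
}

// Raw timestamp as a dimension: every distinct instant splits the cluster.
const byInstant = aggregate(features, f => `${f.cell}|${f.date}`);

// Month bucket as a dimension, which is what clusterTime($date, 'month') expresses.
const byMonth = aggregate(features, f => `${f.cell}|${f.date.slice(0, 7)}`);

console.log(byInstant.length, byMonth.length); // 4 3
```

With the raw timestamp every feature ends up in its own cluster, while the month bucket keeps the two April features of cell A together and still separates them from the May one.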

jgoizueta commented 6 years ago

Some desirable changes after talking with @davidmanzanares:

[x] TimeRangeAnimation: derive from BaseExpression with And as a child.
[ ] VariantExpression: try using setPrototypeOf instead of returning a Proxy to this
[ ] Codec isIdentity: find naming more descriptive of intent, i.e. differentiate the case of a per-property encoding and a universal one, which allows to introduce encoded values in GPU inlined expressions
[x] Check performance impact of returning arrays for internal encoded values in Codecs
[x] InParseIso compute start and end simultaneously

jgoizueta commented 6 years ago

There is some documentation now for TimeCluster, TimeRangeExpr & TimeRange in the code, and some internal documentation here:

davidmanzanares commented 6 years ago

Addressed by https://github.com/CartoDB/carto-vl/pull/991
