Implement Metadata as needed by CARTO VL

jgoizueta commented 6 years ago

CartoVL is using the SQL API to obtain metadata about the dataset/query used, including a sample of the data. CartoVL should stop using the SQL API and obtain this data from the Maps API instead, so that users do not need both Maps API and SQL API authorization keys.

At this point we want to implement quickly what cartovl needs, and eventually refactor it into something more reasonable and efficient.

We could just implement now an ad-hoc endpoint performing the exact same queries/data processing we do now in CartoVL.

Or, if we consider the effort to be similar (I'm inclined to think so), we could implement it as optional metadata returned by the map instantiation. This would offer opportunities for optimization (now or later): some of the metadata may already be computed or used, we save requests, and we could also save queries by combining the requested metadata with, e.g., the needs of getAggregationMetadata.

We could add a parameter to request metadata, e.g. "metadata": { sample: true, rowCount: true, columnStats: true } and the data could be added to the existing metadata.stats in the response. This could be nicely encapsulated in the setLayerStats function of the Windshaft-cartodb maps controller.
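
As a rough illustration of the proposal above, the sketch below shows how the requested metadata could be merged into the existing stats object. It is not the actual setLayerStats code: the helper name mergeRequestedStats, the placement of the "metadata" parameter in the layer options, and all sample values are assumptions made for the example.

```javascript
// Hypothetical helper (not part of Windshaft-cartodb): copies only the
// metadata kinds the client asked for into a copy of the existing stats.
function mergeRequestedStats(existingStats, requested, computed) {
  const stats = Object.assign({}, existingStats);
  for (const key of ['sample', 'rowCount', 'columnStats']) {
    if (requested && requested[key] === true && key in computed) {
      stats[key] = computed[key];
    }
  }
  return stats;
}

// Assumed placement of the "metadata" request parameter (layer options).
const layerOptions = {
  sql: 'SELECT * FROM my_table',
  metadata: { sample: true, rowCount: true, columnStats: false }
};

// Made-up values standing in for what the tiler would compute.
const computed = {
  sample: [{ cartodb_id: 1, price: 10 }],
  rowCount: 998,
  columnStats: { price: { min: 1, max: 50 } }
};

// estimatedFeatureCount is the stat the instantiation already returns.
const stats = mergeRequestedStats(
  { estimatedFeatureCount: 1000 },
  layerOptions.metadata,
  computed
);
console.log(JSON.stringify(stats));
```

Keeping the merge keyed on what was requested means clients that send no "metadata" parameter would get the same response as today.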

Details

What CartoVL does now

All the metadata CartoVL requests now is actually needed. The windshaft module encapsulates all SQL API requests through getSQL, which is used by the following functions, called to prepare the metadata (in _getMetadata):

getSample(conf, sampling)
- query: SELECT * TABLESAMPLE BERNOULLI / random() < x
- => metadata.sample
getFeatureCount(query, conf)
- query: SELECT COUNT(*) FROM ${query}
- => metadata.featureCount
getColumnTypes(query, conf)
- query: select * from ${query} limit 0 => .fields (name: {type})
- => metadata.columns
- => metadata.categoryIDs, categoryIDsToName
- Note that, depending on the field type, columns are categorized as numeric, date or category (all strings).
getNumericTypes(names, query, conf) (executed for numeric columns)
- query: SELECT min($name), max($name), ...FROM ${query}
- => metadata.columns (name, type, min, max, avg, sum)
- COUNT(*) computed but not used (?)
getDatesTypes(names, query, conf) (executed for date columns)
- query: SELECT min($name), max($name) FROM ${query}
- => metadata.columns (name, type, min, max)
getCategoryTypes(names, query, conf) (executed for category columns)
- query: SELECT COUNT(*), ${name} FROM ${query} GROUP BY ${name}
- => metadata.categoryIDs, categoryIDsToName
- => metadata.columns (name, type, categoryNames, categoryCounts)
getGeometryType(query, conf)
- query: SELECT ST_GeometryType(the_geom) FROM ${query}
- => windshaft.geomType

In addition to the metadata (categoryIDs, columns, featureCount, sample), geomType is kept in the windshaft object; it is used to determine whether aggregation is possible and to decode MVT.
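
The per-type queries listed above can be sketched as follows. This is an illustration only, not the CartoVL code: the function names, the mapping from field types ('number', 'date', 'string') to categories, the subquery alias and the output column aliases are all assumptions.

```javascript
// Assumed mapping from the field types of a LIMIT 0 query to the three
// categories described above; other types (e.g. geometry) are skipped.
function categorize(fieldType) {
  if (fieldType === 'number') { return 'numeric'; }
  if (fieldType === 'date') { return 'date'; }
  if (fieldType === 'string') { return 'category'; }
  return null;
}

// Quote an identifier for PostgreSQL, doubling embedded double quotes.
function quoteId(name) {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical helper: builds one stats query per numeric/date column
// and one GROUP BY query per category column.
function buildStatsQueries(fields, query) {
  const from = `FROM (${query}) AS _q`;
  const queries = [];
  for (const [name, { type }] of Object.entries(fields)) {
    const id = quoteId(name);
    const kind = categorize(type);
    if (kind === 'numeric') {
      queries.push(`SELECT min(${id}), max(${id}), avg(${id}), sum(${id}) ${from}`);
    } else if (kind === 'date') {
      queries.push(`SELECT min(${id}), max(${id}) ${from}`);
    } else if (kind === 'category') {
      queries.push(`SELECT COUNT(*), ${id} ${from} GROUP BY ${id}`);
    }
  }
  return queries;
}

// Made-up field list in the shape of the .fields result (name: {type}).
const fields = {
  the_geom: { type: 'geometry' },
  price: { type: 'number' },
  created_at: { type: 'date' },
  kind: { type: 'string' }
};
const statsQueries = buildStatsQueries(fields, 'SELECT * FROM my_table');
statsQueries.forEach(q => console.log(q));
```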

What the tiler already does at instantiation

The module query-utils of Windshaft-cartodb contains some functions to fetch metadata about the query. In particular, the function getAggregationMetadata, used to determine whether aggregation should be applied, returns:

An estimate of the row count
- using CDB_EstimateRowCount (as meta.stats.estimatedFeatureCount)
The geometry type
- using SELECT ST_GeometryType(${geom}) FROM (${query}) WHERE ${geom} IS NOT NULL LIMIT 1

When default aggregation is used (sampling) the columns of the original query are obtained with a LIMIT 0 query (in getLayerAggregationColumns) to set the columns layer parameter.

The map instantiation response contains layergroup.metadata.layers[0].meta.stats.estimatedFeatureCount (which could be extended for additional metadata). It also contains layergroup.metadata.layers[0].meta.stats.aggregation which could be used for aggregated stats at some point.

jgoizueta commented 6 years ago

The first thing to decide is whether it's a good idea to implement the metadata request in the map instantiation endpoint(s) or to add a new specific endpoint.

Then we can start with the implementation; other details can be decided later, after some experimentation:

Categories: CartoVL now requests all individual values, with counts, for any string column that appears in the query [FIXME: which appears in the query or which is used in viz?]. This would be expensive in terms of data transfer for columns with many distinct values, such as ids in a large table.
- We could alleviate this by returning only the top N categories, or by not returning individual values for columns that have many distinct values according to PG stats. But this has a problem: if a filter column='value' is used on such a column, it may not work (in the client) if the particular value is not in the metadata stats (because string values are mapped to floats for execution in WebGL). For simple cases the filter will be executed in the server, but not in general (e.g. if combined with AND with an expression not executable in the db).
We can use estimated or actual counts for the total row count, or we can leave it to the client to request one or the other.
Per type statistics (min, max for dates, min, max, avg, sum for numbers, etc.) can be automatically decided by the server or requested by the client.
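
To make the category problem described above concrete, here is a small sketch of why a truncated category list breaks client-side filters. It assumes a simple name-to-index mapping; the function names and category values are invented for the example and do not reproduce the CartoVL implementation.

```javascript
// Assumed scheme: each category name known from the metadata gets a
// numeric ID, so string comparisons can run as number comparisons in WebGL.
function buildCategoryIds(categoryNames) {
  const ids = new Map();
  categoryNames.forEach((name, index) => ids.set(name, index));
  return ids;
}

// Returns the numeric ID for a filter value, or null when the value is
// not in the metadata and therefore cannot be expressed in the client.
function encodeFilterValue(ids, value) {
  return ids.has(value) ? ids.get(value) : null;
}

// Suppose the server returned only the top 3 categories of a column.
const topCategories = ['bar', 'cafe', 'shop'];
const categoryIds = buildCategoryIds(topCategories);

const knownValue = encodeFilterValue(categoryIds, 'cafe');
const missingValue = encodeFilterValue(categoryIds, 'museum');

console.log(knownValue, missingValue);
```

A filter on 'museum' has no numeric counterpart here, which is the failure case a top-N cutoff would introduce.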

jgoizueta commented 6 years ago

Experimental map instantiation metadata is now available in #952

But there's a problem with returning metadata at instantiation, given how we use it now in the client: Carto VL needs the metadata before instantiating, because it determines these two details of the instantiation:

The number of features in the (unaggregated) data source is used to decide whether to use backend filters
The column type is used to cast date columns to text (so they survive through MVT)

Possible solutions

A. Add a separate Maps API endpoint to only get the data without instantiation
B. Add conditional filtering (based on num. rows) and casting to the Maps API (or conditional variant/conditional query: one for small datasets, another for larger ones)
C. Re-instantiate the map if date columns are present or size is not large enough for filters
D. Make use of metadata for instantiation unnecessary:
- date columns could be identified by the user, by requiring them to always be used inside date-specific functions (day, hour, timebuckets...); this would also allow us to use a future revision of the dimensions API and solve the current problem with time aggregation.
- backend filter could always be applied

Note that A and B are modifications of the Maps API, while C and D involve only Carto VL changes.

jgoizueta commented 6 years ago

Since MVT does not support date/time types, it would be nice to be able to cast those types into something (text strings or epoch numbers) that can be transferred in the MVT. This could be done in several ways:

Automatically for all time/date columns in the tiler: when generating MVTs we detect column types with a LIMIT 0 query and wrap the query with explicit column selection.
On demand: the Maps API would include an option to request column casts and would alter the query as in the previous case.
Automatically in the MVT Mapnik plugin (and in ST_ASMVT)
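
The query wrapping mentioned in the first two options could look roughly like the sketch below. It assumes the column types are already known (e.g. from a LIMIT 0 query); the function name, the 'date' type label, the castTo parameter and the subquery alias are assumptions, not existing tiler code.

```javascript
// Quote an identifier for PostgreSQL, doubling embedded double quotes.
function quoteId(name) {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Hypothetical helper: wraps the query with an explicit column list,
// casting date columns to text (default) or to epoch seconds.
function wrapQueryWithDateCasts(query, fields, castTo = 'text') {
  const columns = Object.entries(fields).map(([name, { type }]) => {
    const id = quoteId(name);
    if (type !== 'date') {
      return id; // non-date columns pass through unchanged
    }
    return castTo === 'epoch'
      ? `date_part('epoch', ${id}) AS ${id}`
      : `${id}::text AS ${id}`;
  });
  return `SELECT ${columns.join(', ')} FROM (${query}) AS _wrapped`;
}

// Made-up column types in the shape of a .fields result.
const fields = {
  the_geom: { type: 'geometry' },
  created_at: { type: 'date' }
};

const asText = wrapQueryWithDateCasts('SELECT * FROM my_table', fields);
const asEpoch = wrapQueryWithDateCasts('SELECT * FROM my_table', fields, 'epoch');
console.log(asText);
console.log(asEpoch);
```

The epoch variant yields numbers rather than strings, which would be more compact in the tile than text timestamps.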

@Algunenano has mentioned that Mapnik doesn't currently support time/date for styling, and implementing the automatic casting at the plugin level would not only make those types available in MVTs, but would also allow them to be used to style raster tiles.

davidmanzanares commented 6 years ago

I would like a flavor of D.

Regarding the timestamp management, I think it would be best if Maps API automatically cast it to a usable form. Ideally, it would be compressed in some way (no strings).

Regarding filters, I would move the conditional logic to Maps API. I wouldn't apply filters every time, since we saw this is overkill for most maps (small and medium datasets): with backend filters applied they can't be refiltered instantly with just client-side logic, and their MVT sizes would be small even without filtering.

Basically, I think Maps API should return an instantiated map and a flag saying if filters were applied or not (similar to aggregation). When the filters change in the client, CARTO VL should re-instantiate if the flag indicates that Maps API filtered in the last instantiation.
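
A minimal sketch of that flow follows, with the server side simulated in-process. Everything named here is hypothetical: the filtersApplied flag, the row-count threshold and the function names are assumptions used only to show the decision logic.

```javascript
// Server side (simulated): apply backend filters only to large datasets.
// The threshold value is invented for the example.
function instantiateMap(featureCount, filter, threshold = 100000) {
  const filtersApplied = filter !== null && featureCount > threshold;
  return { filtersApplied };
}

// Client side: after a filter change, re-instantiate only if the last
// instantiation was filtered in the backend; otherwise the tiles already
// hold all features and can be refiltered locally.
function needsReinstantiation(lastResponse, previousFilter, newFilter) {
  if (previousFilter === newFilter) {
    return false;
  }
  return lastResponse.filtersApplied === true;
}

const smallMap = instantiateMap(5000, "kind = 'cafe'");
const largeMap = instantiateMap(2000000, "kind = 'cafe'");

console.log(needsReinstantiation(smallMap, "kind = 'cafe'", "kind = 'bar'"));
console.log(needsReinstantiation(largeMap, "kind = 'cafe'", "kind = 'bar'"));
```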

Jesus89 commented 6 years ago

I'm reopening this so it can be closed after deployment.

Jesus89 commented 6 years ago

Closing this.

CartoDB / Windshaft-cartodb