Improve tile rendering time by clustering during import process

AbelVM commented 7 years ago

While working on https://github.com/CartoDB/QuadGrid, which tries to solve a problem closely related to the query CARTO uses to fetch the data to render tiles, I benchmarked different approaches and found that just reshuffling the data so that it is clustered based on the_geom_webmercator leads to a performance improvement, typically between 15% and 25%.

The code used is https://github.com/CartoDB/QuadGrid/blob/master/sql/CDB_Quadgrid_recursive_R2.sql#L11-L22

But for the import process, it could be simplified to just:

CLUSTER mytable USING mytable_the_geom_webmercator_idx;

The CLUSTER command takes about 9s for a 400K-point dataset, and for QuadGrid it lowers the processing time for that specific dataset from 4:15 min to 48s (9s clustering plus 39s processing). That's an ~85% improvement in processing time, or ~81% if the clustering time is counted too!
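The percentages follow from simple arithmetic on the timings quoted above. A minimal sketch, with an illustrative function name:

```python
def reduction(before_s: float, after_s: float) -> float:
    """Fractional reduction in elapsed time."""
    return (before_s - after_s) / before_s

baseline = 4 * 60 + 15   # 4:15 min without clustering, in seconds
processing = 39          # QuadGrid processing on the clustered table
clustering = 9           # one-off cost of the CLUSTER command

processing_only = reduction(baseline, processing)
end_to_end = reduction(baseline, processing + clustering)

print(f"processing only:   {processing_only:.1%}")
print(f"including CLUSTER: {end_to_end:.1%}")
```

The first figure compares processing time alone; the second also charges the clustering step to the clustered run.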

cc @javitonino @rochoa

AbelVM commented 7 years ago

cc @jgoizueta because of the mention in Slack :grin:

AbelVM commented 7 years ago

Some simple tests with a sample table (1497711 points ~ 1.5M)

Using simplified tile checking as described here:

select count(*) from (
    SELECT g.the_geom_webmercator, g.col1, g.id
    FROM (
      SELECT the_geom_webmercator, clientid as col1, cartodb_id as id
      FROM sample
    ) g
) wrapped
where CDB_XYZ_Extent(x, y, z) && the_geom_webmercator
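
For reference, the tile extent that this filter intersects against can be approximated outside the database. The sketch below assumes that CDB_XYZ_Extent follows the usual XYZ ("slippy map") scheme in Web Mercator, with tile (0, 0) at the top-left corner. The names `xyz_extent` and `in_world` are illustrative and are not part of cartodb-postgresql.

```python
import math
from typing import Tuple

# Half the side of the Web Mercator world square, in metres (EPSG:3857).
HALF_WORLD = math.pi * 6378137.0

def xyz_extent(x: int, y: int, z: int) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) of an XYZ tile in Web Mercator."""
    size = 2 * HALF_WORLD / (2 ** z)   # tile side length at zoom z
    xmin = -HALF_WORLD + x * size
    ymax = HALF_WORLD - y * size       # y grows downwards in the XYZ scheme
    return (xmin, ymax - size, xmin + size, ymax)

def in_world(x: int, y: int, z: int) -> bool:
    """True when the tile indices fall inside the 2^z by 2^z grid."""
    return 0 <= x < 2 ** z and 0 <= y < 2 ** z

print(xyz_extent(0, 0, 0))
print(in_world(2, 1, 1))
```

Under that assumption, only indices 0 to 2^z - 1 are valid at zoom z, so x = 2 at z = 1 maps to an extent beyond the eastern edge of the world square. A renderer that wraps tile indices could still draw data for such a tile even though the bounding-box filter matches nothing.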

I selected tiles of interest in the map view of the dataset and compared different approaches (time 1 to time 4 in the table below, in this order):

1. as is
2. clustered by the_geom index
3. clustered by the_geom_webmercator index
4. adding a geohash B-Tree index and clustering by it

x	y	z	count	time 1	time 2	time 3	time 4
0	0	0	1497711	0.571s	0.6s	0.711s	0.582s
2	1	1	0	0.002s	0.002s	0.002s	0.003s
4	2	3	256689	0.608s	0.576s	0.522s	0.539s
3	3	3	896729	0.52s	0.617s	0.555s	0.548s
31	24	6	632158	0.515s	0.592s	0.543s	0.556s
62	48	7	284763	0.67s	0.566s	0.523s	0.533s
64	48	7	43591	0.335s	0.033s	0.025s	0.021s
4011	3088	13	15209	0.032s	0.015s	0.016s	0.012s

Some conclusions:

There's an obvious performance improvement at a low cost
The improvement is bigger for higher zoom levels
Tiles where the data is not evenly spread, like (64, 48, 7), can see 10x improvements, so the improvement is also related to the spatial distribution of the data.
Clustering by the_geom_webmercator gives a performance gain similar to clustering by st_geohash(the_geom), but we save disk space because we don't need to create an extra index
Clustering by the_geom also helps, but reproducing the comparison with other datasets shows that clustering by webmercator gives better performance at higher zooms
Weird result: tile (2, 1, 1) renders points, but the query says there are no points in that specific tile
As of today, SP-GiST is not supported for geometry columns, and my guess is that clustering by this index would be a leap in terms of performance for geospatial operations.
GiST is not supported on text columns, so I couldn't test it with geohash
SP-GiST supports text columns, but the resulting index can't be clustered
Clustering this way could lead to a visible enhancement in any query with spatial scan, while the time cost of clustering is in the order of seconds and there are no (AFAIK) negative implications of it
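
To make the per-tile gains easier to compare, the timings in the table above can be turned into speedup ratios. This sketch copies a few rows from the table and assumes that the columns time 1 to time 4 correspond, in order, to the four approaches listed before it.

```python
# (x, y, z) -> (time 1, time 2, time 3, time 4) in seconds, copied from the table.
timings = {
    (0, 0, 0): (0.571, 0.6, 0.711, 0.582),
    (62, 48, 7): (0.67, 0.566, 0.523, 0.533),
    (64, 48, 7): (0.335, 0.033, 0.025, 0.021),
    (4011, 3088, 13): (0.032, 0.015, 0.016, 0.012),
}

def speedups(row):
    """Speedup of each clustered variant relative to the unclustered baseline."""
    baseline, *clustered = row
    return [baseline / t for t in clustered]

for tile, row in timings.items():
    ratios = ", ".join(f"{s:.1f}x" for s in speedups(row))
    print(tile, ratios)
```

For tile (64, 48, 7) the ratios come out at roughly 10x to 16x, while at zoom 0, where every row is read anyway, none of the clustered variants beats the baseline.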

AbelVM commented 7 years ago

cc @oriolbx

AbelVM commented 7 years ago

SIDE NOTE: Clusters get degraded (fragmented) over time if the user performs write operations on the dataset, so the databases should be re-clustered periodically as a maintenance task (à la VACUUM)
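
As a sketch of what such a maintenance step could look like, the helper below only builds the SQL text; running it is left to whatever scheduler or client is in use. It relies on PostgreSQL remembering the index a table was last clustered on, so a bare CLUSTER on the table reuses it, and it follows up with ANALYZE to refresh planner statistics. The function names are illustrative.

```python
from typing import List, Optional

def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def recluster_statements(table: str, index: Optional[str] = None) -> List[str]:
    """Build the statements to (re)cluster a table and refresh its statistics."""
    if index is None:
        # Reuses the index recorded by an earlier CLUSTER ... USING ...
        cluster = f"CLUSTER {quote_ident(table)};"
    else:
        cluster = f"CLUSTER {quote_ident(table)} USING {quote_ident(index)};"
    return [cluster, f"ANALYZE {quote_ident(table)};"]

for stmt in recluster_statements("mytable", "mytable_the_geom_webmercator_idx"):
    print(stmt)
```

Note that CLUSTER holds an ACCESS EXCLUSIVE lock on the table while it runs, so a task like this would need to be scheduled with that in mind.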

AbelVM commented 7 years ago

cc @inigomedina for awareness

CartoDB / cartodb-postgresql

Improve tile rendering time by clustering during import process #313