Open AbelVM opened 7 years ago
cc @jgoizueta because of the mention in Slack :grin:
Some simple tests with a sample
table (1497711 points ~ 1.5M)
Using simplified tile checking as described here:
select count(*) from (
SELECT g.the_geom_webmercator, g.col1, g.id
FROM (
SELECT the_geom_webmercator, clientid as col1, cartodb_id as id
FROM sample
) g
) wrapped
where CDB_XYZ_Extent(x, y, z) && the_geom_webmercator
Selecting tiles of interest in the map view of the dataset, and different approaches:
the_geom
indexthe_geom_webmercator
indeximg | x | y | z | count | time 1 | time 2 | time 3 | time 4 | time 5 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1497711 | 0.571s | 0.6s | 0.711s | 0.582s | ||
2 | 1 | 1 | 0 | 0.002s | 0.002s | 0.002s | 0.003s | ||
4 | 2 | 3 | 256689 | 0.608s | 0.576s | 0.522s | 0.539s | ||
3 | 3 | 3 | 896729 | 0.52s | 0.617s | 0.555s | 0.548s | ||
31 | 24 | 6 | 632158 | 0.515s | 0.592s | 0.543s | 0.556s | ||
62 | 48 | 7 | 284763 | 0.67s | 0.566s | 0.523s | 0.533s | ||
64 | 48 | 7 | 43591 | 0.335s | 0.033s | 0.025s | 0.021s | ||
4011 | 3088 | 13 | 15209 | 0.032s | 0.015s | 0.016s | 0.012s |
Some conclusions:
the_geom_webmercator
implies a performance gain similar as clustering by st_geohash(the_geom)
, but we save disk space because we don't need to create an extra indexthe_geom
also helps, but reproducing the comparison with other datasets shows that clustering by webmercator gives better performance in higher zoomscc @oriolbx
SIDE NOTE:
Clusters get degraded (fragmented) over time if the user performs write operations on the dataset, so the databases should be clustered as a maintenance task (ala VACUUM
)
cc @inigomedina for awareness
While working on https://github.com/CartoDB/QuadGrid, which tries to solve a problem quite related to the query used by CARTO to fetch the data to render tiles. Benchmarking different approaches, I've found that just reshuffling the data to be clustered base on the_geom_webmercator leads to a performance improvement typically between 15% and 25%
The code used is https://github.com/CartoDB/QuadGrid/blob/master/sql/CDB_Quadgrid_recursive_R2.sql#L11-L22
But for import process, could be simplified just as
CLUSTER
function takes like 9s for a 400K points dataset, and for QuadGrid it lowers the processing time for that specific dataset from 4:15 min to 48s (so, 9s clustering and 39s processing). That's ~ 85% improvement!cc @javitonino @rochoa