Apparently the function is duplicated in crankshaft, so either the extension is set up as a dependency or any improvements need to be applied in both places.
I tried replacing the loops with a query using generate_series (because I had a very similar query around, to see if that was an easy fix) and it still times out 😞
This is the function I checked:
CREATE OR REPLACE FUNCTION CDB_RectangleGrid2(ext GEOMETRY, width FLOAT8, height FLOAT8, origin GEOMETRY DEFAULT NULL)
RETURNS TABLE(the_geom GEOMETRY)
AS $$
  WITH values AS (
    SELECT
      xoff, yoff,
      hw, hh,
      hstep, vstep,
      hw xgrd, hh ygrd,  -- unused in this variant
      -- first cell center at or after the extent minimum, snapped to the grid origin
      xoff + ceil((ST_XMin(ext) - xoff) / hstep) * hstep AS hstart,
      yoff + ceil((ST_YMin(ext) - yoff) / vstep) * vstep AS vstart,
      ST_XMax(ext) hend,
      ST_YMax(ext) vend,
      ST_SRID(ext) srid
    FROM (
      SELECT
        -- grid origin: (0, 0) unless an explicit origin point is given
        CASE WHEN origin IS NULL THEN 0.0 ELSE ST_X(origin) END xoff,
        CASE WHEN origin IS NULL THEN 0.0 ELSE ST_Y(origin) END yoff,
        width / 2.0 hw,    -- half cell width
        height / 2.0 hh,   -- half cell height
        width AS hstep,    -- horizontal step between cell centers
        height AS vstep    -- vertical step between cell centers
    ) aux
  )
  -- one envelope per (x, y) cell center produced by the two series
  SELECT
    ST_MakeEnvelope(x - hw, y - hh, x + hw, y + hh, srid) the_geom
  FROM values,
       generate_series(hstart::numeric, hend::numeric, hstep::numeric) AS x,
       generate_series(vstart::numeric, vend::numeric, vstep::numeric) AS y
$$ LANGUAGE SQL IMMUTABLE;
I think a good way to approach this would be to use a bigger zoom (e.g. zoom level 8), which will generate fewer geometries, and use pgbench to measure it.
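For instance, a minimal pgbench script could look like the sketch below. The file name, the tile coordinates and the 13px cell size are only illustrative (they mirror the sample query further down), and you'd swap in CDB_RectangleGrid2 or any other candidate to compare implementations:

-- grid_bench.sql (hypothetical file name), run e.g. with:
--   pgbench -n -f grid_bench.sql -T 60 <database>
-- Times grid generation alone for a single zoom-8 tile.
SELECT count(*)
FROM CDB_RectangleGrid(
       CDB_XYZ_Extent(128, 128, 8),
       CDB_XYZ_Resolution(8) * 13,
       CDB_XYZ_Resolution(8) * 13
     ) AS the_geom;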
Also, we should keep in mind that there may be other ways to generate the grid view that don't involve intersecting the table with every possible cell geometry for a given zoom level.
Here is a sample query where this is used:
SELECT
  ST_AsBinary(ST_Simplify(ST_SnapToGrid("the_geom_webmercator", 978.394), 1956.79)) AS geom,
  "agg_value"
FROM (
  WITH hgrid AS (
    SELECT CDB_RectangleGrid(
      ST_Expand(
        ST_SetSRID('BOX3D(7514065.628545964 -2504688.542848656,20037508.3 12523442.71424328)'::box3d, 3857),
        CDB_XYZ_Resolution(2) * 13
      ),
      CDB_XYZ_Resolution(2) * 13,
      CDB_XYZ_Resolution(2) * 13
    ) AS cell
  )
  SELECT
    hgrid.cell AS the_geom_webmercator,
    count(1) AS agg_value,
    count(1) / power(10 * CDB_XYZ_Resolution(2), 2) AS agg_value_density,
    row_number() OVER () AS cartodb_id
  FROM
    hgrid,
    (SELECT * FROM benchmark_c7a5b45027138dabb7b080466b512cb6a7c9826445b3a1b57cb4a) i
  WHERE ST_Intersects(i.the_geom_webmercator, hgrid.cell)
  GROUP BY hgrid.cell
) AS cdbq
I think the way to go is to change how this is achieved instead of using CDB_RectangleGrid to generate all the tiles; I see some issues with the current approach.
After talking with @jgoizueta, I think the idea of using the aggregation API is interesting, both for the rectangle and the hexagon grid. It'd need to support returning geometries (full squares or hexagons) and the parameters currently used for aggregation in Builder and Editor (zoom / px / meters).
So, thinking about a "gridless" way to do this, one idea that comes to mind is a CDB_RectangleCells(geometry, zlevel) that returns an array of all the tiles at that zoom level that intersect the input geometry. Potentially they could then be un-nested, grouped and summarized in parallel (not sure the unnesting wouldn't break the parallelism though).
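A rough sketch of what that could look like (everything here is hypothetical: the function name follows the proposal above, full 256px webmercator tiles are assumed as cells, and only the bounding box of the input is considered):

CREATE OR REPLACE FUNCTION CDB_RectangleCells(g geometry, zlevel integer)
RETURNS geometry[]
AS $$
  WITH p AS (
    SELECT 256 * CDB_XYZ_Resolution(zlevel) AS side,  -- tile side in meters
           -20037508.342789244::float8 AS x0,         -- webmercator top-left corner
            20037508.342789244::float8 AS y0
  )
  SELECT array_agg(ST_MakeEnvelope(x0 +  tx      * side,
                                   y0 - (ty + 1) * side,
                                   x0 + (tx + 1) * side,
                                   y0 -  ty      * side,
                                   3857))
  FROM p,
       generate_series(floor((ST_XMin(g) - x0) / side)::int,
                       floor((ST_XMax(g) - x0) / side)::int) AS tx,
       generate_series(floor((y0 - ST_YMax(g)) / side)::int,
                       floor((y0 - ST_YMin(g)) / side)::int) AS ty;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- Possible usage: unnest the cells, then group and summarize
-- SELECT cell, count(*)
-- FROM some_table, unnest(CDB_RectangleCells(the_geom_webmercator, 8)) AS cell
-- GROUP BY cell;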
What are your ideas on a way to get a parallel execution while still generating the desired result?
My initial idea was the same, but instead of returning an array of geometries, having another parameter (mode) and returning only one geometry, i.e. CDB_RectangleCells(geometry, mode, zlevel), with mode being either fast (any/first point of the geometry) or median (using the geometric median of all points). I reckon this means that a geometry would only be assigned to one grid cell, which isn't great, but it works for our use case since we only use these functions with point datasets (not sure if this is guaranteed or we just check the first N geometries).
Now it comes to mind that, to reduce overhead, it could return the (X, Y) tile coordinates instead of the geometry, which could then be computed in the final step (aka CDB_TileToGeom(X, Y, Z)).
The function itself is IMMUTABLE (as long as the input and the tiles use the same SRID), so it's PARALLEL safe too. It also allows the data to be read only once and to be easily filtered with a bbox, so if the per-row calculation is fast (which depends on the mode used) it should beat the current approach any day of the week. The GROUP BY is also parallelizable, but I don't think that has too big an impact in this case.
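A minimal sketch of that last idea for a points table, assuming webmercator, full 256px tiles as cells and zoom 8. CDB_TileToGeom is the hypothetical helper named above and my_points_table is a placeholder; the actual Builder cells are CDB_XYZ_Resolution(z) * 13 wide, so the 256 constant would change accordingly:

-- Hypothetical helper: rebuild a cell geometry from its (x, y, z) tile
-- coordinates (webmercator, y growing downwards from the top-left corner).
CREATE OR REPLACE FUNCTION CDB_TileToGeom(x integer, y integer, z integer)
RETURNS geometry
AS $$
  SELECT ST_MakeEnvelope(-20037508.342789244 +  x      * side,
                          20037508.342789244 - (y + 1) * side,
                         -20037508.342789244 + (x + 1) * side,
                          20037508.342789244 -  y      * side,
                         3857)
  FROM (SELECT 256 * CDB_XYZ_Resolution(z) AS side) s;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- "fast" mode for points: each point maps to exactly one cell, so group by
-- the integer tile coordinates and build the geometry only once per group.
-- A bbox filter (e.g. WHERE the_geom_webmercator && <tile extent>) can be
-- pushed into the inner query, and the scan/aggregate can run in parallel.
SELECT CDB_TileToGeom(tile_x, tile_y, 8) AS the_geom_webmercator,
       count(*) AS agg_value
FROM (
  SELECT floor((ST_X(the_geom_webmercator) + 20037508.342789244)
               / (256 * CDB_XYZ_Resolution(8)))::int AS tile_x,
         floor((20037508.342789244 - ST_Y(the_geom_webmercator))
               / (256 * CDB_XYZ_Resolution(8)))::int AS tile_y
  FROM my_points_table  -- placeholder table
) tiles
GROUP BY tile_x, tile_y;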
The exact same approach could be taken for both squares and hexagons, but I guess the intersection test for hexagons might not be as simple.
For hexagons I think we need a definition of a standard hexagon tiling of our space, so that it's possible to say "level 4, hexagon 34,52" and know exactly what area we're talking about, every time. The Uber hexagon coverage (H3) could be a place to work from, or something less sophisticated in planar/mercator space.
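As a strawman for the "less sophisticated planar/mercator" option, here's a sketch of an addressable hexagon index: flat-top hexagons of a given circumradius, addressed by axial coordinates (q, r) via the usual cube-rounding trick. The function name, the hexsize parameter and the coordinate convention are all made up for illustration (this is not H3):

CREATE OR REPLACE FUNCTION CDB_HexagonIndex(g geometry, hexsize float8,
                                            OUT q integer, OUT r integer)
AS $$
  WITH frac AS (
    -- fractional axial coordinates of the point (flat-top orientation)
    SELECT (2.0 / 3.0 * ST_X(g)) / hexsize                              AS fq,
           (-1.0 / 3.0 * ST_X(g) + sqrt(3.0) / 3.0 * ST_Y(g)) / hexsize AS fr
  ), cube AS (
    -- cube coordinates and their rounded versions
    SELECT fq AS cx, -fq - fr AS cy, fr AS cz,
           round(fq) AS rx, round(-fq - fr) AS ry, round(fr) AS rz
    FROM frac
  )
  -- cube rounding: recompute the component with the largest rounding error
  SELECT CASE WHEN abs(rx - cx) > abs(ry - cy) AND abs(rx - cx) > abs(rz - cz)
              THEN (-ry - rz)::int ELSE rx::int END,
         CASE WHEN abs(rx - cx) > abs(ry - cy) AND abs(rx - cx) > abs(rz - cz)
              THEN rz::int
              WHEN abs(ry - cy) > abs(rz - cz)
              THEN rz::int
              ELSE (-rx - ry)::int END
  FROM cube;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- With something like this, "level 4, hexagon 34,52" just means picking a
-- hexsize per level and addressing cells by (q, r); points could then be
-- grouped by (q, r) exactly as in the rectangle case.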
Huh, I wonder what we do now for non-point grid/hex maps. Will have to take a look.
Depending on the zoom level, CDB_RectangleGrid can be extremely slow. The query was generated as part of the map preview, and the full query looks something like the sample query shown earlier.