Apparently the function is duplicated in crankshaft, so either the extension is set up as a dependency or any improvements need to be applied in both places.
I tried replacing the loops with a query using generate_series (because I had a very similar query around, to see if that was an easy fix) and it still times out 😞
This is the function I checked:
CREATE OR REPLACE FUNCTION CDB_RectangleGrid2(ext GEOMETRY, width FLOAT8, height FLOAT8, origin GEOMETRY DEFAULT NULL)
RETURNS TABLE(the_geom GEOMETRY)
AS $$
  WITH values AS (
    SELECT
      xoff, yoff,
      hw, hh,
      hstep, vstep,
      hw xgrd, hh ygrd,  -- unused in this variant
      -- first cell center at or after the extent minimum, snapped to the grid origin
      xoff + ceil((ST_XMin(ext) - xoff) / hstep) * hstep AS hstart,
      yoff + ceil((ST_YMin(ext) - yoff) / vstep) * vstep AS vstart,
      ST_XMax(ext) hend,
      ST_YMax(ext) vend,
      ST_SRID(ext) srid
    FROM (
      SELECT
        -- grid origin: (0, 0) unless an explicit origin point is given
        CASE WHEN origin IS NULL THEN 0.0 ELSE ST_X(origin) END xoff,
        CASE WHEN origin IS NULL THEN 0.0 ELSE ST_Y(origin) END yoff,
        width / 2.0 hw,    -- half cell width
        height / 2.0 hh,   -- half cell height
        width AS hstep,    -- horizontal step between cell centers
        height AS vstep    -- vertical step between cell centers
    ) aux
  )
  -- one envelope per (x, y) cell center produced by the two series
  SELECT
    ST_MakeEnvelope(x - hw, y - hh, x + hw, y + hh, srid) the_geom
  FROM values,
       generate_series(hstart::numeric, hend::numeric, hstep::numeric) AS x,
       generate_series(vstart::numeric, vend::numeric, vstep::numeric) AS y
$$ LANGUAGE SQL IMMUTABLE;
I think a good way to approach this would be to use a bigger zoom (e.g. zoom level 8), which will generate fewer geometries, and use pgbench to measure it.
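For instance, a minimal pgbench script could look like the sketch below. The file name, the tile coordinates and the 13px cell size are only illustrative (they mirror the sample query further down), and you'd swap in CDB_RectangleGrid2 or any other candidate to compare implementations:

-- grid_bench.sql (hypothetical file name), run e.g. with:
--   pgbench -n -f grid_bench.sql -T 60 <database>
-- Times grid generation alone for a single zoom-8 tile.
SELECT count(*)
FROM CDB_RectangleGrid(
       CDB_XYZ_Extent(128, 128, 8),
       CDB_XYZ_Resolution(8) * 13,
       CDB_XYZ_Resolution(8) * 13
     ) AS the_geom;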
Also, we should keep in mind that there may be other ways to generate the grid view that don't involve intersecting the table with every possible cell geometry for a given zoom level.
Here is a sample query where this is used:
SELECT
  ST_AsBinary(ST_Simplify(ST_SnapToGrid("the_geom_webmercator", 978.394), 1956.79)) AS geom,
  "agg_value"
FROM (
  WITH hgrid AS (
    SELECT CDB_RectangleGrid(
      ST_Expand(
        ST_SetSRID('BOX3D(7514065.628545964 -2504688.542848656,20037508.3 12523442.71424328)'::box3d, 3857),
        CDB_XYZ_Resolution(2) * 13
      ),
      CDB_XYZ_Resolution(2) * 13,
      CDB_XYZ_Resolution(2) * 13
    ) AS cell
  )
  SELECT
    hgrid.cell AS the_geom_webmercator,
    count(1) AS agg_value,
    count(1) / power(10 * CDB_XYZ_Resolution(2), 2) AS agg_value_density,
    row_number() OVER () AS cartodb_id
  FROM
    hgrid,
    (SELECT * FROM benchmark_c7a5b45027138dabb7b080466b512cb6a7c9826445b3a1b57cb4a) i
  WHERE ST_Intersects(i.the_geom_webmercator, hgrid.cell)
  GROUP BY hgrid.cell
) AS cdbq
I think the way to go is to change how this is achieved instead of using CDB_RectangleGrid to generate all the tiles; I see some issues with the current approach.
After talking with @jgoizueta, I think the idea of using the aggregation API is interesting, both for the rectangle and the hexagon grid. It'd need to support returning geometries (full squares or hexagons) and the parameters currently used for aggregation in Builder and Editor (zoom / px / meters).
So, thinking about a "gridless" way to do this, one idea that comes to mind is a CDB_RectangleCells(geometry, zlevel) that returns an array of all the tiles at that zoom level that intersect the input geometry. Potentially they could then be un-nested, grouped and summarized in parallel (not sure the unnesting wouldn't break the parallelism though).
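A rough sketch of what that could look like (everything here is hypothetical: the function name follows the proposal above, full 256px webmercator tiles are assumed as cells, and only the bounding box of the input is considered):

CREATE OR REPLACE FUNCTION CDB_RectangleCells(g geometry, zlevel integer)
RETURNS geometry[]
AS $$
  WITH p AS (
    SELECT 256 * CDB_XYZ_Resolution(zlevel) AS side,  -- tile side in meters
           -20037508.342789244::float8 AS x0,         -- webmercator top-left corner
            20037508.342789244::float8 AS y0
  )
  SELECT array_agg(ST_MakeEnvelope(x0 +  tx      * side,
                                   y0 - (ty + 1) * side,
                                   x0 + (tx + 1) * side,
                                   y0 -  ty      * side,
                                   3857))
  FROM p,
       generate_series(floor((ST_XMin(g) - x0) / side)::int,
                       floor((ST_XMax(g) - x0) / side)::int) AS tx,
       generate_series(floor((y0 - ST_YMax(g)) / side)::int,
                       floor((y0 - ST_YMin(g)) / side)::int) AS ty;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- Possible usage: unnest the cells, then group and summarize
-- SELECT cell, count(*)
-- FROM some_table, unnest(CDB_RectangleCells(the_geom_webmercator, 8)) AS cell
-- GROUP BY cell;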
What are your ideas on a way to get a parallel execution while still generating the desired result?
My initial idea was the same, but instead of returning an array of geometries, having another parameter (mode) and returning only one geometry, i.e. CDB_RectangleCells(geometry, mode, zlevel), with mode being either fast (any/first point of the geometry) or median (using the geometric median of all points). I reckon this means that a geometry would only be assigned to one grid cell, which isn't great, but it works for our use case since we only use these functions with point datasets (not sure if this is guaranteed or we just check the first N geometries).
Now it comes to mind that, to reduce overhead, it could return the (X, Y) tile coordinates instead of the geometry, which could then be computed in the final step (aka CDB_TileToGeom(X, Y, Z)).
The function itself is IMMUTABLE (as long as the input and the tiles use the same SRID), so it's PARALLEL safe too. It also allows the data to be read only once and to be easily filtered with a bbox, so if the per-row calculation is fast (which depends on the mode used) it should beat the current approach any day of the week. The GROUP BY is also parallelizable, but I don't think that has too big an impact in this case.
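A minimal sketch of that last idea for a points table, assuming webmercator, full 256px tiles as cells and zoom 8. CDB_TileToGeom is the hypothetical helper named above and my_points_table is a placeholder; the actual Builder cells are CDB_XYZ_Resolution(z) * 13 wide, so the 256 constant would change accordingly:

-- Hypothetical helper: rebuild a cell geometry from its (x, y, z) tile
-- coordinates (webmercator, y growing downwards from the top-left corner).
CREATE OR REPLACE FUNCTION CDB_TileToGeom(x integer, y integer, z integer)
RETURNS geometry
AS $$
  SELECT ST_MakeEnvelope(-20037508.342789244 +  x      * side,
                          20037508.342789244 - (y + 1) * side,
                         -20037508.342789244 + (x + 1) * side,
                          20037508.342789244 -  y      * side,
                         3857)
  FROM (SELECT 256 * CDB_XYZ_Resolution(z) AS side) s;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- "fast" mode for points: each point maps to exactly one cell, so group by
-- the integer tile coordinates and build the geometry only once per group.
-- A bbox filter (e.g. WHERE the_geom_webmercator && <tile extent>) can be
-- pushed into the inner query, and the scan/aggregate can run in parallel.
SELECT CDB_TileToGeom(tile_x, tile_y, 8) AS the_geom_webmercator,
       count(*) AS agg_value
FROM (
  SELECT floor((ST_X(the_geom_webmercator) + 20037508.342789244)
               / (256 * CDB_XYZ_Resolution(8)))::int AS tile_x,
         floor((20037508.342789244 - ST_Y(the_geom_webmercator))
               / (256 * CDB_XYZ_Resolution(8)))::int AS tile_y
  FROM my_points_table  -- placeholder table
) tiles
GROUP BY tile_x, tile_y;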
The exact same approach could be taken for both squares and hexagons, but I guess the intersection test for hexagons might not be as simple.
For hexagons I think we need a definition of a standard hexagon tiling of our space, so that it's possible to say "level 4, hexagon 34,52" and know exactly what area we're talking about, every time. The Uber hexagon coverage (H3) could be a place to work from, or something less sophisticated in planar/mercator space.
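As a strawman for the "less sophisticated planar/mercator" option, here's a sketch of an addressable hexagon index: flat-top hexagons of a given circumradius, addressed by axial coordinates (q, r) via the usual cube-rounding trick. The function name, the hexsize parameter and the coordinate convention are all made up for illustration (this is not H3):

CREATE OR REPLACE FUNCTION CDB_HexagonIndex(g geometry, hexsize float8,
                                            OUT q integer, OUT r integer)
AS $$
  WITH frac AS (
    -- fractional axial coordinates of the point (flat-top orientation)
    SELECT (2.0 / 3.0 * ST_X(g)) / hexsize                              AS fq,
           (-1.0 / 3.0 * ST_X(g) + sqrt(3.0) / 3.0 * ST_Y(g)) / hexsize AS fr
  ), cube AS (
    -- cube coordinates and their rounded versions
    SELECT fq AS cx, -fq - fr AS cy, fr AS cz,
           round(fq) AS rx, round(-fq - fr) AS ry, round(fr) AS rz
    FROM frac
  )
  -- cube rounding: recompute the component with the largest rounding error
  SELECT CASE WHEN abs(rx - cx) > abs(ry - cy) AND abs(rx - cx) > abs(rz - cz)
              THEN (-ry - rz)::int ELSE rx::int END,
         CASE WHEN abs(rx - cx) > abs(ry - cy) AND abs(rx - cx) > abs(rz - cz)
              THEN rz::int
              WHEN abs(ry - cy) > abs(rz - cz)
              THEN rz::int
              ELSE (-rx - ry)::int END
  FROM cube;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

-- With something like this, "level 4, hexagon 34,52" just means picking a
-- hexsize per level and addressing cells by (q, r); points could then be
-- grouped by (q, r) exactly as in the rectangle case.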
Huh, I wonder what we do now for non-point grid/hex maps. Will have to take a look.
Depending on the zoom level, CDB_RectangleGrid can be extremely slow. The query was generated as part of the map preview, and the full query looks something like the sample query shown earlier.