Take a look at polygon intersection performance

javitonino commented 7 years ago

When we pass polygons to DO, they are intersected against the geometries and an interpolation is made to try to calculate what percentage of area is taken by each intersection. The code is here.

It would be nice to take a look around and see if we can find anything to improve performance.

antoniocarlon commented 7 years ago

The spatial predicate (st_intersects) is correct and uses spatial indices as expected.

The code performs a simplification of the user entities using the Visvalingam-Whyatt algorithm but only if the number of vertices of the geometry is >1000.

Simplifying the geometries to 0.00000001 degrees (more or less 1 millimeter) using ST_SimplifyVW also removes colinear vertices (as the algorithm progressively removes points with the least perceptible change in area) which boosts performance.

My proposal is to always perform simplification of the geometries to 0.00000001 degrees or less to reduce the number of points to compute the intersection.

javitonino commented 7 years ago

Can we get some quick benchmarks of this? I would be interested if the simplification code makes it go faster in all cases, as the simplification itself could take a while. Can we get some timings for both cases (unsimplified, simplify+intersect) with small and large polygons; e.g: ~5 points, ~1k points, you probably know which numbers of vertices are more interesting.

Finally, have you checked if the ST_Within are a performance boost or drag? It seems like the area of the intersection should be ok for all cases, and the ST_Within are just a shortcut for more trivial cases. But maybe it makes it go slower by adding more checks?

antoniocarlon commented 7 years ago

For small geometries (10 vertices) the times are the same: 250ms for both the queries (ith and without simplifications)
For medium size geometries (100 vertices) the times are the same (450ms)
For large geometries (1000 vertices) the time for the query without simplification raises to 4.5 seconds and the time for the query with simplified geometries raises to 2 seconds
For extra large geometries (3500 to 10000 vertices) the time for the query without simplification goes to 6.5 seconds and the time for the query with simplified geometries goes to 2.5 seconds

The ST_Within effect is very difficult to measure as it depends on the input data (if the first geometry bounding box is inside the second geometry, it's guaranteed to be within with very little comprobations, and vice-versa, but if the geometries sizes (in terms of area and length) are similar ST_Within will increment the cost as we need to perform two ST_Within operations and one intersection). In general, doing ST_Within to avoid intersections is a good idea (as intersections are very expensive).

javitonino commented 7 years ago

Looks good to me. In conclusion, we could have a significant gain with a very fine simplification (to millimeter levels), so maybe we should consider doing it (in obs-extension or camshaft).

It seems the times are pretty linear with the number of points per polygon, so maybe we could use the total number of points in a dataset (rows * n_points) as an estimation of how much time it will take. This could be useful to add some limits. We can user some of these measures in order to guide fine-tuning of the pre-check.

cc @ethervoid in case he has anything to us.

ethervoid commented 7 years ago

Love the idea to perform a minimum simplification and get a great gain.

So we're talking about doing the this kind of simplification here in order to simplify the procgeoms too, aren't we?

Maybe we could deal with this while we're importing the geometries with the ETL to avoid making operations at execution level?

antoniocarlon commented 7 years ago

Yes, that's the point, I would always simplify the geometries (a simplification factor of 0.0000001 is about 11 milimetres of tolerance) to perform the spatial queries (the code you are highlighting).

I would also invest some time in investigating the possible drawbacks of simplifying the geometries on the ETL to boost performance. The main drawback I have thought of is that, if the source geometries have topology, we may lost it (we will produce minimum intersections/holes between adjacent features) and we may need to regenerate the topology.

ethervoid commented 7 years ago

It seems that for example TIGER provides the topology faces (https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2016/TGRSHP2016_TechDoc_Ch2.pdf) so as you said could be problematic to perform the simplification at import level.

javitonino commented 7 years ago

I'm closing this one for now, as we got some information about the investigation. I've added a link from the wishlist, so we don't lose this.

antoniocarlon commented 7 years ago

I have the tested the performance impact of simplifying the DO geometries using the Visvalingam-Whyatt algorithm and it looks very promising (results here).

The drawback is that we loose the topology (if the imported data was topologically correct) but the impact (error) in the percent fill, and thus the expected impact on the results of the data augmentation using DO, is low.

We will need to determine the best tolerance factor to be applied to the ST_SimplifyVW function for each dataset and add it to the ETL.

antoniocarlon commented 7 years ago

I have done some tests checking the usage of ST_Within vs ST_CoveredBy. The performance of ST_Within is slightly better.

Also I have tested changing the order of statements inside the CASE to give precendence to the case where the user geometry completely covers the DO geometry:

WHEN ST_Within(t.the_geom, _geoms.wkb_geometry) THEN 1
WHEN ST_Within(_geoms.wkb_geometry, t.the_geom) THEN ST_Area(_geoms.wkb_geometry) / Nullif(ST_Area(t.the_geom), 0)

There is no performance improvement with this modification either.

The worst scenario for the query is when the user provides a very large geometry. In this case the spatial index of the DO layer is poorly selective.

javitonino commented 7 years ago

About VW simplification: What happens if you put a point somewhere where there is an overlap. We should make sure GetData returns a single value. I am not sure the current JOINs guarantee that. Have you tested this?

ethervoid commented 7 years ago

overlap of geometries for the boundaries?

javitonino commented 7 years ago

A input point, inside an overlap of DO boundaries.

ethervoid commented 7 years ago

Good question

antoniocarlon commented 7 years ago

Indeed. I'm making a test

antoniocarlon commented 7 years ago

I've tested it, and in case of overlaps for points the OBS_GetData function returns the result for one of the intersecting polygons. For polygons the OBS_GetData function returns the result of the sum of the intersecting polygons.

To test it I have used this query faking a overlaping polygon in top of the testing point/polygon:

select cdb_observatory.obs_getdata('{"(0101000020E61000009957010078865FC034F0486183F24840,1)"}',
'[{"id" : 1, "numer_id" : "ca.statcan.cols_nhs.t025c003_m", "timespan_rank" : 1, "score_rank" : 1, "timespan_rownum" : 1, "score_rownum" : 1, "score" : 1, "suggested_name" : "t025c003_m_2011", "numer_aggregate" : "sum", "numer_colname" : "t025c003_m", "numer_geomref_colname" : "geo_code", "numer_tablename" : "obs_ee1e05b06a5f3ef55d140a992fd984061ac06455", "numer_type" : "Numeric", "numer_description" : null, "numer_t_description" : null, "denom_aggregate" : "sum", "denom_colname" : "t001c001_m", "denom_geomref_colname" : "geo_code", "denom_tablename" : "obs_222244f6be207628b9175f095f99d9bdc810233b", "denom_type" : "Numeric", "denom_reltype" : "denominator", "denom_description" : null, "denom_t_description" : null, "geom_colname" : "the_geom", "geom_geomref_colname" : "geom_id", "geom_tablename" : "obs_aa1f4434a2a59b2c3bddcc151faac88bf8d6c8e6", "geom_type" : "Geometry", "geom_timespan" : "2011", "geom_description" : "", "geom_t_description" : null, "numer_timespan" : "2011", "numer_name" : "Total population - Christian (male)", "denom_name" : "Total population in private households (male)", "geom_name" : "Canada, provinces and territories", "normalization" : "prenormalized", "max_timespan_rank" : null, "max_score_rank" : null, "target_geoms" : null, "target_area" : null, "num_geoms" : null, "denom_id" : "ca.statcan.cols_nhs.t001c001_m", "geom_id" : "ca.statcan.geo.pr_"}]',
true)

javitonino commented 7 years ago

I'm closing this one, as our research can be considered complete. Moving it to the appropriate repo as well: https://github.com/CartoDB/bigmetadata/issues/292

CartoDB / observatory-extension

Take a look at polygon intersection performance #304