Closed — javitonino closed this issue 7 years ago
Tested on a local DO Analysis using a buffer on a rivers layer (thank you @AbelVM) in Spain and Europe.
Acceptance steps?
Adding a simplification task:
```python
class RawGeometry(TempTableTask):
    ...

class SimplifiedRawGeometry(TempTableTask):
    def requires(self):
        return RawGeometry()

    def run(self):
        yield SimplifyGeometriesMapshaper(schema=self.input()._schema,
                                          table_input=self.input()._tablename,
                                          table_output=self.output()._tablename,
                                          geomfield='wkb_geometry')
        # The SimplifyGeometriesPostGIS task can be used as well

class Geometry(TableTask):
    def requires(self):
        return {
            'data': SimplifiedRawGeometry()
        }
    ...
```
Simplify an existing table (command line):
```shell
make -- run simplification SimplifyGeometriesMapshaper --schema es.ine --table-input rawgeometry__99914b932b
```
Note that SimplifyGeometriesPostGIS can also be used to simplify, and that table_output and other default parameters can be specified explicitly.
The expected result is a simplified version of the input table in the same schema as the original (not overwriting it), with the same record count, no invalid geometries, and fewer points (~50% by default). Also, if we are simplifying using Mapshaper, the output features are expected to maintain topology.
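A sketch of how these acceptance criteria could be checked in SQL. This is illustrative only: the schema/table names and the wkb_geometry column are hypothetical examples, not the actual tables produced by the tasks.

```sql
-- Hypothetical acceptance check: same record count, no invalid geometries,
-- and a lower average point count in the simplified table.
SELECT
    (SELECT count(*) FROM "es.ine".rawgeometry)           AS rows_in,
    (SELECT count(*) FROM "es.ine".simplifiedrawgeometry) AS rows_out,
    (SELECT count(*) FROM "es.ine".simplifiedrawgeometry
      WHERE NOT ST_IsValid(wkb_geometry))                 AS invalid_out,
    (SELECT avg(ST_NPoints(wkb_geometry))
       FROM "es.ine".simplifiedrawgeometry)               AS avg_points_out;
```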
I've found one problem using the Mapshaper task: ogr2ogr is creating invalid geometries during the import (not in the simplification process) due to the forced conversion to MultiPolygon. From the ogr2ogr documentation:
Some forced geometry conversions may result in invalid geometries, for example when forcing conversion of multi-part multipolygons with -nlt POLYGON, the resulting polygon will break the Simple Features rules.
Here is an example:

```
gis=# select count(*) from observatory.obs_43157a9633ea9b512d00a8416d5faf1d0c8452fd where ST_IsValid(the_geom) is false;
NOTICE:  Ring Self-intersection at or near point -4.1010282450000082 42.791497261000011
 count
-------
     1
(1 row)
```
So, I think we should execute ST_CollectionExtract(ST_MakeValid(the_geom), 3) after the ogr2ogr operation has finished.
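A sketch of that fix-up as a SQL statement, using the table from the example above (in practice the table name would be parameterized by the task):

```sql
-- Repair invalid geometries left by the forced MultiPolygon conversion;
-- ST_CollectionExtract(..., 3) keeps only the polygonal parts.
UPDATE observatory.obs_43157a9633ea9b512d00a8416d5faf1d0c8452fd
SET the_geom = ST_CollectionExtract(ST_MakeValid(the_geom), 3)
WHERE NOT ST_IsValid(the_geom);
```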
I'll take a look at it
I've added simplification tasks for the Spain and Canada Census data:
WITHOUT SIMPLIFICATION

```
  max  | min |         avg
-------+-----+----------------------
 15067 |   4 | 186.3554505005561735
(1 row)
```

WITH SIMPLIFICATION

```
 max  | min |         avg
------+-----+----------------------
 4047 |   4 | 100.0852335928809789
(1 row)
```
There is no visual difference between the two kinds of geometries.
WITHOUT SIMPLIFICATION

```
 max  | min |         avg
------+-----+----------------------
 1452 |  13 | 650.3157894736842105
```

WITH SIMPLIFICATION

```
 max | min |         avg
-----+-----+----------------------
 834 |   6 | 325.4736842105263158
```
Here is an image with the difference between simplified and non-simplified geometries for Balears.
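For reference, the point-count tables above look like the output of a query of this shape (the table name is a hypothetical example, assuming the geometry column is the_geom):

```sql
-- Max, min and average vertex count per feature.
SELECT max(ST_NPoints(the_geom)),
       min(ST_NPoints(the_geom)),
       avg(ST_NPoints(the_geom))
FROM "es.ine".geometry;
```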
Those are the expected results.
Based on this explanation, the correlation between a "worse" simplification with Mapshaper (removing too many significant points) and the PostGIS simplification removing too few points when both use the default parameter values is direct and expected.
Anyway, we need to find the best simplification parameter for each table, to improve performance without losing significant points of the geometry.
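As a language-agnostic illustration of why a distance tolerance and a retain percentage are two parametrizations of the same knob, here is a minimal pure-Python Douglas-Peucker sketch (the algorithm behind PostGIS's ST_Simplify; Mapshaper defaults to Visvalingam, but the tolerance-vs-retained-points tradeoff is analogous). This is a toy, not either library's implementation.

```python
import math

def _perp_dist(pt, a, b):
    # Distance from pt to the segment a-b.
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    t = max(0.0, min(1.0, ((x - x1) * dx + (y - y1) * dy) / (dx * dx + dy * dy)))
    return math.hypot(x - (x1 + t * dx), y - (y1 + t * dy))

def simplify(points, tolerance):
    # Classic Douglas-Peucker: keep the endpoints, recurse on the
    # farthest interior vertex while it exceeds the tolerance.
    if len(points) < 3:
        return list(points)
    dists = [_perp_dist(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__)
    if dists[i] <= tolerance:
        return [points[0], points[-1]]
    left = simplify(points[:i + 2], tolerance)
    right = simplify(points[i + 1:], tolerance)
    return left[:-1] + right

# A noisy line: larger tolerances retain fewer points (ST_Simplify style);
# a retain-% parametrization instead searches for the tolerance that keeps
# roughly N% of the vertices.
line = [(x / 10.0, math.sin(x / 10.0)) for x in range(101)]
for tol in (0.001, 0.01, 0.1):
    kept = simplify(line, tol)
    print(tol, len(kept), "of", len(line), "points retained")
```

The point of the sketch: the same output can be reached by fixing a tolerance or by fixing a retained fraction, which is why the two parameters correlate but have "almost opposite" meanings.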
Mapshaper also takes a resolution parameter (instead of the retain %). Maybe that is more similar to the Postgres parametrization?
I have tested several input parameters and alternatives and I think that using the percentage of retained points is a good approach.
The results are not bad at all, I just wanted to point out (and document) that the correlation that we see in the tests is the result that we expected and reminds us that we can't rely on the default parameters.
Yeah, results look good. But looking at Mario's number, it seems like Canada could use a more aggressive simplification than Spain (which is already pretty simplified). In that sense, maybe both geometries are OK to be simplified with a parameter of 100m, but for the % approach we would probably need to set separate parameters for each. Not a big deal, since we are probably going to be tweaking those numbers manually anyway, so this is probably a moot point.
Yes, as @AbelVM pointed out with his articles, the simplification factor will depend a lot on the complexity of the geometries and the resolution of the geography (Canada for example has a very complex shore line that we may need to maintain in more detailed resolutions).
Mapshaper has an interval parameter that allows us to specify the simplification factor in distance units, but since the PostGIS simplification will only be used marginally (for larger, very memory-consuming datasets) and the retain percentage seemed more natural and convenient to me, I didn't use it.
Maybe I'm wrong and it'd be better to have a consistent parameter in both simplifications... I'll do a quick test now that I'm more unbiased thanks to your opinions :)
Typically, simplification means reducing the number of vertices while keeping the topology and the same values for RBF and shape factor, which is not a trivial task if you want to be strict.
In the 2nd article of my previous comment, they explain that the geometrical info is not that relevant if the values within the simplified polygon remain ~the same. So, simplification of a polygon in DO should take into account (the distribution of) the data that would be exposed through that geometry. Usually, population distribution should do the trick. The GPW v4 dataset (world population on a 250 m resolution grid) might help to identify the places where higher simplification could be applied.
I've done a quick test and you were right. Having the same meaning for both factors (Mapshaper and PostGIS simplifications) is more intuitive although the value is different due to the different implementations. I was very biased and while testing I realized that it wasn't very natural having two parameters with almost opposite meanings.
I'm changing it :+1:
From https://github.com/CartoDB/observatory-extension/issues/304
Summary of what we have learned:
Next steps: