data regeneration: use better filters/spatial queries to identify identical parcel geometries

tbuckl commented 9 years ago

this query

https://github.com/MetropolitanTransportationCommission/bayarea_urbansim/blob/master/data_regeneration/geom_aggregation.py#L241-L243

has a very costly query plan:

mtc=> EXPLAIN SELECT * into stacked FROM parcels
mtc-> where geom in (select geom from parcels
mtc(> group by geom having count(*) > 1);
                                              QUERY PLAN

-------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=1845244.87..369729649373.73 rows=2527 width=510)
   Join Filter: (parcels.geom = parcels_1.geom)
   ->  Seq Scan on parcels  (cost=0.00..1196542.38 rows=2526738 width=510)
   ->  Materialize  (cost=1845244.87..2042251.70 rows=2526738 width=327)
         ->  GroupAggregate  (cost=1845244.87..1895779.63 rows=2526738 width=327)
               Filter: (count(*) > 1)
               ->  Sort  (cost=1845244.87..1851561.71 rows=2526738 width=327)
                     Sort Key: parcels_1.geom
                     ->  Seq Scan on parcels parcels_1  (cost=0.00..1196542.38 rows=2526738 width=327)
(9 rows)

we can probably use filters and postgis st_equals here:

for example, like this: http://stackoverflow.com/questions/18769250/finding-multiple-duplicates-in-postgres

tbuckl commented 9 years ago

i think this:

SELECT p2.gid
FROM parcels p1, parcels p2
WHERE ST_Equals(p1.geom, p2.geom) AND
p1.gid <> p2.gid;

might be more efficient. will check the results against the other.

tbuckl commented 9 years ago

@janowicz @mkreilly Was the intention of stacking to group parcels that share an identical bounding box? That seems to be what the GROUP BY below does.

https://github.com/MetropolitanTransportationCommission/bayarea_urbansim/blob/master/data_regeneration/geom_aggregation.py#L243

See: http://postgis.net/docs/manual-2.1/ST_Geometry_EQ.html

I will ask someone that knows more to double-check.

If not, then we could substitute with ST_EQUALS, which checks the identity of the entire geometry.

It seems to be a question of precision though, because ST_EQUALS can be subject to rounding errors.

janowicz commented 9 years ago

The intention was, as you said, to group parcels with the same geometry. Ah, good to know that group by geom means only that their bounding boxes are the same!

On Tue, Jun 30, 2015 at 10:31 AM, Tom Buckley notifications@github.com wrote:

= != =

— Reply to this email directly or view it on GitHub https://github.com/synthicity/bayarea_urbansim/issues/53#issuecomment-117271909 .

tbuckl commented 9 years ago

@janowicz ok well for posterity, the GROUP BY clause actually performs better for what the intention is. i think this is because of the rounding errors in st_equals. perhaps we should think about using the hash of the bounding box as an identifier in the future?

here's an example of that, for posterity. the stacked_not_identical (shown as brownish in here -- blue+red) parcels are parcels that are = (group by) but not st_equals (to some other parcel in the parcels table).

i'll leave this issue open because i do still wonder if we can improve that query.

tbuckl commented 9 years ago

this query produces very similar results and makes use of indexes. i'll submit a PR with it:

SELECT p1.* into stacked
FROM parcels p1, parcels p2
WHERE p1.geom && p2.geom
AND p1.geom=p2.geom AND
p1.gid <> p2.gid;

janowicz commented 9 years ago

@buckleytom Thanks!

tbuckl commented 9 years ago

This pull request closes this issue:

https://github.com/synthicity/bayarea_urbansim/pull/54

UDST / bayarea_urbansim

data regeneration: use better filters/spatial queries to identify identical parcel geometries #53