Closed by JoeGermuska 10 years ago
It's expensive because it's computing the area of the intersection between the parent and child geographies, which can get CPU intensive. I did that because it means we get the containment results ordered by containment percentage. At one point the API only returned the first child (the one that took up the most of the parent), but now it returns all of them, so the ordering probably isn't as important these days.
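To make the discussion concrete, here is a sketch of the kind of statement involved. Table and column names are illustrative only (see `13_index_tiger_2012.sql` for the real statement), and the real script may define the percentage differently:

```sql
-- Hypothetical sketch of the containment insert under discussion.
INSERT INTO tiger2012.census_geo_containment (parent_geoid, child_geoid, percent_covered)
SELECT parent.geoid,
       child.geoid,
       -- fraction of the child's area that falls inside the parent
       ST_Area(ST_Intersection(parent.the_geom, child.the_geom)) / ST_Area(child.the_geom)
FROM tiger2012.county AS parent
JOIN tiger2012.place  AS child
  ON ST_Intersects(parent.the_geom, child.the_geom)
ORDER BY 3 DESC;  -- this sort is the expensive part being questioned below
```

The `ST_Intersection`/`ST_Area` work has to happen either way; the `ORDER BY` on top of it is what this thread is about.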
But couldn't the API do the ORDER BY on its SELECT query instead of at table-create time? It's specifically the sort I'm asking about, separate from the area computation.
I'm asking because I'm trying to fix the state legislative stuff (#5) and running these scripts (or variants of them) on staging to see if it's going right. If we update production by running the scripts again (instead of some kind of import) then I'd rather minimize the downtime of recreating the census_geo_containment table.
I did it at table-create time because it only has to run once, instead of on every SELECT call. It can definitely happen at query time instead; it just moves the work from the DB build cycle to the request cycle.
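Moving the sort to the request cycle would look something like this, assuming the containment table keeps a stored percentage column (column names are illustrative, not the script's exact schema):

```sql
-- Hypothetical request-time query: sort per lookup instead of at build time.
-- The index makes the per-parent sort cheap at each API call.
CREATE INDEX IF NOT EXISTS census_geo_containment_parent_idx
    ON tiger2012.census_geo_containment (parent_geoid, percent_covered DESC);

SELECT child_geoid, percent_covered
FROM tiger2012.census_geo_containment
WHERE parent_geoid = '17031'   -- example parent geoid
ORDER BY percent_covered DESC;
```

With an index like that, the trade-off mostly disappears: the build skips the global sort, and each request pays only for an index scan.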
If your tests show the insert is faster without the ordering, take it out and don't worry about sorting at all for now.
I had only run an `EXPLAIN` before. I turned `\timing` on in psql and tried the first insert (places in counties). Without the ORDER BY: 287978.120 ms. With it: 378834.444 ms.
So it's slow either way, although in the aggregate, saving roughly 25% on each insert may be worth it.
I wonder if we should update production by schema export/import instead of re-running everything? We'll probably want to update the TIGER SQL dump that we publish anyway.
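An export/import along these lines would avoid re-running the expensive inserts on production. This is only a sketch; hostnames, user, and database name are placeholders:

```shell
# Dump just the containment table from staging (schema + data),
# then load it into production. --clean drops the old table first.
pg_dump -h staging-host -U census -t tiger2012.census_geo_containment \
        --clean --no-owner census > census_geo_containment.sql

psql -h prod-host -U census census < census_geo_containment.sql
```

Downtime then shrinks to the restore itself rather than the full rebuild.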
The explain plan for the `INSERT INTO tiger2012.census_geo_containment` statements in the TIGER indexing shows the ORDER BY to be very expensive. Is there a rationale for ordering the inserts?
https://github.com/censusreporter/census-postgres-scripts/blob/master/13_index_tiger_2012.sql#L701-L703