censusreporter / census-postgres-scripts

Scripts used to set up census-postgres on an Amazon EC2 instance.
MIT License

rationale for 'order by' clause in 13_index_tiger_2012.sql #6

Closed · JoeGermuska closed this issue 10 years ago

JoeGermuska commented 10 years ago

The EXPLAIN plan for the INSERT INTO tiger2012.census_geo_containment statements in the TIGER indexing script shows the ORDER BY to be very expensive.

Is there a rationale for ordering the inserts?

https://github.com/censusreporter/census-postgres-scripts/blob/master/13_index_tiger_2012.sql#L701-L703
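
For context, the statements in question look roughly like this. This is a paraphrase, not the exact script: the geometry column and the percent_covered column name are assumed from memory, so check the linked lines for the real definitions.

```sql
-- Paraphrased place-in-county insert: the ORDER BY forces a sort keyed
-- on the intersection area, which is what EXPLAIN flags as expensive.
INSERT INTO tiger2012.census_geo_containment (child_geoid, parent_geoid, percent_covered)
SELECT child.geoid,
       parent.geoid,
       ST_Area(ST_Intersection(child.the_geom, parent.the_geom)) / ST_Area(child.the_geom) * 100
FROM tiger2012.place child
JOIN tiger2012.county parent
  ON ST_Intersects(child.the_geom, parent.the_geom)
ORDER BY ST_Area(ST_Intersection(child.the_geom, parent.the_geom)) DESC;
```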

iandees commented 10 years ago

It's expensive because it's computing the area of the intersection between the parent and child geographies, which can get CPU-intensive. I did that because it means we get the containment results in order of containment percentage. At one point the API only returned the first child (the one that took up the most of the parent), but now it returns all of them, so the ordering probably isn't as important nowadays.

JoeGermuska commented 10 years ago

But couldn't the API use ORDER BY on the SELECT query (sketched below) instead of at table-create time? It's specifically the sort step I'm asking about, separate from the area computation.

I'm asking because I'm trying to fix the state legislative stuff (#5) and running these scripts (or variants of them) on staging to see that it's going right. If we update production by running the scripts again (instead of some kind of import), then I'd rather minimize the downtime of recreating the census_geo_containment table.
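
A minimal sketch of what sorting at read time could look like, assuming a percent_covered column as in the paraphrase above; the geoid value is a made-up example:

```sql
-- Sort when the API reads the table instead of when the table is built;
-- the child covering the most of the parent still comes back first.
SELECT child_geoid, percent_covered
FROM tiger2012.census_geo_containment
WHERE parent_geoid = '05000US17031'  -- hypothetical example geoid
ORDER BY percent_covered DESC;
```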

iandees commented 10 years ago

I did it at table-create time because the sort only has to run once, instead of on each SELECT call. It can definitely happen at query time instead; it just moves the work to the request cycle rather than the DB build cycle.

If your tests show that dropping the ordering makes the insert faster, then take it out and don't worry about sorting at all for now.

JoeGermuska commented 10 years ago

Before, I had only looked at the EXPLAIN plan. Now I turned timing on and tried the first insert (places in counties):

Without ORDER BY: 287978.120 ms
With ORDER BY: 378834.444 ms
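
(For anyone reproducing this, the numbers come from psql's built-in per-statement timer, along these lines:)

```sql
-- In psql: print "Time: <elapsed> ms" after every statement.
\timing on

-- Then run the place-in-county insert twice, once with the ORDER BY
-- clause and once with it removed, and compare the reported times.
```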

So it's slow either way, although in the aggregate, saving roughly 25% on each insert may be worth it.

I wonder if we should update production by schema export/import (sketched below) instead of re-running everything? We'll probably want to update the TIGER SQL dump that we publish.
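
A rough sketch of that export/import route, assuming the table lives in the tiger2012 schema; the hostnames, usernames, and database names here are placeholders:

```sh
# Dump just the containment table from staging, then load it into
# production in a single transaction. All connection details are made up.
pg_dump -h staging.example.com -U census -t tiger2012.census_geo_containment \
        --clean census > census_geo_containment.sql
psql -h production.example.com -U census -1 -f census_geo_containment.sql census
```

That should keep production's downtime to the load itself rather than the full geometry recompute.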