Closed juanignaciosl closed 5 years ago
@javitonino @antoniocarlon do you want to do a first CR? You'll probably see tons of things to change, such as the infamous complete
overwrites ;-) Be meticulous ;-)
On the logic side of things, I've seen... 1.- Many (>75k) geometries with several parents (block < blockgroup, mainly). Is that expected? 2.- I haven't checked yet how many children don't have parents. If a child don't have any parent, you won't be able to query. In those cases I could skip levels ("if this child don't have a parent in the next level, go up until you find one"). Do you think that it's worth taking this case into account?
Answering your questions: 1.- I wouldn't expect so many children with several parents, especially in the US. Could that be caused by spurious intersections between different levels (intersections with a very small area caused by the minimum differences in the simplifications for example)? 2.- I wouldn't pay much attention to this. For me is a corner case right now because we are trying to use geography levels with geometries that "match" as much as possible
1.- It doesn't look like minor intersections, weight
(the intersection area) is not close to 0.
gis=# select child_level, count(1) from (select child_id, child_level, parent_id, parent_level, weight from "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" where weight != 1 order by child_level, child_id) x group by child_level;
child_level | count
-------------------------+--------
block | 147428
block_group | 3757
census_tract | 252
county | 16
place | 14
school_district_unified | 34
zcta5 | 126
Some examples:
You can get the children with multiple parents with this:
select child_id, child_level, parent_id, parent_level, weight from "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" where weight != 1 order by child_level, child_id;
Then, for each child....
WITH RECURSIVE sublevels(child_id, child_level, parent_id, parent_level, weight) AS (
SELECT child_id, child_level, parent_id, parent_level, weight
FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24"
WHERE child_id = '<child_id>'
UNION ALL
SELECT c.child_id, c.child_level, c.parent_id, c.parent_level, c.weight
FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" c,
sublevels p
WHERE p.parent_id = c.child_id
AND p.parent_level = c.child_level
)
select child_id, child_level, parent_id, parent_level, weight
FROM sublevels
040050023005167 block 040050023001 block_group 1263.5855334765538
040050023005167 block 040050023005 block_group 1263.5855334765538
040050023001 block_group 04005002300 census_tract 1
040050023005 block_group 04005002300 census_tract 1
04005002300 census_tract 86337 zcta5 1
04005002300 census_tract 86337 zcta5 1
160399605001108 block 160399604002 block_group 4619.387513039036
160399605001108 block 160399605001 block_group 4619.387513039036
160399604002 block_group 16039960400 census_tract 1
160399605001 block_group 16039960400 census_tract 25934066.03733179
160399605001 block_group 16039960500 census_tract 25934269.671080314
16039960400 census_tract 83647 zcta5 1
16039960400 census_tract 83647 zcta5 1
16039960500 census_tract 83648 zcta5 1
83648 zcta5 1654820 place 1
170279003003028 block 170279003003 block_group 14361.794205384369
170279003003028 block 170279003005 block_group 14361.794205384369
170279003003 block_group 17027900300 census_tract 1
170279003005 block_group 17027900300 census_tract 1
17027900300 census_tract 62230 zcta5 1
17027900300 census_tract 62230 zcta5 1
060770049012 block_group 06077004901 census_tract 3747872.021563051
060770049012 block_group 06077004902 census_tract 3747966.781465326
06077004901 census_tract 95320 zcta5 1
06077004902 census_tract 95320 zcta5 1
2.- According to the previous results, as you see most of the results stop at ZCTA5. I think that it might be caused by a low coverage of "place" level. Does it make sense? That's obviously not good. As you see, the code takes consecutive level pairs (block-block group, block group-census tract...) and computes the parent. I think that I will make it iterable as well:
1.- I have checked a couple of intersections and I think that the intersections algorithm may be wrong (I'll dive deeper):
160399605001108 block 160399604002 block_group 4619.387513039036
160399605001108 block 160399605001 block_group 4619.387513039036
060770049012 block_group 06077004901 census_tract 3747872.021563051
060770049012 block_group 06077004902 census_tract 3747966.781465326
It seems that the child geometries lay in only one parent (I have checked that the parents don't overlap)
2.- Yes, that makes sense because place
has not full US coverage. If we want to use all the geography levels, then we will need to do what you say (finding the next parent), but for MC I think that we don't need that, we can limit the geography levels (place
for example is not in the MC levels)
1.- Maybe st_pointonsurface
might be returning a point in an edge that is within both? The "algorithm" can't be much simpler...
2.- As this is meant to work with layers with no full coverage, I'll improve the approach :+1:
1.- Yesterday I was wrong because I used the simplified tables instead of the observatory.obs_xxx
tables for my tests. The problem is that we are removing the geometry holes when we populate the TIGER observatory tables and thus, making the parent geometries overlap. The problem is using ST_ExteriorRing
here https://github.com/CartoDB/bigmetadata/blob/master/tasks/us/census/tiger.py#L805-L807
I have created an issue to fix this: https://github.com/CartoDB/bigmetadata/issues/579
Executing make -- docker-run us.hierarchy.USLevelInclusionHierarchy --year 2015 --current-geography county --parent-geography state
raises
2018-09-21 08:07:40,923 [WARNING]: Will not run tasks.us.hierarchy.USLevelInclusionHierarchy(year=2015, current_geography=county, parent_geography=state) or any dependencies due to error in complete() method:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 346, in check_complete
is_complete = task.complete()
File "/bigmetadata/tasks/us/hierarchy.py", line 190, in complete
table=self.input()['level'].qualified_tablename)).fetchall()) == 0
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 626, in input
return getpaths(self.requires())
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in getpaths
return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in <genexpr>
return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 824, in getpaths
return struct.output()
File "/bigmetadata/tasks/base_tasks.py", line 1108, in output
self._columns = self.columns()
File "/bigmetadata/tasks/us/census/tiger.py", line 784, in columns
('geoid', self.input()['geoids'][self.geography + '_{}'.format(self.year) + GEOID_SHORELINECLIPPED_COLUMN]),
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 626, in input
return getpaths(self.requires())
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in getpaths
return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in <genexpr>
return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 824, in getpaths
return struct.output()
File "/bigmetadata/tasks/base_tasks.py", line 118, in output
for col_key, col in self.columns().items():
File "/bigmetadata/tasks/us/census/tiger.py", line 86, in columns
name='Shoreline clipped ' + '_{}'.format(self.year) + col.name,
AttributeError: 'NoneType' object has no attribute 'name'
@antoniocarlon FYI this task has changed, but I haven't been able to reproduce it.
PGSERVICE=postgres10 docker-compose run -d -e LOGGING_FILE=etl_us.hierarchy.log bigmetadata luigi --module tasks.us.hierarchy tasks.us.hierarchy.USLevelInclusionHierarchy --year 2015 --current-geography county --parent-geographies \'[\"puma\", \"cbsa\", \"congressional_district\", \"state\"]\' --log-level INFO
:point_up: this is just for future reference, don't try it until next CR.
~Ready for next CR~ :red_circle:
Question for @antoniocarlon : the output of this task is two tables for Tilesets API, although this might be useful as well for DO. I've only used TempTableTask
tasks, and I guess that it's not right.
In addition, it's SLOW. For example, us.hierarchy.USLevelHierarchy(year=2015, current_geography=block, parent_geographies=["block_group", "census_tract", "zcta5", "place", "school_district_elementary", "school_district_secondary", "school_district_unified", "county", "puma", "cbsa", "congressional_district", "state"
took more than 8 hours, mostly because of the spatial join:
| INSERT INTO "us.hierarchy"."USLevelHierarchy_block___block_group____2015_2cc12b1390" +|
| (child_id, child_level, parent_id, parent_level, weight) +|
| +|
| SELECT +|
| cit.geoid AS child_id, +|
| cit.level AS child_level, +|
| pgt.geoid AS parent_id, +|
| pit.level AS parent_level, +|
| 1.0::FLOAT AS weight +|
| FROM "us.hierarchy"."USLevelInfoFromGeoNames_block_2015_51839c2b6d" cit +|
| INNER JOIN observatory.obs_06b3cddafd8b3e04b10fc99b06f82f31691adc1f cgt ON cit.geoid = cgt.geoid +|
| LEFT JOIN observatory.obs_79854433ead0c56f8eccb0eec5d9c43546c9c0b4 pgt +|
| ON ST_Within(ST_PointOnSurface(cgt.the_geom), pgt.the_geom) +|
| LEFT JOIN "us.hierarchy"."USLevelInfoFromGeoNames_state_2015_edb7353cc9" pit ON pgt.geoid = pit.geoid +|
| +|
| INNER JOIN "us.hierarchy"."USLevelHierarchy_block___block_group____2015_2cc12b1390" ot ON ot.child_id = cit.geoid+|
| AND ot.chi |
USLevelHierarchyWeights
is also pretty slow. Bottleneck:
| UPDATE "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b" +|
| SET weight = ST_Area( +|
| ST_Intersection(cgt.the_geom, pgt.the_geom), False) +|
| FROM +|
| observatory.obs_2ff6a0fbd931c42a5b3a8551c9cf3e81d064aa83 cgt, +|
| observatory.obs_9ff330b0ce6ba50b58fa12babc86a4a7011e1af2 pgt +|
| WHERE cgt.geoid = "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b".child_id +|
| AND pgt.geoid = "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b".parent_id+|
| AND (child_id, child_level) IN ( +|
| +|
| SELECT DISTINCT child_id, child_level +|
| FROM "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b" +|
| WHERE weight = 1 +|
| GROUP BY child_id, child_level +|
| HAVING count(1) > 1 +|
| +|
| ) +|
I still haven't devoted time to speed things up, but if you see something, let me know.
Well, there will probably be a ton of stuff beyond that :sweat_smile:
Recursive query example for getting the level hierarchy:
WITH RECURSIVE sublevels(child_id, child_level, parent_id, parent_level, weight) AS (
SELECT child_id, child_level, parent_id, parent_level, weight
FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24"
WHERE child_id = '010010201001001'
UNION ALL
SELECT c.child_id, c.child_level, c.parent_id, c.parent_level, c.weight
FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" c,
sublevels p
WHERE p.parent_id = c.child_id
AND p.parent_level = c.child_level
)
select parent_id, parent_level, geoname, weight
FROM sublevels s
inner join "us.hierarchy"."USHierarchyInfoUnion_2015_8bcc49cd24" i
on i.geoid = s.parent_id and i.level = s.parent_level
output
010010201001 block_group Block Group 1 1
01001020100 census_tract 201 1
36067 zcta5 36067 1
0100240 school_district_unified Autauga County School District 1
01001 county Autauga 1
0102100 puma Elmore, Autauga, Montgomery (Outer) & Lowndes Counties PUMA 1
33860 cbsa Montgomery, AL 1
0102 congressional_district Congressional District 2 1
01 state Alabama 1
OK, :green_apple: , ready for next CR.
2da7813bc240363dc80a94666c834d6aca3f5ccf will probably break some task at hierarchy computation. I had to revert the change because it broke other stuff. I think that I had to add it because "<schema>".Tablename
won't work if schema contains dots, but I think that we can move on and add a fix if it fails again.
Ready for more CRs. The name case commit might have a broad impact, as we've said.
I've pushed a minor change in the helper script to display elapsed time on ongoing tasks.
I began with the acceptance yesterday (local) and it hasn't finished yet. I'll retry it this weekend.
As the task begins, I'm receiving a bunch of errors ERROR running ...
from the complete method of USLevelHierarchyWeights
. These are expected errors but the logging made me look into them, could we avoid logging the query and marking the errors as expected
in the logs?
I have executed make -- docker-run us.hierarchy.USHierarchy --year 2015
locally and it raises Unfulfilled dependencies at run time
. Reexecuting it didn't help:
[ERROR]: [pid 1] Worker Worker(salt=151657519, workers=1, host=f9b76b1b349b, username=root, pid=1) failed tasks.us.hierarchy.USHierarchyChildParentsUnion(year=2015)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 187, in run
raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependencies at run time: tasks.us.hierarchy.USLevelHierarchyWeights_block___block_group____2015_2cc12b1390, tasks.us.hierarchy.USLevelHierarchyWeights_block_group___census_tract___2015_e7749a32d6
Executed locally:
This progress looks :) because there were no failed tasks or missing external dependencies
:rocket:
Back to acceptance to finish some tasks:
tiler.us_hierarchy_child_parents_{year}
.tiler.us_hierarchy_geonames_{year}
.tiler.us_hierarchy_child_parents (parent_id, parent_level)
index.year
column.I'll ping again when this is verified to be running for the tiler.
Blocked by the refactor here https://github.com/CartoDB/bigmetadata/pull/586
Ready for CR and acceptance, together with CartoDB/do_tiler/pull/38.
In local I've generated the new table and used it from the tiler successfully.
Added do not merge yet
because this must not be merged into master until a new DO dump is released, because it includes low level changes at ORM level that might break other stuff. As this branch is not needed for DO but for Tilesets API, this is not a problem for CR and acceptance.
:red_circle:
LOG_LEVEL=DEBUG make -- docker-run us.hierarchy.USDenormalizedHierarchy --year 2015
/usr/local/lib/python3.5/dist-packages/luigi/parameter.py:261: UserWarning: Parameter "year" with value "2015" is not of type string.
warnings.warn('Parameter "{}" with value "{}" is not of type string.'.format(param_name, param_value))
It looks like we may be passing an integer (from hierarchy tasks) to ShorelineClip (which takes a string).
Acceptance status:
Hierachy:
Tilesets:
Data has been deployed to production. This will be kept unmerged until a new DO dump is released.
This has been stuck for weeks. I'm going to merge it into Add_more_countries_to_MC_tiler
and ask for CR and Acceptance on it.
Description: CartoDB/Geographica-Product-Coordination/issues/14
@antoniocarlon could you take a quick look just to check that this is a good scaffolding, taken into account Luigi and existing classes and so on?
Tilesets API MRLI Release
These PRs (together):
Close these issues: