CartoDB / bigmetadata

BSD 3-Clause "New" or "Revised" License
43 stars 11 forks source link

Parent names for geometries #572

Closed juanignaciosl closed 5 years ago

juanignaciosl commented 6 years ago

Description: CartoDB/Geographica-Product-Coordination/issues/14

@antoniocarlon could you take a quick look just to check that this is a good scaffolding, taken into account Luigi and existing classes and so on?

Tilesets API MRLI Release

These PRs (together):

Close these issues:

juanignaciosl commented 6 years ago

@javitonino @antoniocarlon do you want to do a first CR? You'll probably see tons of things to change, such as the infamous complete overwrites ;-) Be meticulous ;-)

On the logic side of things, I've seen... 1.- Many (>75k) geometries with several parents (block < blockgroup, mainly). Is that expected? 2.- I haven't checked yet how many children don't have parents. If a child don't have any parent, you won't be able to query. In those cases I could skip levels ("if this child don't have a parent in the next level, go up until you find one"). Do you think that it's worth taking this case into account?

antoniocarlon commented 6 years ago

Answering your questions: 1.- I wouldn't expect so many children with several parents, especially in the US. Could that be caused by spurious intersections between different levels (intersections with a very small area caused by the minimum differences in the simplifications for example)? 2.- I wouldn't pay much attention to this. For me is a corner case right now because we are trying to use geography levels with geometries that "match" as much as possible

juanignaciosl commented 6 years ago

1.- It doesn't look like minor intersections, weight (the intersection area) is not close to 0.

gis=# select child_level, count(1) from (select child_id, child_level, parent_id, parent_level, weight from  "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" where weight != 1 order by child_level, child_id) x group by child_level;
       child_level       | count  
-------------------------+--------
 block                   | 147428
 block_group             |   3757
 census_tract            |    252
 county                  |     16
 place                   |     14
 school_district_unified |     34
 zcta5                   |    126

Some examples:

You can get the children with multiple parents with this:

select child_id, child_level, parent_id, parent_level, weight from  "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" where weight != 1 order by child_level, child_id;

Then, for each child....

WITH RECURSIVE sublevels(child_id, child_level, parent_id, parent_level, weight) AS (
               SELECT child_id, child_level, parent_id, parent_level, weight
               FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24"
               WHERE child_id = '<child_id>'
            UNION ALL
               SELECT c.child_id, c.child_level, c.parent_id, c.parent_level, c.weight
               FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" c,
                    sublevels p
               WHERE p.parent_id = c.child_id
               AND   p.parent_level = c.child_level
               )
select child_id, child_level, parent_id, parent_level, weight
FROM sublevels
040050023005167 block   040050023001    block_group 1263.5855334765538
040050023005167 block   040050023005    block_group 1263.5855334765538
040050023001    block_group 04005002300 census_tract    1
040050023005    block_group 04005002300 census_tract    1
04005002300 census_tract    86337   zcta5   1
04005002300 census_tract    86337   zcta5   1
160399605001108 block   160399604002    block_group 4619.387513039036
160399605001108 block   160399605001    block_group 4619.387513039036
160399604002    block_group 16039960400 census_tract    1
160399605001    block_group 16039960400 census_tract    25934066.03733179
160399605001    block_group 16039960500 census_tract    25934269.671080314
16039960400 census_tract    83647   zcta5   1
16039960400 census_tract    83647   zcta5   1
16039960500 census_tract    83648   zcta5   1
83648   zcta5   1654820 place   1
170279003003028 block   170279003003    block_group 14361.794205384369
170279003003028 block   170279003005    block_group 14361.794205384369
170279003003    block_group 17027900300 census_tract    1
170279003005    block_group 17027900300 census_tract    1
17027900300 census_tract    62230   zcta5   1
17027900300 census_tract    62230   zcta5   1
060770049012    block_group 06077004901 census_tract    3747872.021563051
060770049012    block_group 06077004902 census_tract    3747966.781465326
06077004901 census_tract    95320   zcta5   1
06077004902 census_tract    95320   zcta5   1

2.- According to the previous results, as you see most of the results stop at ZCTA5. I think that it might be caused by a low coverage of "place" level. Does it make sense? That's obviously not good. As you see, the code takes consecutive level pairs (block-block group, block group-census tract...) and computes the parent. I think that I will make it iterable as well:

antoniocarlon commented 6 years ago

1.- I have checked a couple of intersections and I think that the intersections algorithm may be wrong (I'll dive deeper):

160399605001108 block   160399604002    block_group 4619.387513039036
160399605001108 block   160399605001    block_group 4619.387513039036

1

060770049012    block_group 06077004901 census_tract    3747872.021563051
060770049012    block_group 06077004902 census_tract    3747966.781465326

2 3

It seems that the child geometries lay in only one parent (I have checked that the parents don't overlap)

2.- Yes, that makes sense because place has not full US coverage. If we want to use all the geography levels, then we will need to do what you say (finding the next parent), but for MC I think that we don't need that, we can limit the geography levels (place for example is not in the MC levels)

juanignaciosl commented 6 years ago

1.- Maybe st_pointonsurface might be returning a point in an edge that is within both? The "algorithm" can't be much simpler... 2.- As this is meant to work with layers with no full coverage, I'll improve the approach :+1:

antoniocarlon commented 6 years ago

1.- Yesterday I was wrong because I used the simplified tables instead of the observatory.obs_xxx tables for my tests. The problem is that we are removing the geometry holes when we populate the TIGER observatory tables and thus, making the parent geometries overlap. The problem is using ST_ExteriorRing here https://github.com/CartoDB/bigmetadata/blob/master/tasks/us/census/tiger.py#L805-L807

I have created an issue to fix this: https://github.com/CartoDB/bigmetadata/issues/579

antoniocarlon commented 6 years ago

Executing make -- docker-run us.hierarchy.USLevelInclusionHierarchy --year 2015 --current-geography county --parent-geography state raises

2018-09-21 08:07:40,923 [WARNING]: Will not run tasks.us.hierarchy.USLevelInclusionHierarchy(year=2015, current_geography=county, parent_geography=state) or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 346, in check_complete
    is_complete = task.complete()
  File "/bigmetadata/tasks/us/hierarchy.py", line 190, in complete
    table=self.input()['level'].qualified_tablename)).fetchall()) == 0
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 626, in input
    return getpaths(self.requires())
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in getpaths
    return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in <genexpr>
    return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 824, in getpaths
    return struct.output()
  File "/bigmetadata/tasks/base_tasks.py", line 1108, in output
    self._columns = self.columns()
  File "/bigmetadata/tasks/us/census/tiger.py", line 784, in columns
    ('geoid', self.input()['geoids'][self.geography + '_{}'.format(self.year) + GEOID_SHORELINECLIPPED_COLUMN]),
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 626, in input
    return getpaths(self.requires())
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in getpaths
    return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 826, in <genexpr>
    return struct.__class__((k, getpaths(v)) for k, v in six.iteritems(struct))
  File "/usr/local/lib/python3.5/dist-packages/luigi/task.py", line 824, in getpaths
    return struct.output()
  File "/bigmetadata/tasks/base_tasks.py", line 118, in output
    for col_key, col in self.columns().items():
  File "/bigmetadata/tasks/us/census/tiger.py", line 86, in columns
    name='Shoreline clipped ' + '_{}'.format(self.year) + col.name,
AttributeError: 'NoneType' object has no attribute 'name'
juanignaciosl commented 6 years ago

@antoniocarlon FYI this task has changed, but I haven't been able to reproduce it.

PGSERVICE=postgres10 docker-compose run -d -e LOGGING_FILE=etl_us.hierarchy.log bigmetadata luigi --module tasks.us.hierarchy tasks.us.hierarchy.USLevelInclusionHierarchy --year 2015 --current-geography county --parent-geographies \'[\"puma\", \"cbsa\", \"congressional_district\", \"state\"]\' --log-level INFO

:point_up: this is just for future reference, don't try it until next CR.

juanignaciosl commented 6 years ago

~Ready for next CR~ :red_circle:

Question for @antoniocarlon : the output of this task is two tables for Tilesets API, although this might be useful as well for DO. I've only used TempTableTask tasks, and I guess that it's not right.

In addition, it's SLOW. For example, us.hierarchy.USLevelHierarchy(year=2015, current_geography=block, parent_geographies=["block_group", "census_tract", "zcta5", "place", "school_district_elementary", "school_district_secondary", "school_district_unified", "county", "puma", "cbsa", "congressional_district", "state" took more than 8 hours, mostly because of the spatial join:

     |                 INSERT INTO "us.hierarchy"."USLevelHierarchy_block___block_group____2015_2cc12b1390"                             +| 
     |                 (child_id, child_level, parent_id, parent_level, weight)                                                         +| 
     |                                                                                                                                  +| 
     |         SELECT                                                                                                                   +| 
     |             cit.geoid AS child_id,                                                                                               +| 
     |             cit.level AS child_level,                                                                                            +| 
     |             pgt.geoid AS parent_id,                                                                                              +| 
     |             pit.level AS parent_level,                                                                                           +| 
     |             1.0::FLOAT AS weight                                                                                                 +| 
     |         FROM "us.hierarchy"."USLevelInfoFromGeoNames_block_2015_51839c2b6d" cit                                                  +| 
     |         INNER JOIN observatory.obs_06b3cddafd8b3e04b10fc99b06f82f31691adc1f cgt ON cit.geoid = cgt.geoid                         +| 
     |         LEFT JOIN observatory.obs_79854433ead0c56f8eccb0eec5d9c43546c9c0b4 pgt                                                   +| 
     |             ON ST_Within(ST_PointOnSurface(cgt.the_geom), pgt.the_geom)                                                          +| 
     |         LEFT JOIN "us.hierarchy"."USLevelInfoFromGeoNames_state_2015_edb7353cc9" pit ON pgt.geoid = pit.geoid                    +| 
     |                                                                                                                                  +| 
     |                 INNER JOIN "us.hierarchy"."USLevelHierarchy_block___block_group____2015_2cc12b1390" ot ON ot.child_id = cit.geoid+| 
     |                                             AND ot.chi                                                                            |

USLevelHierarchyWeights is also pretty slow. Bottleneck:

      |             UPDATE "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b"                     +| 
      |             SET weight = ST_Area(                                                                               +| 
      |                 ST_Intersection(cgt.the_geom, pgt.the_geom), False)                                             +| 
      |             FROM                                                                                                +| 
      |                 observatory.obs_2ff6a0fbd931c42a5b3a8551c9cf3e81d064aa83 cgt,                                   +| 
      |                 observatory.obs_9ff330b0ce6ba50b58fa12babc86a4a7011e1af2 pgt                                    +| 
      |             WHERE cgt.geoid = "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b".child_id +| 
      |               AND pgt.geoid = "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b".parent_id+| 
      |               AND (child_id, child_level) IN (                                                                  +| 
      |                                                                                                                 +| 
      |         SELECT DISTINCT child_id, child_level                                                                   +| 
      |         FROM "us.hierarchy"."USLevelHierarchy_zcta5___place____schoo_2015_5de4818d0b"                           +| 
      |         WHERE weight = 1                                                                                        +| 
      |         GROUP BY child_id, child_level                                                                          +| 
      |         HAVING count(1) > 1                                                                                     +| 
      |                                                                                                                 +| 
      |         )                                                                                                       +|

I still haven't devoted time to speed things up, but if you see something, let me know.

Well, there will probably be a ton of stuff beyond that :sweat_smile:

Recursive query example for getting the level hierarchy:

WITH RECURSIVE sublevels(child_id, child_level, parent_id, parent_level, weight) AS (
               SELECT child_id, child_level, parent_id, parent_level, weight
               FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24"
               WHERE child_id = '010010201001001'
                   UNION ALL
                   SELECT c.child_id, c.child_level, c.parent_id, c.parent_level, c.weight
                   FROM "us.hierarchy"."USHierarchyChildParentsUnion_2015_8bcc49cd24" c,
                        sublevels p
                   WHERE p.parent_id = c.child_id
                       AND   p.parent_level = c.child_level
               )
select parent_id, parent_level, geoname, weight
FROM sublevels s
inner join "us.hierarchy"."USHierarchyInfoUnion_2015_8bcc49cd24" i
         on i.geoid = s.parent_id and i.level = s.parent_level

output

010010201001    block_group Block Group 1   1
01001020100 census_tract    201 1
36067   zcta5   36067   1
0100240 school_district_unified Autauga County School District  1
01001   county  Autauga 1
0102100 puma    Elmore, Autauga, Montgomery (Outer) & Lowndes Counties PUMA 1
33860   cbsa    Montgomery, AL  1
0102    congressional_district  Congressional District 2    1
01  state   Alabama 1
juanignaciosl commented 6 years ago

OK, :green_apple: , ready for next CR.

2da7813bc240363dc80a94666c834d6aca3f5ccf will probably break some task at hierarchy computation. I had to revert the change because it broke other stuff. I think that I had to add it because "<schema>".Tablename won't work if schema contains dots, but I think that we can move on and add a fix if it fails again.

juanignaciosl commented 6 years ago

Ready for more CRs. The name case commit might have a broad impact, as we've said.

juanignaciosl commented 6 years ago

I've pushed a minor change in the helper script to display elapsed time on ongoing tasks.

antoniocarlon commented 6 years ago

I began with the acceptance yesterday (local) and it hasn't finished yet. I'll retry it this weekend.

As the task begins, I'm receiving a bunch of errors ERROR running ... from the complete method of USLevelHierarchyWeights. These are expected errors but the logging made me look into them, could we avoid logging the query and marking the errors as expected in the logs?

antoniocarlon commented 6 years ago

I have executed make -- docker-run us.hierarchy.USHierarchy --year 2015 locally and it raises Unfulfilled dependencies at run time. Reexecuting it didn't help:

[ERROR]: [pid 1] Worker Worker(salt=151657519, workers=1, host=f9b76b1b349b, username=root, pid=1) failed    tasks.us.hierarchy.USHierarchyChildParentsUnion(year=2015)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 187, in run
    raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependencies at run time: tasks.us.hierarchy.USLevelHierarchyWeights_block___block_group____2015_2cc12b1390, tasks.us.hierarchy.USLevelHierarchyWeights_block_group___census_tract___2015_e7749a32d6
antoniocarlon commented 6 years ago

Executed locally:

This progress looks :) because there were no failed tasks or missing external dependencies

:rocket:

juanignaciosl commented 6 years ago

Back to acceptance to finish some tasks:

I'll ping again when this is verified to be running for the tiler.

antoniocarlon commented 6 years ago

Blocked by the refactor here https://github.com/CartoDB/bigmetadata/pull/586

juanignaciosl commented 6 years ago

Ready for CR and acceptance, together with CartoDB/do_tiler/pull/38.

In local I've generated the new table and used it from the tiler successfully.

Added do not merge yet because this must not be merged into master until a new DO dump is released, because it includes low level changes at ORM level that might break other stuff. As this branch is not needed for DO but for Tilesets API, this is not a problem for CR and acceptance.

javitonino commented 6 years ago

:red_circle:

LOG_LEVEL=DEBUG make -- docker-run us.hierarchy.USDenormalizedHierarchy --year 2015
/usr/local/lib/python3.5/dist-packages/luigi/parameter.py:261: UserWarning: Parameter "year" with value "2015" is not of type string.
  warnings.warn('Parameter "{}" with value "{}" is not of type string.'.format(param_name, param_value))

It looks like we may be passing an integer (from hierarchy tasks) to ShorelineClip (which takes a string).

javitonino commented 6 years ago

Acceptance status:

Hierachy:

Tilesets:

juanignaciosl commented 6 years ago

Data has been deployed to production. This will be kept unmerged until a new DO dump is released.

juanignaciosl commented 5 years ago

This has been stuck for weeks. I'm going to merge it into Add_more_countries_to_MC_tiler and ask for CR and Acceptance on it.