datamade / census_area

:large_blue_diamond: Get Census Data from the API for arbitrary areas
MIT License

Aggregate functions #6

Open fgregg opened 7 years ago

fgregg commented 7 years ago

It would be great if census_area handled the aggregations of census variables correctly.

Prior art

patwater commented 5 years ago

@fgregg hey, we're looking to develop this functionality at ARGO. The key need is to aggregate census statistics like median income correctly for our California water agency partners, which have service area boundaries that don't align nicely with census boundaries. You know the story :)

We have a team of CUSP grad students looking to sprint on this mid December to mid January and would love your thoughts. The plan is a simple fork for the sprint, and then we can PR assuming everything works nicely :)

fgregg commented 5 years ago

Sounds great!

I think I would start by following the Census's guidance on aggregating statistics https://www.census.gov/content/dam/Census/library/publications/2018/acs/acs_general_handbook_2018_ch08.pdf
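For the simplest case in that handbook (combining estimates by summing), the margin-of-error rule is that the MOE of a sum is the square root of the sum of the squared component MOEs. A minimal sketch of that rule (the function name is made up, not part of census_area):

```python
import math

def aggregate_sum_with_moe(estimates, moes):
    """Sum ACS estimates and approximate the combined margin of error.

    Per the ACS handbook, the MOE of a sum of estimates is the square
    root of the sum of the squared component MOEs (this treats the
    component estimates as independent).
    """
    total = sum(estimates)
    total_moe = math.sqrt(sum(moe ** 2 for moe in moes))
    return total, total_moe
```

For example, combining two tract estimates of 1000 +/- 30 and 2500 +/- 40 gives 3500 +/- 50. Medians and other non-additive statistics need different (approximation) methods, which is part of what makes this issue nontrivial.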

It would be very, very nice to make use of the variance data that the census has started to make available. https://www.census.gov/programs-surveys/acs/data/variance-tables.html but that's probably a phase II or phase III project.

I'd also recommend that you develop the aggregation code in separate files from the existing ones, as it may be nice, in the future, to pull the aggregation code into a separate library.

dmarulli commented 5 years ago

Hey @fgregg - I put together an initial project board for our team of students. I will be continuing to update that, but wanted to drop it in this thread for those interested.

I also wanted to run the actual technical approach by you all to increase the probability of things lining up nicely.

So right now it looks like there is a family of .geo_X() methods that can return geojson-like structures with statistics and geometries for lower-level census geographies within higher-level ones, as well as for arbitrary geometries. (Though for sf3, the naming convention changes?)

One approach came to mind that would act pretty independently of the existing codebase, which would allow us to pull things into a separate library if that ends up feeling better. In this approach, one would create a new aggregator function that takes as inputs the statistic and geometry outputs of the .geo_X() methods, along with the type of statistic to aggregate and the geometry to aggregate to. The thinking is that this last piece would be necessary to properly downscale the statistics for the partial edge geometries.

So something like:

```python
def new_aggregator_function(
    list_of_dictionaries_with_statistic_and_geometry,
    type_of_statistic,
    geometry_to_aggregate_to,
):
    # Downscale statistics for census units that only partially
    # overlap the target geometry.
    areally_interpolated_statistics = check_for_edge_geometries_and_downscale_statistics(...)

    # Combine the downscaled statistics according to their type
    # (e.g. sum counts; medians would need special handling).
    aggregated_statistic = aggregate(areally_interpolated_statistics, type_of_statistic)

    return aggregated_statistic
```
Any feedback there?
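The downscaling step in that proposal could be sketched in pure Python like this (all names are hypothetical; it assumes each record already carries the fraction of the census unit that falls inside the target geometry, and that the statistic is a count that can be scaled proportionally):

```python
def downscale_edge_statistics(records, value_key="value", coverage_key="coverage"):
    """Scale count statistics on partially covered census units.

    Areal interpolation assumption: the statistic is distributed
    uniformly over the census unit, so a unit that is 40% inside
    the target geometry contributes 40% of its count.
    """
    downscaled = []
    for record in records:
        scaled = dict(record)
        scaled[value_key] = record[value_key] * record[coverage_key]
        downscaled.append(scaled)
    return downscaled
```

So a fully covered tract with 100 households contributes 100, while a tract that is 40% covered with 50 households contributes 20. The uniform-distribution assumption is the usual caveat with areal interpolation.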


Lastly, on the Census Data API side of things, the table and attribute names do seem cryptic--e.g. B25034_010E. I found this reference, but it still feels pretty dense.

The human-readable table/attribute name --> code direction might be tough, but the other direction doesn't seem too far-fetched, and it would be great if these codes were parsable for the type of statistic. This could help prevent statistical gotchas like trying to aggregate a median like an average. Not sure if you all have thought about this bit; maybe it's for down the road, though. Hopefully explicitly asking the user to provide the type of statistic is a reasonable enough solution for now.
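The structural part of those codes is at least mechanical: a code like B25034_010E is a table ID, a line number, and a suffix where E is the estimate and M is the margin of error. A sketch of parsing that structure (illustrative only; it ignores the annotation suffixes like EA/MA, and knowing whether a line is a median vs. a count still requires the label metadata from the API's variables listing):

```python
import re

# ACS variable codes look like "B25034_010E": a table ID, an
# underscore, a zero-padded line number, and E (estimate) or
# M (margin of error).
VARIABLE_RE = re.compile(r"^(?P<table>[A-Z]+\d+[A-Z]*)_(?P<line>\d+)(?P<kind>[EM])$")

def parse_acs_variable(code):
    match = VARIABLE_RE.match(code)
    if match is None:
        raise ValueError(f"Unrecognized ACS variable code: {code}")
    return {
        "table": match.group("table"),
        "line": int(match.group("line")),
        "kind": "estimate" if match.group("kind") == "E" else "margin_of_error",
    }
```

For example, B25034_010E parses to table B25034, line 10, estimate. That won't tell you the statistic's type on its own, but it would let a library pair each estimate with its margin-of-error column automatically.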

cc: @patwater @christophertull

fgregg commented 5 years ago
  1. Could you tell me a little bit more about what you mean by "necessary to properly downscale the statistics for the partial edge geometries"?

  2. I think it's reasonable to have the user supply the type of aggregation in the first phase. There's a lot that could be done to infer what type of aggregation is appropriate, but that can wait.

fgregg commented 5 years ago

Do you mean that the desired shape can cut across census geographies, and you'll need to figure out what data to apportion?

dmarulli commented 5 years ago

Yep, that's all I meant by that. We see that with California water district boundaries for example.

fgregg commented 5 years ago

Okay, finding the intersections is a fairly expensive operation.

When we do it here:

https://github.com/datamade/census_area/blob/5e62f7d114efd6076916ed6ecffcb7ff76bf4dd6/census_area/core.py#L62-L63

It would probably be a good idea to go ahead and return the proportion of the census tract falling within the target geography, and stuff it into the statistics dictionary.

That coverage proportion is probably what you would be calculating with check_for_edge_geometries_and_downscale_statistics anyway.

If you did it that way, you would only need a sequence of statistics, a sequence of weights, and the type of statistic.
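That reduced interface could be sketched like so (a hypothetical illustration, not the library's API; counts are summed after weighting, mean-type statistics are combined as a weighted average, and medians can't be combined either way--they need an approximation method like the one in the ACS handbook):

```python
def aggregate(statistics, weights, statistic_type):
    """Combine per-unit statistics using per-unit weights.

    statistics     -- one value per census unit
    weights        -- e.g. the coverage proportion, or coverage
                      times population for person-level statistics
    statistic_type -- "count" (weighted sum) or "mean" (weighted average)
    """
    if len(statistics) != len(weights):
        raise ValueError("statistics and weights must be the same length")
    weighted = [s * w for s, w in zip(statistics, weights)]
    if statistic_type == "count":
        return sum(weighted)
    if statistic_type == "mean":
        return sum(weighted) / sum(weights)
    raise ValueError(f"Unsupported statistic type: {statistic_type}")
```

For example, two tracts with 100 and 50 households at coverages 1.0 and 0.4 aggregate to a count of 120, while a mean-type statistic would be averaged using the same weights.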

dmarulli commented 5 years ago

Nice, thanks a lot Forest. I'll look into that.

fgregg commented 5 years ago

Weights are going to be important as, for example, sometimes you'll want to know the size of the associated population. Anyway, I think you have enough to move forward.

fgregg commented 5 years ago

@dmarulli, any updates on your project?

patwater commented 5 years ago

His student team has their kickoff call scheduled for this upcoming Friday 12/21, so probably not.

patwater commented 5 years ago

@fgregg FYI, the functionality to calculate the areal interpolation is getting pretty close, though there's some outstanding refactoring to clean up the student code. See here for the latest: https://github.com/argo-marketplace/census_area/tree/dev_branch

Do you A) have any stylistic preferences on integration to note and B) capacity to help with that integration (bit swamped on our end)? Thanks much!

fgregg commented 5 years ago

Hi @patwater, this looks like it's pretty far from ready to be brought in. There are some nice ideas in here, but

  1. there are many extraneous files
  2. the interface is very different from the current library
  3. the code needs to be split out of the one giant method
  4. it's out of sync with master
  5. there are no tests

I'm sorry to hear that you don't have the bandwidth to work on the integration. Let me know when you do.

patwater commented 5 years ago

Yeah I hear you. Part of working with grad students early in their program... will keep you posted.

christophertull commented 5 years ago

There's some interest in reviving this (also, I want my Hacktoberfest contributions ;).

@fgregg I see your reference to census-data-aggregator above. Would it make sense to use census_area to fetch the data for our census units of interest and then feed that into census-data-aggregator?