@fgregg hey, we're looking to develop this functionality at ARGO. The key need is to aggregate census statistics like median income correctly for our California water agency partners, which have service area boundaries that don't align nicely with census boundaries. You know the story :)
We have a team of CUSP grad students looking to sprint on this mid-December to mid-January and would love your thoughts. The plan is a simple fork for the sprint and then a PR assuming everything works nicely :)
Sounds great!
I think I would start by following the Census's guidance on aggregating statistics https://www.census.gov/content/dam/Census/library/publications/2018/acs/acs_general_handbook_2018_ch08.pdf
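For straightforward sums of counts, that chapter's approximation boils down to: add the estimates, and take the square root of the sum of the squared MOEs as the combined margin of error. A minimal sketch of that (the function name is just illustrative):

```python
import math

def approximate_sum(estimate_moe_pairs):
    """Aggregate (estimate, MOE) pairs per the ACS handbook guidance for sums:
    the combined estimate is the sum of estimates, and the combined MOE is
    approximated by the square root of the sum of squared MOEs."""
    estimate = sum(est for est, _ in estimate_moe_pairs)
    moe = math.sqrt(sum(m ** 2 for _, m in estimate_moe_pairs))
    return estimate, moe

# e.g. rolling two tract-level counts up to a service area
print(approximate_sum([(1000, 85), (2500, 120)]))
```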
It would be very, very nice to make use of the variance data that the census has started to make available. https://www.census.gov/programs-surveys/acs/data/variance-tables.html but that's probably a phase II or phase III project.
I'd also recommend that you develop the aggregation code in separate files from the existing ones, as it may be nice, in the future, to pull the aggregation code into a separate library.
Hey @fgregg - I put together an initial project board for our team of students. I will be continuing to update that, but wanted to drop it in this thread for those interested.
I also wanted to run the actual technical approach by you all to increase the probability of things lining up nicely.
So right now it looks like there is a family of .geo_X() methods that can return geojson-like structures with statistics and geometries for lower-level census geographies within higher-level ones, as well as for arbitrary geometries. (Though for sf3, the naming convention changes?)
One approach came to mind that would act pretty independently of the existing codebase, which would allow us to pull things into a separate library if that ends up feeling better. In this approach, one would create a new aggregator function that takes as inputs the statistic and geometry outputs of the .geo_X() methods, along with the type of statistic to aggregate and the geometry to aggregate to; the thinking is that this last piece would be necessary to properly downscale the statistics for the partial edge geometries.
So something like:
```python
def new_aggregator_function(
    list_of_dictionaries_with_statistic_and_geometry,
    type_of_statistic,
    geometry_to_aggregate_to,
):
    # downscale statistics for census geographies only partially covered
    # by the target geometry
    areally_interpolated_statistics = check_for_edge_geometries_and_downscale_statistics(...)
    aggregated_statistic = aggregate(areally_interpolated_statistics, type_of_statistic)
    return aggregated_statistic
```
Any feedback there?
Lastly, on the Census Data API side of things, the table and attribute names do seem cryptic, e.g. B25034_010E. I found this reference, but it still feels pretty dense.
The human-readable table/attribute name --> code direction might be tough, but the other direction doesn't seem too far-fetched, and it would really be great if these codes were parsable for the type of statistic. That could be used to help prevent statistical gotchas like trying to aggregate a median as if it were an average. Not sure if you all have thought about this bit; it may be for down the road. Hopefully explicitly asking the user to provide the type of statistic is a reasonable enough solution for now.
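For the code-to-human-readable direction, one idea might be to pull the Census API's variable metadata; a rough sketch (the endpoint shape and JSON keys here are my assumptions, worth verifying):

```python
import requests

# Hypothetical helper: look up what a cryptic ACS variable code means using
# the Census API's variable metadata. The dataset/year defaults and the
# "label" / "concept" keys are assumptions to check against the real response.
def describe_variable(code, year=2017, dataset="acs/acs5"):
    url = f"https://api.census.gov/data/{year}/{dataset}/variables.json"
    variables = requests.get(url).json()["variables"]
    meta = variables.get(code, {})
    return meta.get("label"), meta.get("concept")

print(describe_variable("B25034_010E"))
```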
cc: @patwater @christophertull
Could you tell me a little bit more about what you mean by "necessary to properly downscale the statistics for the partial edge geometries"?
I think it's reasonable to have the user supply the type of aggregation in the first phase. There's a lot that could be done to infer what type of aggregation is appropriate, but that can wait.
Do you mean that the desired shape can cut across census geographies, and you'll need to figure out what data to apportion?
Yep, that's all I meant by that. We see that with California water district boundaries for example.
Okay, finding the intersections is a fairly expensive operation.
When we do it here:
It would probably be a good idea to go ahead and return the proportion of the census tract falling within the target geography, and stuff it into the statistics dictionary.
That coverage proportion is probably what you would be calculating with check_for_edge_geometries_and_downscale_statistics anyway.
If you did it that way, you would only need a "sequence of statistics", a "sequence of weights", and the "type of statistic".
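If it helps, a minimal sketch of that split (shapely is assumed here for the geometry math, and the function names and statistic types handled are just examples):

```python
from shapely.geometry import shape

def coverage_proportion(census_feature, target_geometry):
    # Fraction of the census geography's area falling within the target
    # geography -- the weight to stuff into the statistics dictionary.
    census_geom = shape(census_feature["geometry"])
    return census_geom.intersection(target_geometry).area / census_geom.area

def aggregate(statistics, weights, type_of_statistic):
    # Toy aggregation over pre-computed coverage weights.
    if type_of_statistic == "count":
        # Counts can be apportioned by areal coverage and summed.
        return sum(s * w for s, w in zip(statistics, weights))
    if type_of_statistic == "mean":
        # Means need a weighted average; in practice the weight should also
        # reflect population size, not just areal coverage.
        return sum(s * w for s, w in zip(statistics, weights)) / sum(weights)
    raise ValueError(f"unsupported statistic type: {type_of_statistic}")
```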
Nice, thanks a lot Forest. I'll look into that.
Weights are going to be important as, for example, sometimes you'll want to know the size of the associated population. Anyway, I think you have enough to move forward.
@dmarulli, any updates on your project?
His student team has their kickoff call scheduled for this upcoming Friday 12/21, so probably not.
@fgregg FYI the functionality to calculate the areal interpolation is getting pretty close, though there's some outstanding refactoring to clean up the student code. See here for the latest: https://github.com/argo-marketplace/census_area/tree/dev_branch
Do you A) have any stylistic preferences on integration to note and B) capacity to help with that integration (bit swamped on our end)? Thanks much!
Hi @patwater, this looks like it's pretty far from ready to be brought in. There are some nice ideas in here, but
I'm sorry to hear that you don't have the bandwidth to work on the integration. Let me know when you do.
Yeah I hear you. Part of working with grad students early in their program... will keep you posted.
Some interest in reviving this here (also, I want my Hacktoberfest contributions ;).
@fgregg I see your reference to census-data-aggregator above. Would it make sense to use census_area to fetch the data for our census units of interest and then feed that into census-data-aggregator?
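Roughly what I'm picturing, to make the question concrete (the census_area call shape is my guess from the .geo_X() discussion above, and I'm assuming census-data-aggregator's approximate_sum takes (estimate, MOE) tuples):

```python
import census_data_aggregator
from census_area import Census

c = Census("MY_API_KEY")
service_area_geojson = {"type": "Polygon", "coordinates": [...]}  # district boundary

# Assumed call shape: a .geo_X()-style method yielding per-tract results for
# tracts overlapping an arbitrary geometry -- worth checking against the
# current census_area API before relying on it.
tracts = c.acs5.geo_tract(
    ("NAME", "B25034_010E", "B25034_010M"), service_area_geojson
)

# Pair each tract's estimate with its margin of error (the *E / *M columns)
pairs = [
    (int(row["B25034_010E"]), float(row["B25034_010M"]))
    for _, row in tracts
]

estimate, moe = census_data_aggregator.approximate_sum(*pairs)
print(estimate, moe)
```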
It would be great if census_area handled the aggregations of census variables correctly.
Prior art