CartoDB / observatory-extension


Weight by denominator for arbitrary measures that are not summable #177

Closed (talos closed 7 years ago)

talos commented 8 years ago

We currently always use area weighting when estimating a measure for an arbitrary geometry, and only allow arbitrary geometries for measures that can be summed.

While it's impossible to recalculate averages or medians across arbitrary geometries without access to the microdata, we could construct reasonable estimates.

A good example of this would be calculating median income for an arbitrary shape that covers most of one small, low-income district with a high population and a smaller share of one large, high-income district with a low population:

| district | households | median income | area (sq km) |
|---|---|---|---|
| A | 100 | $100,000 | 100 |
| B | 1000 | $25,000 | 5 |

Our shape takes 40 sq km from A and 4 sq km from B.

A simple area weighting would scale each district's median only by the fraction of its area inside the shape, yielding (25000 / (5/4)) + (100000 / (100/40)), or $60,000. This is clearly an overestimate: if we estimated median income from household counts alone, ignoring the shape entirely, we would get ((100000 * 100) + (25000 * 1000)) / (1000 + 100), or about $31,818. And since our shape covers a larger share of the small, low-income district (80% of B versus 40% of A), our actual estimate should be even lower than that!

The suggested solution would be to scale those household counts by the area fractions, like this: ((100000 * 100 / (100/40)) + (25000 * 1000 / (5/4))) / ((100 / (100/40)) + (1000 / (5/4))). That yields about $28,571, which sounds like a much more reasonable estimate considering the shape contains an estimated 800 lower-income households and just 40 higher-income ones.
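To make the arithmetic easy to check, here is a minimal Python sketch that reproduces the three estimates discussed above from the table's figures. The dictionary layout and function names are made up for illustration and are not part of the extension.

```python
# Figures from the example table; the data layout is illustrative only.
districts = {
    "A": {"households": 100, "median_income": 100_000, "area": 100, "overlap": 40},
    "B": {"households": 1000, "median_income": 25_000, "area": 5, "overlap": 4},
}


def area_only(districts):
    """Scale each median by the fraction of the district inside the shape."""
    return sum(
        d["median_income"] * d["overlap"] / d["area"] for d in districts.values()
    )


def households_only(districts):
    """Household-weighted mean of the medians, ignoring the shape entirely."""
    num = sum(d["median_income"] * d["households"] for d in districts.values())
    den = sum(d["households"] for d in districts.values())
    return num / den


def households_and_area(districts):
    """Household-weighted mean, with households scaled by the area fraction."""
    num = sum(
        d["median_income"] * d["households"] * d["overlap"] / d["area"]
        for d in districts.values()
    )
    den = sum(d["households"] * d["overlap"] / d["area"] for d in districts.values())
    return num / den


print(round(area_only(districts)))            # 60000
print(round(households_only(districts)))      # 31818
print(round(households_and_area(districts)))  # 28571
```

Dividing by `(area / overlap)` in the issue's notation is the same as multiplying by `overlap / area`, which is how the sketch writes it.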

In more formal terms, I think that looks something like:

SUM(income * households / (district_area / shape_district_intersect_area)) /
SUM(households / (district_area / shape_district_intersect_area))

And more generally:

SUM(measure * weight / (boundary_area / shape_boundary_intersect_area)) /
SUM(weight / (boundary_area / shape_boundary_intersect_area))
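The same general expression can be written as a formula, which may make the structure easier to see. The symbols are my own shorthand, not notation used elsewhere in the project: $m_i$ is the measure for boundary $i$, $w_i$ its weight (for example, households), $A_i$ the boundary's area, and $a_i$ the area of its intersection with the shape.

```latex
\hat{m}
  = \frac{\sum_i m_i \, w_i \, \dfrac{a_i}{A_i}}
         {\sum_i w_i \, \dfrac{a_i}{A_i}}
```

Each term $w_i \, a_i / A_i$ is the estimated share of the weight that falls inside the shape, assuming the weight is spread evenly over the boundary. When the shape fully contains every boundary ($a_i = A_i$), the expression reduces to an ordinary weighted average of the $m_i$.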

To do this, we would need to add another relation type besides denominator that links measures like median income to the sample group they come from -- in this case, households. That relation can then be read when we construct estimates for arbitrary areas.

From a metadata coding perspective, adding this relation is easy. It would be a little more work to add it into the obs_meta table for quick retrieval, but I've been mulling the idea of a JSON column in there with a summary of relations for numer, denom, and geom.

@andrewxhill @stuartlynn @ohasselblad thoughts?

stuartlynn commented 8 years ago

So I like it. I think we can test this pretty easily:

  1. Take a handful of census tracts
  2. Break them up into block groups and calculate the weighted sum as you propose
  3. Compare it to the census tract value.

This won't take into account the overlapping regions, but it will allow us to see whether averages weighted by households / individuals work.
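The check described in this comment could be sketched roughly as follows. Block groups nest completely inside their tract, so the area fraction is 1 for each of them and the estimator reduces to a weight-weighted mean. All numbers below are hypothetical placeholders, not census values, and the function names are invented for illustration.

```python
def weighted_estimate(parts):
    """Estimate a measure from (measure, weight, fraction_inside) tuples."""
    num = sum(m * w * f for m, w, f in parts)
    den = sum(w * f for m, w, f in parts)
    return num / den


def relative_error(estimate, published):
    """Absolute difference between estimate and published value, as a fraction."""
    return abs(estimate - published) / published


# Hypothetical block groups: (median income, households, fraction inside tract).
block_groups = [
    (52_000, 400, 1.0),
    (61_000, 250, 1.0),
    (38_000, 350, 1.0),
]
tract_published = 50_000  # made-up published tract-level value

estimate = weighted_estimate(block_groups)
print(estimate)                                   # 49350.0
print(relative_error(estimate, tract_published))  # roughly 0.013
```

Running this over a handful of real tracts and looking at the spread of `relative_error` would show how far the weighted estimate tends to drift from the published figure.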

talos commented 7 years ago

This is now being done (that's how we generate estimates for average measures).