Open pfjel7 opened 7 years ago
The graph below provides preliminary diagnostics for identifying additional census tracts with possibly inaccurate residential unit counts. Specifically, it shows the mostly linear relationship between residential units and population counts for each census tract. A number of tracts, however, break that pattern, including the three discussed above. Individual tract IDs are revealed by hovering over the data points in this linked copy of the graph.
@pfjel7, this is a great analysis of this issue. I think we'll want to look at the records of the MAR table itself for each of those three zones to see what's going on. My guess is that the MAR has missing data in the res unit count field. Once we've looked at that, and if this does seem to be the case, we can reach out to the DC government contact who maintains the MAR and see what they say.
The calculations of several rates in the zone_facts table (for building and construction permits already, and eventually also for the percentage of units that are subsidized per zone: see #564 and #574) all depend on the accurate determination of the total number of residential units in each zone.
Unfortunately, our current method for calculating those totals is unreliable. The most glaring illustration of that inaccuracy is census tract 68.04. According to our census data, the tract has a total population of nearly 3,000, yet our current count gives it only 13 residential units: a crazy average of about 227 people per unit. Tracts 2.01 and 62.02 have similarly implausible ratios of people to residential units: 3,685 people to 168 units (about 22 per unit) and 117 people to 6 units (about 20 per unit), respectively.
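For anyone who wants to reproduce the ratios above, here is a minimal sketch of the check. The figures for tracts 2.01 and 62.02 are the ones quoted in this issue; the population for 68.04 is only approximate (back-computed from 13 units at roughly 227 people each), the fourth tract is made up for contrast, and the cutoff of 10 people per unit is an arbitrary assumption, not something we have agreed on.

```python
# People-per-unit check. Tract 2.01 and 62.02 figures come from this issue;
# the 68.04 population is approximate (13 units x ~227 people per unit).
tracts = {
    "68.04": {"population": 2951, "res_units": 13},
    "2.01": {"population": 3685, "res_units": 168},
    "62.02": {"population": 117, "res_units": 6},
    # Hypothetical tract with a more ordinary ratio, for contrast only.
    "hypothetical": {"population": 4000, "res_units": 2000},
}

# Assumed cutoff: anything above this many people per unit is flagged.
MAX_PLAUSIBLE_RATIO = 10.0


def people_per_unit(population, res_units):
    """Return population / units, or infinity when no units are recorded."""
    if res_units <= 0:
        return float("inf")
    return population / res_units


def flag_implausible(tract_data, max_ratio=MAX_PLAUSIBLE_RATIO):
    """Return (tract_id, ratio) pairs above max_ratio, worst first."""
    ratios = [
        (tract_id, people_per_unit(row["population"], row["res_units"]))
        for tract_id, row in tract_data.items()
    ]
    flagged = [(tract_id, ratio) for tract_id, ratio in ratios if ratio > max_ratio]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


flagged = flag_implausible(tracts)
for tract_id, ratio in flagged:
    print(f"tract {tract_id}: {ratio:.1f} people per unit")
```

Running the same function over the full zone_facts data would give a ranked list of tracts to inspect by hand.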
Our current method for calculating the number of residential units per zone is to sum the values of active_res_occupancy_count from the mar table for each property by zone. (See commit hash 7435937 and pull request #556.) This method replaced the one previously used in cama.py. (For initial guidance on developing that method, see issue #493. Please note: although Neal suggested in opening the issue that we sum the values of active_res_unit_count, that column turned out not to have values for properties with single- or owner-occupancy. Hence our use of the occupancy column instead.) While this new method provides credible totals for most of the city, the three under-counted tracts suggest either that the mar table is incomplete or that some property addresses have been assigned to the wrong zones.
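To make the summing step concrete, here is a rough sketch of it in plain Python. It is not the code from commit 7435937: the rows and the `census_tract` key are hypothetical stand-ins for the mar table, and only the `active_res_occupancy_count` column name is taken from the description above. It also counts rows with a missing value per zone, which may help show whether the under-counts come from gaps in the table.

```python
from collections import defaultdict


def units_by_zone(mar_rows, zone_key="census_tract"):
    """Sum active_res_occupancy_count per zone and count rows missing a value.

    mar_rows is a list of dicts standing in for rows of the mar table;
    zone_key is an assumed column name for the zone assignment.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in mar_rows:
        zone = row[zone_key]
        count = row.get("active_res_occupancy_count")
        if count is None:
            missing[zone] += 1
            totals[zone] += 0  # make sure the zone still appears in the totals
        else:
            totals[zone] += count
    return dict(totals), dict(missing)


# Hypothetical rows for illustration only; not real MAR records.
sample_rows = [
    {"census_tract": "A", "active_res_occupancy_count": 2},
    {"census_tract": "A", "active_res_occupancy_count": 3},
    {"census_tract": "B", "active_res_occupancy_count": 1},
    {"census_tract": "B", "active_res_occupancy_count": None},
    {"census_tract": "B"},
]

totals, missing = units_by_zone(sample_rows)
print(totals)
print(missing)
```

A zone with a low total and a high missing count would point toward incomplete data rather than a bad zone assignment.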
To fix this, we need to do one of three things: find the missing data, correct the mistaken zone assignments, or replace the implausible unit counts with estimates modeled from other data (such as population counts, aggregate income, number of transit stops, etc.).
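As a rough illustration of the third option, the sketch below fits an ordinary least-squares line of residential units on population using tracts whose counts look credible, then predicts a unit count for a suspect tract. The training numbers are invented purely to show the mechanics (they are not our data), and a single-predictor linear model is just the simplest assumption; a real model would probably use more of the variables listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical "credible" tracts: (population, residential units).
credible = [(1000, 500), (2000, 1000), (3000, 1500), (4000, 2000)]
populations = [pop for pop, _ in credible]
units = [count for _, count in credible]

slope, intercept = fit_line(populations, units)

# Population of tract 2.01 as quoted in this issue; the estimate is only as
# good as the made-up training data above.
suspect_population = 3685
estimated_units = slope * suspect_population + intercept
print(f"estimated units: {estimated_units:.1f}")
```

Any estimate produced this way should be stored separately from the MAR-derived count so that the two are never confused.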
To help those of you who know more about the city and the available data than I do, I provide here an image that illustrates the degree and location of the greatest disparities.