Open Gastastrophe opened 2 years ago
Since we have no knowledge of the actual distribution of the population in a census block, the error in the point cloud distribution cannot be determined (as far as I know). We will therefore treat the point cloud as a series of constants.
Assigning 2010 Census block data to points on our point cloud amounts to applying n functions f_n(x), one per point, each of which takes the population of a census block as input and multiplies it by the weight for that point. We can therefore apply the formula for propagating error through multiplication by a constant, df_n = |w_n| * dx, where w_n is the weight for the nth point and d is the error operator. More formally, we can consider f_n(x_1, ... , x_k) = <0, ... , w_n, ... , 0> dot <x_1, ... , x_k>, where x_j is the jth of k census blocks and w_n occupies the position aligned with the block containing the nth point. This makes {f_n} a family of functions over the same variables, with error functions df_n = |w_n| * dx_j for the block j containing the nth point. Since j is determined entirely by n for the purposes of this process, we can instead write df_n = |w_n| * dx_n.
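As a minimal sketch of the rule df_n = |w_n| * dx_n, the snippet below scales a block's population error by each point's weight. The function name, the weights, and the block error are hypothetical values chosen for illustration; they are not taken from the project's code or data.

```python
def point_population_error(weight, block_pop_error):
    """Error of f_n(x) = w_n * x for a constant weight: df_n = |w_n| * dx_n."""
    return abs(weight) * block_pop_error


# Hypothetical census block with three points whose weights sum to 1.
weights = [0.5, 0.3, 0.2]
block_pop_error = 10.0  # assumed error in the block population (dx_n)

point_errors = [point_population_error(w, block_pop_error) for w in weights]
```

Because the weights are treated as constants, each point's error is just a scaled copy of the block's error.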
When propagating error in variables of interest being passed to points on the point cloud, we first calculate the error from converting the variables to ratios with the population. Since this is simple division, the error of the function g(a,b) = a/b, where a is a variable of interest and b is the population, is dg = |g(a,b)| sqrt([da/a]^2 + [db/b]^2). Next, we find the error in applying these ratios to points, which is the simple function h(r,p) = r * p, where r is the ratio derived from function g and p is the population at the point derived from function f. The error formula for h is then dh = |h(r,p)| sqrt([dr/r]^2 + [dp/p]^2).
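The two steps above can be sketched as a pair of small functions, one for the ratio g(a,b) = a/b and one for the product h(r,p) = r * p. The function names and the sample numbers are hypothetical, and the sketch assumes, as the formulas do, that the inputs are nonzero and their errors are uncorrelated.

```python
import math


def ratio_error(a, da, b, db):
    """dg for g(a, b) = a / b: |a/b| * sqrt((da/a)^2 + (db/b)^2)."""
    return abs(a / b) * math.sqrt((da / a) ** 2 + (db / b) ** 2)


def product_error(r, dr, p, dp):
    """dh for h(r, p) = r * p: |r*p| * sqrt((dr/r)^2 + (dp/p)^2)."""
    return abs(r * p) * math.sqrt((dr / r) ** 2 + (dp / p) ** 2)


# Hypothetical tract-level variable and population with their errors.
a, da = 200.0, 30.0
b, db = 1000.0, 50.0

# Hypothetical population assigned to one point, with its error.
p, dp = 100.0, 5.0

r = a / b
dr = ratio_error(a, da, b, db)
dh = product_error(r, dr, p, dp)
```

Chaining the two calls mirrors the order in the text: the ratio error is computed first and then fed into the product error for the point.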
Fixing a variable of interest (since this process is independent for each variable), we then get the updated error function for the variable at the nth point: dh_n(x_n,a,b) = |a/b * w_n * x_n| sqrt([sqrt([da/a]^2 + [db/b]^2)]^2 + [dx_n/x_n]^2), which simplifies to |a/b * w_n * x_n| sqrt([da/a]^2 + [db/b]^2 + [dx_n/x_n]^2). As a note, when interpolating population data, there is no reason to transfer data from census tracts since we are already using population counts at the census block level, and hence the error for this variable remains df_n.
For the last step, we sum these points to 2020 census tracts using a family of functions t_m(p_1, ... , p_n) = sum_{i=1}^n (delta_{m,i} p_i), where p_i is the value obtained from h_i, delta_{m,i} = 1 if p_i is in the mth tract, and delta_{m,i} = 0 otherwise. This gives us our final family of error functions for interpolated measurements: dt_m(a, b) = sqrt( sum_{i=1}^n delta_{m,i} |a/b * w_i * x_i|^2 ([da/a]^2 + [db/b]^2 + [dx_i/x_i]^2) )
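The summation step can be sketched as follows: given a per-point error dh_i and the tract each point falls in, the per-tract error is the square root of the sum of squared point errors. The function name, tract labels, and error values are hypothetical, and the sketch treats the point errors as independent, as the quadrature sum above does.

```python
import math
from collections import defaultdict


def tract_errors(point_errors, point_tracts):
    """Combine per-point errors dh_i into per-tract errors dt_m in quadrature.

    point_tracts[i] is the label of the tract containing point i, which
    plays the role of delta_{m,i} in the formula.
    """
    squared_sums = defaultdict(float)
    for err, tract in zip(point_errors, point_tracts):
        squared_sums[tract] += err ** 2
    return {tract: math.sqrt(total) for tract, total in squared_sums.items()}


# Hypothetical point errors and the 2020 tracts that contain the points.
errors = tract_errors([3.0, 4.0, 2.0], ["A", "A", "B"])
```

Grouping by tract label avoids building the full delta_{m,i} matrix while computing the same sums.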
As a note, since margins of error are only published in ACS, ACS data must be used when propagating error. Consequently, since ACS does not report block level data, we treat the census block population as a constant and use the updated error function dt_m(a,b) = sum_{i=1}^n delta_{m,i} * w_i^2 * x_i^2 * |g(a,b)|^2 * ([da/a]^2 + [db/b]^2)
Closing since the propagated error is extremely high. The linked branch will remain open for future development.
Reopening as we attempt to find a different way to propagate error.
There is a square root missing from the final error calculation, so the correct function is in fact
dt_m(a,b) = sqrt[ sum_{i=1}^n delta_{m,i} * w_i^2 * x_i^2 * (a/b)^2 * ([da/a]^2 + [db/b]^2) ]
This was done correctly in the implementation, but was documented incorrectly.
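For reference, here is a minimal sketch of the corrected formula with the square root included. It is not the code from the linked branch; the function name, argument layout, and sample values are assumptions made for illustration, and block populations are treated as constants as described above.

```python
import math


def tract_error_constant_blocks(a, da, b, db, weights, block_pops, in_tract):
    """dt_m(a, b) with block populations x_i treated as constants.

    in_tract[i] is True when point i lies in tract m (delta_{m,i} = 1).
    """
    rel_sq = (da / a) ** 2 + (db / b) ** 2
    total = sum(
        (w * x) ** 2 * (a / b) ** 2 * rel_sq
        for w, x, inside in zip(weights, block_pops, in_tract)
        if inside
    )
    return math.sqrt(total)  # the square root missing from the earlier write-up


# Hypothetical inputs: two points inside tract m, one outside.
dt_m = tract_error_constant_blocks(
    a=200.0, da=30.0, b=1000.0, db=50.0,
    weights=[0.5, 0.5, 1.0],
    block_pops=[100.0, 100.0, 40.0],
    in_tract=[True, True, False],
)
```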
See the updated work in this picture
Since the process involves multiplying by constant weights, adding, and dividing by constant weights, it should be possible to interpolate the error measurements directly to propagate the error.