22. Add share of households with broadband access in the home - Githubissues

UI-Research / mobility-from-poverty

https://ui-research.github.io/mobility-from-poverty/


22. Add share of households with broadband access in the home #220

Open awunderground opened 1 year ago

awunderground commented 1 year ago

Pillar: High-Quality Education
Predictor: Digital Access
Metric: Share of households with broadband access in the home

[x] Rename to share_digital_access and share_digital_access_quality
[x] Add 2022, 2019, 2018 data - CORRECTION, was a 5-yr ACS metric for the overall community AND the subgroups, but we are altering this to mirror what we do for the other ACS-based metrics. See note below.
[x] Add income subgroup
[x] Add confidence interval

cdsolari commented 9 months ago

Our label for the digital access metric and our descriptions of it may not align with what we ended up producing. We should double-check how these align and which version of the metric is the most reliable and valuable for our audience. If we keep the current metric values, we should update all the language surrounding them. If we revert to another metric, we might want the version that aligns with what we originally described, in which case we update the data instead of the language.

cdsolari commented 9 months ago

Our investigation found that the quality of this census table data should be fine. Rather than broadband access alone, the metric is actually a combination of having both a computer and broadband access. The 5-yr ACS version for Cook County, as an example, is here: https://data.census.gov/table/ACSDT5Y2021.B28003?q=B28003:%20PRESENCE%20OF%20A%20COMPUTER%20AND%20TYPE%20OF%20INTERNET%20SUBSCRIPTION%20IN%20HOUSEHOLD&g=050XX00US17031. However, this metric was calculated in a bit of a race last round, and we want it to more closely resemble the other ACS-based metrics: for the county or city overall, we use the 1-yr ACS data (https://data.census.gov/table/ACSDT1Y2021.B28003?q=B28003:%20PRESENCE%20OF%20A%20COMPUTER%20AND%20TYPE%20OF%20INTERNET%20SUBSCRIPTION%20IN%20HOUSEHOLD&g=050XX00US17031&tid=ACSDT5Y2019.B28003). That allows us to calculate change for the overall community every year. When we use the 5-yr ACS, we only show every other year because we want at least 2/5 of the data to be new to reflect change. We also realized that the universe for the race/ethnic breakdown is the population in households, while the universe for the overall value is households, meaning that the two are not comparable. We are still digging around, and we are trying to figure out the right income categories and data tables to pull. This is not yet resolved, but I am including notes to highlight where we're at.

cdsolari commented 8 months ago

This is the plan: We are going to stick with using the census tables to pull in our data. For the communities overall (not the subgroups), we are using the 1-yr ACS. The tables report the presence of a computer and the type of internet subscription with the household as the unit of analysis. This will be based on table B28003. Ideally, we would have this information for every year from 2016 to the most recent (2022). We don't want years earlier than 2016 because the wording of the questions changed that year, so the data are not comparable with prior years.

For the race/ethnic subgroups, this will be based on people in households, so a different unit of analysis from the overall (which is households), and it will be based on the 5-yr ACS. We will use the subject matter tables rather than the table we were using because the subject matter tables allow us to get the "All" comparison for the matching unit of analysis (people in households). This will come from table S2802. The categories are the same as before: Black non-Hispanic, Hispanic, Other races and ethnicities, and White non-Hispanic, plus the "All" category corresponding to the total population in households. If this doesn't work (or we don't have time to do this), the back-up plan is to go with the original tables in the program, in which case we just can't have a comparison "All" category. The years for this should be: 2021, 2018, 2016.

For the income subgroups, the unit of analysis differs from the race/ethnic subgroups but matches the overall: households. This will be based on the 5-yr ACS, table B28004. The subgroups should be: less than $50,000, and $50,000 or more. The years for this should be: 2021, 2018, 2016.

Please raise any problems you see or if anything is time consuming. We can trim down the request if needed! Thank you!

cdsolari commented 8 months ago

Census tables show the 90% MOE instead of the 95% MOE. But, we have various forms of lower bounds and upper bounds depending on the metric. I think if we had more time, we'd follow the guidance you found: divide the MOE by 1.645 and then multiply it by 1.96. If doing that means we'd miss your deadline for today, then we should just grab the published MOE.

As an update, tidycensus has code to convert census table MOEs from 90% to 95%. That code is now in there.
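For reference, the manual conversion described above can be sketched in a few lines of Python. This is only an illustration of the divide-by-1.645, multiply-by-1.96 guidance from the comment, not the code used in the repository; the function name `moe_90_to_95` is hypothetical.

```python
# Z-scores quoted in the comment above for 90% and 95% confidence levels.
Z_90 = 1.645
Z_95 = 1.96


def moe_90_to_95(moe_90: float) -> float:
    """Rescale a published 90% margin of error to a 95% margin of error.

    Dividing by Z_90 recovers the standard error; multiplying by Z_95
    widens it to the 95% level.
    """
    return moe_90 / Z_90 * Z_95


# Illustrative input only (not a value taken from a census table).
print(moe_90_to_95(100.0))
```

Because the conversion is a simple rescaling, the 95% MOE is always about 19% wider than the published 90% MOE.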

cdsolari commented 8 months ago

Census table S2802 doesn't have data for 2016 for the race/ethnicity tables. We will skip that for now, but keep years 2018 and 2021. We are prioritizing more recent data. We also explored if the B28009 series of tables could help fill in the gap, but those also aren't available for 2016.

cdsolari commented 8 months ago

To calculate confidence intervals, use the estimate and the 95% MOE (use tidycensus to convert the MOE from 90% to 95%). The estimate is the weighted figure published in the census table for the number of people in households OR the number of households. The MOE will be plus or minus some value in the same unit. To calculate the digital access metric, take the estimate for "with a computer, with a broadband subscription" over the total estimate for that unit of analysis (all people in households OR all households). To get the lower bound, take the estimate with digital access, subtract the value of the MOE, and then take that number over the total estimate for that unit of analysis. Repeat that for the upper bound, but add the value of the MOE to the estimate.

For an example in Cook County using the 2021 5-yr subject tables, the estimate with a computer with broadband for the "All" or total group is 4,640,934. First, the share is 4,640,934 / 5,176,131 = .897 (89.7% of all people in households have a computer with a broadband subscription). The MOE is plus or minus 9,797. So, the lower bound is (4,640,934 - 9,797) = 4,631,137 over the total: 4,631,137 / 5,176,131 = .895. The upper bound is (4,640,934 + 9,797) = 4,650,731 over the total: 4,650,731 / 5,176,131 = .898. In the data dashboard, these will show as an 89.7% estimate with a confidence interval of (89.5%, 89.8%). The values in your CSV files will remain between 0 and 1.
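The Cook County arithmetic above can be reproduced with a short Python sketch. It follows the steps in the comment (the MOE is applied to the numerator only and the total is treated as fixed); the function name `share_with_bounds` is hypothetical and this is not the repository's code.

```python
from typing import Tuple


def share_with_bounds(estimate: float, moe: float, total: float) -> Tuple[float, float, float]:
    """Return (share, lower bound, upper bound) as proportions between 0 and 1.

    estimate: people (or households) with a computer and a broadband subscription
    moe:      margin of error on that estimate, in the same unit
    total:    all people in households (or all households)
    """
    share = estimate / total
    lower = (estimate - moe) / total
    upper = (estimate + moe) / total
    return share, lower, upper


# Cook County figures quoted in the comment (2021 5-yr subject table, "All").
share, lower, upper = share_with_bounds(4_640_934, 9_797, 5_176_131)
print(round(share, 3), round(lower, 3), round(upper, 3))  # 0.897 0.895 0.898
```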

cdsolari commented 8 months ago

To determine data quality: if any of the census table values are suppressed, the metric value should show as "NA" and the data quality should also be "NA." Otherwise, we can use the width of the MOE to determine quality. One suggestion: if the value minus the lower bound is <0.1, the quality is strong (_quality=1); if it is >=0.1 and <0.2, the quality is marginal (_quality=2); and if it is >=0.2, the quality is weak (_quality=3). We might want to look at the distribution for each of the subgroups and the overall value to see if this seems like a good measure. In the end, these cut-offs are somewhat arbitrary, but they seem reasonable.
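The suggested cut-offs can be written as a small Python function. This is a sketch of the rule as proposed in the comment, not the final implementation; the function name `digital_access_quality` is hypothetical, and using `None` to stand in for "NA" is an assumption made for illustration.

```python
from typing import Optional


def digital_access_quality(share: Optional[float], lower: Optional[float]) -> Optional[int]:
    """Map the distance between the share and its lower bound to a quality flag.

    Returns None (standing in for "NA") when either input is missing,
    e.g. because the census table value was suppressed.
    """
    if share is None or lower is None:
        return None
    half_width = share - lower
    if half_width < 0.1:
        return 1  # strong
    if half_width < 0.2:
        return 2  # marginal
    return 3  # weak


# Cook County "All" values from the earlier comment: 0.897 with lower bound 0.895.
print(digital_access_quality(0.897, 0.895))  # 1
```

Values that fall exactly on a cut-off are sensitive to floating-point rounding, so it may be worth rounding the inputs before comparing.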

awunderground commented 8 months ago

Closed with #294

cdsolari commented 8 months ago

Note that the estimates were far too imprecise when we moved to 1-yr ACS tables, so we went back to 5-yr ACS.

malcalakovalski commented 8 months ago

@tinatinc and @cdsolari caught a few discrepancies in this metric. The next commit will address the following issues:

Drop 2016 since it is all missing values
Remove the percentage sign in output share values?
Remove state_name & county_name or place_name from the final files (just want the year, relevant FIPS codes, metric, and data quality)
Change the race-ethnicity labels from "White, Non-Hispanic" and "Black, Non-Hispanic" to "White" and "Black"
Make sure subgroup_type comes before subgroup in the ordering of the variables in the output files
Change the income subgroup labels to: Less than $50,000, and $50,000 or More (because currently it isn't clear which bucket $50k falls into)
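Several of the fixes listed above are mechanical and can be sketched in Python. The snippet below is only an illustration under assumptions: rows are plain dictionaries, the function name `clean_rows` is hypothetical, and any column names beyond those mentioned in the list (`year`, `state_name`, `county_name`, `place_name`, `subgroup_type`, `subgroup`) are placeholders. The percent-sign item is left out because it is still phrased as a question.

```python
from typing import Dict, List

# Columns the list above says to remove from the final files.
DROP_COLUMNS = {"state_name", "county_name", "place_name"}

# Race-ethnicity relabeling from the list above.
RELABEL = {"White, Non-Hispanic": "White", "Black, Non-Hispanic": "Black"}


def clean_rows(rows: List[Dict]) -> List[Dict]:
    """Drop 2016, remove name columns, relabel subgroups, and reorder columns."""
    cleaned = []
    for row in rows:
        if int(row["year"]) == 2016:
            continue  # 2016 is all missing values
        keys = [k for k in row if k not in DROP_COLUMNS]
        # Make sure subgroup_type comes immediately before subgroup.
        if "subgroup_type" in keys and "subgroup" in keys:
            keys.remove("subgroup_type")
            keys.insert(keys.index("subgroup"), "subgroup_type")
        out = {k: row[k] for k in keys}
        if "subgroup" in out:
            out["subgroup"] = RELABEL.get(out["subgroup"], out["subgroup"])
        cleaned.append(out)
    return cleaned
```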