Iss220 -- broadband access

malcalakovalski commented 4 months ago

This metric shows the share of households (or people in households) that have a computer and some type of broadband internet subscription. The unit of analysis for the overall metric and the income subgroups is households, whereas the unit of analysis for the race/ethnicity subgroups is people in households.

This metric is calculated using digital_access.qmd in the 08_education folder, which pulls 1-year ACS data for the overall metric and 5-year ACS data for the subgroups.

Note: While this branch is ready for review, it is still missing confidence interval calculations for the metric. Therefore, there aren't columns with _lb and _ub suffixes in the final data yet. @cdsolari will provide guidance on how to proceed.

malcalakovalski commented 3 months ago

Confidence intervals and updated data quality flags have been added
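
For readers wondering what the _lb and _ub columns could contain, here is a minimal sketch. It assumes the bounds come from the ACS margins of error (published at the 90% level) using the Census Bureau's standard approximation for a proportion. The thread does not say which method was actually adopted, and the function name and the share_lb / share_ub column names are hypothetical.

```r
# Hypothetical helper, not the code from digital_access.qmd.
# num / denom are ACS estimates; moe_num / moe_denom are their published MOEs.
share_with_bounds <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  # If the radicand is negative, fall back to the ratio formula (plus sign).
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  moe <- sqrt(radicand) / denom
  data.frame(
    share    = p,
    share_lb = pmax(p - moe, 0),  # keep the bounds inside [0, 1]
    share_ub = pmin(p + moe, 1)
  )
}

# Made-up inputs: 800 of 1,000 households, with MOEs of 50 and 60.
res <- share_with_bounds(800, 1000, 50, 60)
```

If tidycensus is already loaded, its moe_prop() helper covers the same margin-of-error calculation.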

tinatinc commented 3 months ago

OK everything else in the code looks good! Except you have an issue in your subgroup output files: for both "digital_access_county_subgroup_all.csv" and "digital_access_city_subgroup_all.csv", the subgroup_type variable for the income subgroups reads as "NA" -- should read as "income". Also you should have a third observation paired with each income breakdown (like you do with race) that has the "all" value.
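
The two problems described above can be caught mechanically. This is a rough sketch on a made-up one-county example; the id columns, the "All" label, and the function name are assumptions, not the layout of the real output files.

```r
# Hypothetical checker: counts NA subgroup_type values and groups lacking an "All" row.
check_subgroups <- function(df, id_cols, all_label = "All") {
  type <- df$subgroup_type
  n_na_type <- sum(is.na(type))
  # Treat NA as its own group so those rows are still inspected.
  type[is.na(type)] <- "<missing>"
  key <- interaction(c(df[id_cols], list(type)), drop = TRUE)
  has_all <- tapply(df$subgroup == all_label, key, any)
  list(n_na_type = n_na_type, groups_missing_all = sum(!has_all))
}

ids <- c("year", "state", "county")

# Toy rows reproducing the reported issue: income rows with NA type and no "All" row.
toy <- data.frame(
  year = 2021, state = "01", county = "001",
  subgroup_type = c("race-ethnicity", "race-ethnicity", "race-ethnicity", NA, NA),
  subgroup = c("All", "Black", "White", "Less than $50,000", "$50,000 or More"),
  stringsAsFactors = FALSE
)
before <- check_subgroups(toy, ids)

# Apply both fixes: label the type and add the paired "All" row.
fixed <- toy
fixed$subgroup_type[is.na(fixed$subgroup_type)] <- "income"
fixed <- rbind(fixed, data.frame(
  year = 2021, state = "01", county = "001",
  subgroup_type = "income", subgroup = "All",
  stringsAsFactors = FALSE
))
after <- check_subgroups(fixed, ids)
```

Both counts should be zero once the fixes are in place.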

I also made a note in the documentation that you seem to have included 2020 data, so we should note that there.

tinatinc commented 3 months ago

Hey @malcalakovalski - re-reviewed and things look good. I can't complete review until census API is running again, so waiting on that. Just noting -- a lot of missings in the non-subgroup files, and I'm not seeing a count of missings for each output file in the code. Am I right that, for example, for the non-subgroup overall county-level data, we only have this metric for like 25% of counties?
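
A count of missings per output file could look something like the sketch below. The metric column name and the toy values are made up for illustration; the real files would be read in first.

```r
# Hypothetical helper: rows, missing values, and share missing by year.
count_missing_by_year <- function(df, metric_col) {
  miss <- is.na(df[[metric_col]])
  data.frame(
    year          = sort(unique(df$year)),
    n_rows        = as.vector(tapply(miss, df$year, length)),
    n_missing     = as.vector(tapply(miss, df$year, sum)),
    share_missing = as.vector(tapply(miss, df$year, mean))
  )
}

# Toy data standing in for one output file.
toy <- data.frame(
  year  = c(2021, 2021, 2021, 2021, 2022, 2022),
  share = c(0.8, NA, NA, NA, 0.9, NA)
)
missing_summary <- count_missing_by_year(toy, "share")
```

Printing missing_summary at the end of each chunk that writes a file would make gaps like the one described here visible during review.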

malcalakovalski commented 3 months ago

Hey @tinatinc. It looks like the reason we have a large number of missing values in the non-subgroup county files is that we switched from 5-year to 1-year ACS data. I found that there were only 4,969 observations right after pulling the county-level overall 1-year ACS data with tidycensus. You can check this by running the chunk labelled "pull-overall-communities". This translates to between 820 and 837 observations per year.

By comparison, running:

tidycensus::get_acs(geography = "county",
                    variables = "B28003_004",
                    year = 2021,
                    survey = "acs5")

gives me 3,221 observations.

So the good news is that nothing in the code is inadvertently creating these NAs or removing observations we want. However, we may want to revisit using the 1-year ACS given the significant increase in missing values. @cdsolari how should we proceed?

tinatinc commented 3 months ago

Thank you for checking on this Manu!! OK, I spoke with Claudia about this -- because the coverage is so low (like 1/4 of all counties), we think it's probably better to switch back to 5-year estimates for the overall values as well. Is this something you can do?
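
The "1/4 of all counties" figure can be checked with quick arithmetic from the numbers in this thread. The six-year span is an assumption (2016-2019, 2021, 2022), and 3143 is the county count quoted later in the thread.

```r
n_obs_1yr  <- 4969  # rows from the 1-year county pull, all years combined
n_years    <- 6     # assumption: 2016-2019, 2021, 2022
n_counties <- 3143  # county count quoted later in this thread

avg_per_year <- n_obs_1yr / n_years       # about 828 counties per year
coverage     <- avg_per_year / n_counties # about 0.26, i.e. roughly a quarter
```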

malcalakovalski commented 3 months ago

Absolutely! I'll have this done in the next hour

malcalakovalski commented 3 months ago

@tinatinc I switched to the 5-year ACS for overall and it greatly improved data completeness, especially for more recent years.

tinatinc commented 3 months ago

Yep, looks great Manu! I think this is the right approach for now. Just noting one last thing: the counts in the final files are slightly off. We have 6 years (2016-2022 without 2020), 3143 counties, and 486 cities. So the county subgroup file should have 94,290 observations, but only has 94,280 (10 fewer). The city subgroup file should have 56,574 observations, but only has 56,569 (5 fewer).

Same with the overall figures -- years 2016-2019 seem to have 3142 counties, and 2021 and 2022 have 3144. 2016 and 2017 have 485 cities, but 2018-2022 have 486. These are probably where the subgroup discrepancies are coming from, too. I don't really see an error or random drops in the code. Do you already know what's going on there?
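
One way to track down this kind of drift is to count distinct geographies per year and then list the ones that do not appear in every year. The sketch below uses made-up rows; the state and county column names and both function names are assumptions.

```r
# Hypothetical helper: number of distinct geographies in each year.
units_per_year <- function(df, id_cols) {
  id <- do.call(paste, c(df[id_cols], sep = "-"))
  counts <- tapply(id, df$year, function(x) length(unique(x)))
  data.frame(year = as.integer(names(counts)), n_units = as.vector(counts))
}

# Hypothetical helper: geographies absent from at least one year.
units_not_in_all_years <- function(df, id_cols) {
  id <- do.call(paste, c(df[id_cols], sep = "-"))
  n_years_present <- tapply(df$year, id, function(x) length(unique(x)))
  names(n_years_present)[n_years_present < length(unique(df$year))]
}

# Toy data: county 005 only shows up in 2021.
toy <- data.frame(
  year   = c(2019, 2019, 2021, 2021, 2021),
  state  = "01",
  county = c("001", "003", "001", "003", "005"),
  stringsAsFactors = FALSE
)
per_year <- units_per_year(toy, c("state", "county"))
odd_ones <- units_not_in_all_years(toy, c("state", "county"))
```

Run against the real files, the second function would name the specific counties or places behind the differing yearly counts.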

tinatinc commented 3 months ago

Note for posterity: all inconsistencies in observation counts are due to the crosswalk files, so we are leaving them as is for now.

malcalakovalski commented 3 months ago

@tinatinc and @cdsolari caught a few discrepancies in this metric. The next commit will address the following issues:

Drop 2016 since it is all missing values
Remove the percentage sign in output share values?
Remove state_name & county_name or place_name from the final files (just want the year, relevant FIPS codes, metric, and data quality)
Change the race-ethnicity labels from "White, Non-Hispanic" and "Black, Non-Hispanic" to "White" and "Black"
Make sure subgroup_type comes before subgroup in the ordering of the variables in the output files
Change the income subgroup labels to: Less than $50,000, and $50,000 or More (because currently it isn't clear which bucket $50k falls into)
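
Several of the items above could be handled in one cleanup step. This is only a sketch on made-up rows, not the code from the commit: the share column name is hypothetical, it assumes the percent sign is stored as text in the share values, and the income relabelling is left out because the current labels are not shown in this thread.

```r
# Hypothetical cleanup function covering the checklist items it can.
clean_subgroup_output <- function(df, share_col = "share") {
  # Drop 2016, which is reported above as all missing values.
  df <- df[df$year != 2016, , drop = FALSE]
  # Strip the percent sign (assumes shares are stored as text like "85.2%").
  df[[share_col]] <- as.numeric(sub("%", "", as.character(df[[share_col]]), fixed = TRUE))
  # Remove the name columns if present.
  df <- df[, setdiff(names(df), c("state_name", "county_name", "place_name")), drop = FALSE]
  # Shorten the race-ethnicity labels.
  relabel <- c("White, Non-Hispanic" = "White", "Black, Non-Hispanic" = "Black")
  hit <- df$subgroup %in% names(relabel)
  df$subgroup[hit] <- unname(relabel[df$subgroup[hit]])
  # Put subgroup_type immediately before subgroup.
  cols <- setdiff(names(df), "subgroup_type")
  cols <- append(cols, "subgroup_type", after = match("subgroup", cols) - 1)
  df[, cols, drop = FALSE]
}

# Toy input with made-up values.
toy <- data.frame(
  year = c(2016, 2021, 2021), state = "01", county = "001",
  state_name = "Alabama", county_name = "Autauga County",
  subgroup = c("All", "White, Non-Hispanic", "Black, Non-Hispanic"),
  subgroup_type = "race-ethnicity",
  share = c(NA, "85.2%", "78.9%"),
  stringsAsFactors = FALSE
)
out <- clean_subgroup_output(toy)
```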

UI-Research / mobility-from-poverty

Iss220 -- broadband access #294