GeoDaCenter / opioid-policy-scan

The Opioid Environment Policy Scan provides access to data at multiple spatial scales to help characterize the multi-dimensional risk environment impacting opioid use in justice populations across the United States.

Fill out remaining 1980, 1990, and 2000 DS01 variables #51

Closed bucketteOfIvy closed 1 year ago

bucketteOfIvy commented 1 year ago

The variables present in the DS01 historic data do not match the variables listed in the DS01 data tables documentation. This seems to be because Social Explorer's "Historic Census Data on 2010 Census Tracts" datasets do not include the counts that the documented DS01 variables require, likely because the historic censuses did not aggregate their data directly into the relevant categories. Because our historic DS01 data files are based on Social Explorer's data, we are missing those categories as well.

However, there does seem to be a workaround for some of the data. The historical censuses seem to have released disaggregated tract-level race, ethnicity, age, and educational attainment data from which most of the missing data can be reconstructed. I'm currently planning to download this data from IPUMS NHGIS and then crosswalk it to 2010 census tracts using weights from the Longitudinal Tract Database, but I have a few open questions about data comparability that I wanted to track here and that will need to be answered before these changes are merged. Namely:

  1. The 1980 Census seems to have asked if respondents were of Spanish origin as opposed to Hispanic origin, which they started doing in 1990. Is it sufficient to simply note this discrepancy in documentation, or is there research indicating that the difference in wording heavily changed how respondents interpreted the question? (I don't currently believe this is the case, but it's probably wise to double check anyhow).
  2. The 1980 Census also reports "Years of School Completed" with categories such as "High School: 1-3 years" and "High School: 4 years," whereas later censuses report "Educational Attainment" with categories such as "9th to 12th grade, no diploma" and "High School graduate (includes equivalency)." At minimum, this means that any estimate of the percent of the population with less than a high school diploma (for the noHSP variable) will exclude GEDs for the 1980 population but not from 1990 on. Are these sufficiently different that the 1980 Census education variable should be renamed or treated differently, or is it sufficient to just note this discrepancy in the documentation?
  3. Due to "major differences between the disability questions," the US Census Bureau advises against comparing disability data between the Censuses taken prior to 1990 and the 2000 Census. As an example of these discrepancies, disability data collected in the 1980 and 1990 Censuses only consider the civilian noninstitutionalized population 16 years of age and older, whereas the 2000 Census considers the civilian noninstitutionalized population 5 years of age and older. It is probably desirable to make the differences between the 1980/1990 and 2000 disability data as apparent as possible for end users. To that end, do we want to split the 1980 and 1990 disability data into a separate variable to reflect this difference in collection methodology?
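
For reference, the crosswalk step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used in this repository: the function name, the tract IDs, and the weight values are all hypothetical, and it assumes each weight is the share of a source tract's count assigned to a 2010 tract. The Longitudinal Tract Database's actual file layout and column names are not reproduced here.

```python
from collections import defaultdict


def crosswalk_counts(source_counts, weights):
    """Reallocate counts from historic tracts to 2010 tracts.

    source_counts: {source_tract_id: count} for one variable.
    weights: iterable of (source_tract_id, target_tract_id, weight), where
        weight is the (assumed) share of the source tract's count that
        belongs to the target tract.
    """
    target_counts = defaultdict(float)
    for source_id, target_id, weight in weights:
        # Source tracts with no data contribute nothing.
        target_counts[target_id] += source_counts.get(source_id, 0.0) * weight
    return dict(target_counts)


# Hypothetical example: tract A is split between X and Y; tract B maps to Y.
counts_1990 = {"A": 100.0, "B": 40.0}
weights = [("A", "X", 0.75), ("A", "Y", 0.25), ("B", "Y", 1.0)]

result = crosswalk_counts(counts_1990, weights)
print(result)  # {'X': 75.0, 'Y': 65.0}
```

One design note: under this approach, percentage variables such as noHSP would be recomputed from the crosswalked numerator and denominator counts rather than crosswalked directly, since weighting a percentage as if it were a count does not preserve its meaning.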
bucketteOfIvy commented 1 year ago

I just committed and pushed the updated datasets and the files that generate them to my fork of this repository, so now is probably a good time to give an update on the direction I've taken on these questions so far.

  1. Looking at the historic census questions, the wording of the Hispanic origin question appears to be fairly consistent across these three decades, so I went ahead and used the data for all three.
  2. Comparing the 1980 and 1990 data shows that they are rather similar (closer inspection shows that the 1980 data depict generally lower educational attainment than the 1990 data, which makes sense). A map showing the absolute difference between them at the tract level is depicted below. [Image: absolute diffs]
  3. For now, I'm excluding the 1980 and 1990 disability data due to the 2000 Census guidance.

The last update for this specific issue is that the 1980 county-level data (which is not included in this commit) does not have a clear pathway to interpolation. Population-weighted interpolation from 1980 geographies to 2010 geographies is made challenging by the lack of nationwide census tracts in 1980. Areal interpolation is likely inappropriate because some new counties may be cities that have split off from their county, and such cities would be expected to house a share of the population disproportionate to their land area. I plan to poke around for other options.
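
To make the concern about areal interpolation concrete, here is a small sketch with entirely made-up numbers. It assumes a hypothetical 1980 county of 1,000 people from which a city later split off, with the city covering 5% of the land area but holding 60% of the population. None of these figures come from the data.

```python
def allocate(total, shares):
    """Split a total across target units in proportion to the given shares."""
    return {name: total * share for name, share in shares.items()}


# Hypothetical 1980 county total and hypothetical shares for the 2010 units.
county_pop_1980 = 1000.0
area_shares = {"independent_city": 0.05, "remaining_county": 0.95}
pop_shares = {"independent_city": 0.60, "remaining_county": 0.40}

areal = allocate(county_pop_1980, area_shares)        # weights by land area
pop_weighted = allocate(county_pop_1980, pop_shares)  # weights by population

for unit in area_shares:
    print(unit, round(areal[unit]), round(pop_weighted[unit]))
```

In this toy case areal weighting assigns the city roughly a twelfth of the population that population weighting does, which is the kind of distortion described above. The population shares are exactly what is hard to obtain for 1980 without nationwide tracts.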