MIDS-at-Duke / opioid-2023-kml

opioid-2023-kml created by GitHub Classroom

Clean census data #4

Closed katelyn-hucker closed 11 months ago

katelyn-hucker commented 11 months ago

@lisawym I was wondering if you had gotten a chance to find this data yet? Were you able to include total deaths? Thanks.

lisawym commented 11 months ago

I am having trouble obtaining population data from the National Historical Geographic Information System (NHGIS) for 2002-2016. The dataset combines three sources: Vital Statistics covering 2002-2007, the ACS 2009 five-year data covering 2005-2009, and ACS Groups yearly data for 2010-2016. Three issues need to be addressed:

  1. Interpolation for 2008-2009: To fill the gap for 2008-2009, we propose calculating the values based on the available ACS 2009 data covering the period from 2005-2009.

  2. Limited County Coverage for 2010-2016: The ACS Groups data for 2010-2016 includes only counties with populations exceeding 65,000, which leaves a subset of approximately 800 counties each year. That excludes most of the 3,143 counties in the US, so we need to account for this limitation.

  3. Combining ACS and Vital Statistics: The population dataset combines data from ACS and Vital Statistics. It's essential to verify the compatibility and appropriateness of merging these sources, as both contribute significantly to the overall population figures.

To address these challenges, two potential approaches are being considered:

  1. Interpolation Using Census Data: Utilizing census data from 2000-2010 and 2010-2020 to interpolate population values for each year, ensuring a continuous and comprehensive dataset.

  2. Restricting Analysis to Large Counties: Continuing to use ACS and Vital Statistics data but limiting the analysis to counties with populations exceeding 65,000. This approach helps mitigate data gaps but requires careful consideration of the representativeness of the selected counties.
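For the first approach, here is a minimal sketch of what the interpolation could look like. It assumes simple linear growth between two decennial census counts and ignores that the census reference date is April 1. The function name and the population numbers are made up for illustration, not taken from our data.

```python
def interpolate_population(pop_start, pop_end, year_start, year_end):
    """Linearly interpolate yearly population between two census counts.

    Hypothetical helper: assumes the population changes by the same
    amount every year between the two census years.
    """
    span = year_end - year_start
    if span <= 0:
        raise ValueError("year_end must be later than year_start")
    step = (pop_end - pop_start) / span
    # One estimate per year, including both census years
    return {year_start + i: pop_start + step * i for i in range(span + 1)}


# Made-up county: 100,000 people in 2000 and 120,000 in 2010
estimates = interpolate_population(100_000, 120_000, 2000, 2010)
print(estimates[2005])  # 110000.0
```

We would run this once per county for 2000-2010 and again for 2010-2020, which would cover every year in 2002-2016 for all counties.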

In light of the missing values issue in the death data, I wonder if we can measure how much of the drug-related deaths data is missing. If a substantial proportion of counties have missing values, a viable solution could be narrowing the scope of the research to counties with populations exceeding 65,000, which would give a more focused analysis.

lisawym commented 11 months ago

> @lisawym I was wondering if you had gotten a chance to find this data yet? Were you able to include total deaths? Thanks.

Hi @katelyn-hucker, I haven't had the opportunity to examine the total death data yet. I've collected the raw data from NHGIS and pushed it to the 'population_data' branch. Once we decide how to merge the population data, I can work on merging it. Alternatively, I can merge it in two different ways for now so that we have some merged population data and can start some analysis.

Do you have some time to perform some EDA on the missing values in the death data (such as how many unique counties are in the dataset, and how many of the total counties have missing values)? It will be very helpful in our decision-making about choosing population data and handling missing values. Thanks!
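As a rough sketch of the counts I have in mind, assuming the death data can be read as (county FIPS, year, deaths) rows where a missing value is `None`. The function name, the row layout, and the example rows are all hypothetical; the real table may be shaped differently.

```python
def summarize_missing(rows, all_county_fips):
    """Count counties that appear in the data and counties with gaps.

    rows: iterable of (county_fips, year, deaths), deaths may be None.
    all_county_fips: every county we expect to see (assumed reference list).
    """
    rows = list(rows)  # allow generators, since we iterate twice
    expected = set(all_county_fips)
    seen = {fips for fips, _, _ in rows}
    with_missing = {fips for fips, _, deaths in rows if deaths is None}
    absent = expected - seen  # counties with no rows at all
    return {
        "unique_counties": len(seen),
        "counties_with_missing_values": len(with_missing),
        "counties_absent_from_data": len(absent),
        "share_missing_or_absent": len(with_missing | absent) / len(expected),
    }


# Made-up rows for illustration only
example_rows = [
    ("12001", 2005, 14),
    ("12001", 2006, None),
    ("12003", 2005, 11),
]
summary = summarize_missing(example_rows, ["12001", "12003", "12005"])
print(summary)
```

In this toy example two of the three expected counties are either missing a value or absent entirely, so the share is 2/3.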

I am working on merging the population data, and also on another issue to find the best way to merge data with county information. I am not sure I have enough time to get the total population for now. Can we both do some more exploration, and then discuss again how we should treat the missing values in the death tables? Thanks for your patience!

katelyn-hucker commented 11 months ago

Hi @lisawym, this all sounds great. I will look at the things you just mentioned. You can see some of the initial EDA on my branch. I was waiting to open a pull request until after the missing data fix. I am OK with just looking at larger counties; however, we would have to merge total population in sooner rather than later so there is less missing data to fix.

katelyn-hucker commented 11 months ago

Also, did you see my note about the control states for Florida? Might be a good idea to pull their population data before you get too far into it @lisawym