Drexel-UHC / NETS

FIles use to wrangle National Establishment Time Series datasets for neighborhood public health research
1 stars 0 forks source link

Create NETS2019 Tract Measures #9

Closed sjmelly closed 1 year ago

sjmelly commented 1 year ago

-Use database from #7 -Calculate counts and density per sq km land area for each year 1990-2019 for all base groups -Use base group counts to calculate counts and densities for combocats and thematic constructs -Proposed format for output Long by year

LOCATIONID (RECVD t10_cen_uid_u_2010) YEAR MEASURE* VALUE LOCATIONTYPE [GEOID10] BASEGROUP [for linking to BG_CC_TC_Xwalk to link to combocats and thematic constructs]

*MEASURE name in analytic datasets will use RECVD format T10_net_nnn_c count of category nnn T10_net_nnn_d density of category nnn

@stfran22 I saved an Excel file with variable names following the RECVD convention using basegroups from NETS_BaseGroupFreqs20230530.xlsx and combocats and thematic constructs from NETS_Catalogue_3LTR_Variables_20230616.xlsx

sjmelly commented 1 year ago

Proposed long format agreed to First step base group measures Next measures for combo cats and thematic constructs Hierarchy measures in consultation with @kam642

kam642 commented 1 year ago

@sjmelly Is there something I need to answer about the hierarchy?

sjmelly commented 1 year ago

@stfran22 is working on calculating tract measures for base groups, combo cats and thematic constructs. Should he also be creating measures using hierarchies, or will someone else do that? Is their existing code for this?

sjmelly commented 1 year ago

Output table will be saved to NETS Azure SQL Database

stfran22 commented 1 year ago

Tract measures (counts and densities for all categories) saved to \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\NETS_tr10_measures20230731.txt. Code is in NETS github repository.

Data check was done by comparing subset of dataset from Philadelphia years 2000 & 2010 \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\NETS_tr10_measure_philly20230731.txt to related Pediatric Big Data variables Z:\UHC_Data\NETS_UHC\SAS\panjdemdnetsv5_uhc_den_122018.SAS7BDAT. Code is in NETS github repository and histograms and summary are in updated \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Documentation\NETS Processing Documentation 20230801.docx

sjmelly commented 1 year ago

@stfran22 take a look at scatterplots for Philly NETS tract measures

stfran22 commented 1 year ago

@sjmelly do you want the 2014 data as one axis and 2019 data on another?

example: NETS_usr00scat

sjmelly commented 1 year ago

Yes. The plot above is reassuring. This confirms what we already know that NETS vintages have differences, but the differences are probably not enough to change results in public health research [We don't have to question the validity of Pediatric Big Data results that used the 2014 data]. There's no strong trend in the differences - it doesn't look like 2019 deleted or added enough records to make the slope much different from 1. I would expect that the pattern for different categories will be similar, although I suppose some businesses with higher turnover might have more differences. For MESA using nationwide data the few outliers probably won't make a difference. For Food & Communities in Philadelphiia metro area we may want to explore outliers.

stfran22 commented 1 year ago

I've added scatter plots for AUR, CVP, HSR, and USR for 2000 and 2010 to \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Documentation\NETS Processing Documentation 20230801.docx

sjmelly commented 1 year ago

There are 72,246 2010 tracts with landarea > 0 in conterminous US (in CT_2010_ContUSlandRepairAlbers) in NETS_tr10_measures20230731.txt the number of records per year varies (1990 N = 71796; 2012 N = 72123) presumably because there are tracts with no establishments which vary by year and tracts with no establishments are missing from the dataset. Will this create a problem in analyses - Should this dataset contain records for all 72,246 tracts for every year with zero values if there are no establishments?

stfran22 commented 1 year ago

This makes sense to me. Let me know if you'd like me to change it.

kam642 commented 1 year ago

@sjmelly @stfran22 Saw this come across my feed. It would be a lot easier/clearer for users if the dataset contains all tracts for all years and assign 0s for units where there are no establishments.

stfran22 commented 1 year ago

I'll backfill this dataset by outer merging file with all tracts & years

stfran22 commented 1 year ago

Backfill complete. New file \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\NETS_tr10_measures20230823.txt with 2220030 records (74,001 tracts * 30 years) includes tract-years with no category counts/densities. All count/density columns are float datatypes and no null/nan values are present.

I did a value count of the number of tracts with no category counts/densities per year, and each year has ~2000 tracts. Breakdown below.

Out[12]: 1990 2229 1991 2212 1992 2201 1993 2159 1994 2144 1995 2104 1996 2069 1997 2045 1998 2030 1999 2020 2000 2005 2001 1979 2002 1959 2003 1950 2004 1943 2005 1933 2006 1921 2017 1921 2015 1920 2016 1920 2018 1920 2019 1919 2007 1915 2010 1915 2014 1915 2008 1909 2011 1909 2013 1909 2009 1906 2012 1902

sjmelly commented 1 year ago

74,001 is number of tracts in ACS measures on server - these include tracts outside contiguous US and tracts with zero landarea. So actual number of contiguous US tracts missing NETS is in the hundreds not 2000 (Missing in 1990 N = 450; Missing in 2019 N = 140) for a web map see https://drx.maps.arcgis.com/apps/mapviewer/index.html?webmap=4a0c97660f3b48ca828df3687344a2c9

stfran22 commented 1 year ago

@sjmelly I added a data dictionary to the network drive \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Documentation\BEDDN2019 Tract-Level Data Dictionary.docx. Let me know if you think this is sufficient - if so, I think we can move this issue to "done".

sjmelly commented 1 year ago

NETS_tr10_measures20230823.txt includes AK (FIPS 02), HI (FIPS 15) and PR (FIPS 72) with all zeros. These should be removed, Should tracts with arealand = 0 be included in the dataset (N = 359)? It looks like these all have zero NETS counts and densities but I think this is because when you did the spatial join to tracts you excluded tracts with arealand = 0. Tract 53009990100 has 6 NETS locations; all have ct10_distance > 0 and were assigned to other tracts. I vote for excluding the arealand = 0 tracts from the dataset.

sjmelly commented 1 year ago

I added some comments to BEDDN2019 Tract-Level Data Dictionary.docx and saved it with today's date.

stfran22 commented 1 year ago

@sjmelly This new file does not have non-contiguous tracts and tracts with arealand=0. \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\NETS_tr10_measures20230913.txt

stfran22 commented 1 year ago

@sjmelly I resolved/responded to comments and left a new version of the data dictionary with my initials.

sjmelly commented 1 year ago

@stfran22 the data dictionary looks good, but there are 24 tracts in contiguous US with arealand > 0 that are missing. See attached

missingtr10measures20230913.csv

stfran22 commented 1 year ago

Ah I was using 2019 tract level census data to backfill instead of 2010. New file loading now with today's date \\files.drexel.edu\colleges\SOPH\Shared\UHC\Projects\NETS\Data\NETS2019_Python\NETS_tr10_measures20230914.txt

sjmelly commented 1 year ago

\ResearchAndData\UHC_Data\CensusACS15_19\data\tractchange_10_19.sas7bdat identifies 4 tracts that changed 2010-2019. Do you have information that additional tracts changed?

stfran22 commented 1 year ago

When comparing: 2010 tracts \UHC_Data\Census2010\data\sf1_census2010_tract.sas7bdat to 2019 tracts \UHC_Data\CensusACS15_19\data\census2015_2019_tract.sas7bdat

There are 26 tracts that are in the 2019 dataset but not in the 2010 dataset, and 25 tracts for the reverse.

tracts2010_notin2019.csv tracts2019_notin2010.csv

sjmelly commented 1 year ago

@reynegad We have noticed an inconsistency between the tracts in the 2010 decennial census and the tracts in the ACS measures that use 2010 boundaries. I believe we worked on making sure the ACS measures all have the same number of records and use the same tract IDs, but I guess we never compared them to the decennial census. Do you have any ideas about why there is the discrepancy? Maybe they used different versions of the tracts apportionment vs redistricting vs other?

At this time no further action is needed to investigate this difference. If we put decennial census data on portal we should revisit this issue.

When comparing: 2010 tracts \UHC_Data\Census2010\data\sf1_census2010_tract.sas7bdat to 2019 tracts \UHC_Data\CensusACS15_19\data\census2015_2019_tract.sas7bdat

There are 26 tracts that are in the 2019 dataset but not in the 2010 dataset, and 25 tracts for the reverse.

tracts2010_notin2019.csv tracts2019_notin2010.csv

sjmelly commented 1 year ago

Code for creating tract measures saved to https://github.com/Drexel-UHC/NETS/blob/main/python/measures/get_NETS_tract_measures.py and https://github.com/Drexel-UHC/NETS/blob/main/python/measures/backfill_tract_measures.py