Closed AmandaDoyle closed 1 year ago
reviewing source data in Sharepoint for completeness
EDDE_2023_Updates_transportation.xlsx
access_subway_and_access_ADA.py
ADA_Subway_Qtr_Mile_Access sheet: missing total population counts by PUMA (total_pop_from_census_2020) and missing the population within 1/4 mile of ADA subway stations (pop_within_1/4_mile_of_ada_subway_stations), which we need in order to aggregate to larger geographies
Subway_SBS_Qr_Mile_Access sheet: missing total population counts by PUMA (total_pop_from_census_2020) and missing population within 1/4 mile (pop_within_1/4_mile_of_subway_stations_and_sbs_stops)
access_to_openspace.py
Park_Access sheet: missing total population counts by PUMA (Total_Pop20) and missing population served (Pop_Served)
reviewing source data in Sharepoint for completeness
Equitable Development Data Tool - NYCHA - 1-27-2023 Final
nycha_tenants.py
reviewing source data in Sharepoint for completeness
EDDE - Math and ELA & Grad - 2022.csv
education_outcome.py
reviewing source data in Sharepoint for completeness
DOHMH_diabetes and self reported health
diabetes_self_report.py
DCHP_Diabetes_SelfRepHealth sheet: boroughs and citywide aggregations have moved to first rows, combines Diabetes and Self Report Health into one sheet
reviewing source data in Sharepoint for completeness
DOHMH_death rate and overdose.xlsx
health_mortality.py
reviewing source data in Sharepoint for completeness
COVID Deaths by race and PUMA_20230109.xlsx
covid_death.py
Sheet 1: different number of pre-header rows with text
I think we can rename the new source data to match the name of the indicator, to avoid unnecessary confusion between the source data name and the name of the ingestion script for that specific indicator, e.g. NTA_data_prepared_for_ArcMap_wCodebook.xlsx -> education_outcome_source_data.xlsx
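On why the transportation and park access sheets need the raw counts: a share for a larger geography has to be recomputed from summed numerators and denominators, since averaging PUMA-level percentages ignores differences in PUMA population. A minimal sketch of that, assuming each row carries a PUMA, a borough, a total population and a population-within-range count. The field names (`puma`, `borough`, `total_pop`, `pop_within`) and all numbers are made up for illustration and are not taken from the spreadsheets:

```python
from collections import defaultdict

# Hypothetical rows; values are illustrative only, not real EDDE data.
rows = [
    {"puma": "A", "borough": "X", "total_pop": 100_000, "pop_within": 50_000},
    {"puma": "B", "borough": "X", "total_pop": 300_000, "pop_within": 30_000},
]


def aggregate_access(rows, geography="borough"):
    """Sum counts per geography, then derive the percent from the sums."""
    totals = defaultdict(lambda: {"total_pop": 0, "pop_within": 0})
    for row in rows:
        bucket = totals[row[geography]]
        bucket["total_pop"] += row["total_pop"]
        bucket["pop_within"] += row["pop_within"]
    return {
        geo: {**counts, "pct_within": 100 * counts["pop_within"] / counts["total_pop"]}
        for geo, counts in totals.items()
    }


result = aggregate_access(rows)
# Borough X: 80,000 of 400,000 residents, i.e. 20%.
# A plain mean of the two PUMA percentages (50% and 10%) would give 30%.
```

Without the two count columns only the PUMA-level percentages are available, and the borough and citywide figures can't be derived correctly from those alone.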
Notes on improvements:
nycha_tenants in the new subfolder. We can properly archive data via data-library and not have to do manual archiving, can utilize the more standard 01_dataloading.sh process we often use in our other data products, and have a proper ETL pipeline
resources/ directory and have either:
exported categories to edm-publishing on dev branch. action runs:
Housing Security Outputs:
units_affordable (eli, vli, li, mi, midi, hi) for 2017 - 2021 are not being populated in the housing security outputs
units_occupied_renter_1721 is not populated but is in the PUMS data sent by Winnie
The raw data can be accessed here: https://nyco365.sharepoint.com/:x:/r/sites/NYCPLANNING/itd/edm/_layouts/15/Doc.aspx?sourcedoc=%7B79FC4BE4-71E8-4082-B066-4DEA5DECEAA1%7D&file=EDDE_UnitsAffordablebyAMI_2017-2021.xlsx&action=default&mobileredirect=true
Thanks, @mbh329. There was one big section of logic specific to years in column names that I had missed, in the utils. See this commit
Latest build is now here. Other than local testing for units_affordable and units_housing_tenure, I haven't checked any files since these latest changes, but will pick this up in the morning
In QOL, "prematuremortality" columns are missing, and have 2019 in the header which seems like a pretty sure giveaway that something is off
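One rough way to catch headers like this before they reach the outputs is to scan the column names for year tokens that aren't expected in the current update. This is only a sketch: the column names and the allowed year below are hypothetical, and the check only looks at standalone 4-digit years, so two-digit ranges such as `1721` are not covered.

```python
import re

# Matches a standalone 4-digit year such as 2019 (not part of a longer number).
YEAR_TOKEN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def columns_with_unexpected_years(columns, allowed_years):
    """Return columns whose name contains a 4-digit year not in allowed_years."""
    allowed = set(allowed_years)
    flagged = []
    for name in columns:
        years = set(YEAR_TOKEN.findall(name))
        if years - allowed:
            flagged.append(name)
    return flagged


# Hypothetical column names, for illustration only.
columns = [
    "health_prematuremortality_2019",
    "health_diabetes",
    "units_occupied_renter_1721",
]
flagged = columns_with_unexpected_years(columns, allowed_years={"2020"})
```

With these made-up inputs, only the first column is flagged.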
Do the versions of the datasets in db-equitable-development-tool/ingest/data_library/datasets.yml need to be updated? I can't quickly tell if these are used or not.
If they're not used, I don't see any reason not to move the tables over to OSE (except for quality_of_life_puma.csv). I've spot-checked outputs against inputs here and see that the issues above are fixed or in the process of being fixed (namely the NTA issue for the 15 fields in quality_of_life_puma.csv).
@AmandaDoyle looks like anytime read_from_s3() is called, it's using the versions declared in the datasets.yml you mentioned
it's currently used in 9 places, all of them in housing_security/ and housing_production/. I think when I made changes to ingest the new transportation data, I replaced any use of s3 files with local files
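For re-checking where read_from_s3() is still called after further changes, something like the following could be run from the repo root. It is a sketch using only the standard library, not the method used for the count above, and it assumes every .py file under the root parses cleanly. The helper name `find_calls` is made up here.

```python
import ast
from pathlib import Path


def find_calls(root, func_name="read_from_s3"):
    """Return sorted (path, line) pairs for every call to func_name under root."""
    hits = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if not isinstance(node, ast.Call):
                continue
            func = node.func
            # Handles both read_from_s3(...) and module.read_from_s3(...).
            name = getattr(func, "id", None) or getattr(func, "attr", None)
            if name == func_name:
                hits.append((str(path), node.lineno))
    return sorted(hits)
```

Calling `find_calls(".")` returns the file and line of each call site, which makes it easy to confirm that the remaining uses are confined to housing_security/ and housing_production/.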
Putting this here for now, but while things are fresh just wanted to log pain points, things that have gone wrong
editing to add more thoughts
Goal: Provide updated EDDE data to OSE by the end of March.
App: https://equitableexplorer.planning.nyc.gov/map/data/district
2023 input data: here
Process:
housing_security and quality_of_life categories are to be updated
TODOs since we've reviewed new source data
edm-publishing/db-eddt/DEV_BRANCH_NAME/