2023 EDDE Update - Githubissues

AmandaDoyle commented 1 year ago

Goal: Provide updated EDDE data to OSE by the end of March.
App: https://equitableexplorer.planning.nyc.gov/map/data/district
2023 input data: here

Process:

Update source data tables here
May want to archive old tables
If there is no updated source data we are not updating that dataset. Leave existing data table in place.
Run the action to update the data for an EDDE category
- based on new source data, only housing_security and quality_of_life categories are to be updated
Troubleshoot as needed. Issues may come from a changed schema or changed underlying data. If this happens and the fix is not simple and obvious, let's quickly huddle to figure out how to address the problem.
Updated datasets are pushed up to DO here and these are what's delivered to OSE.

TODOs since we've reviewed new source data

[x] request new transportation data to get missing columns per details below
[x] archive old resources files
[x] add new files to repo resources
[x] do any necessary manual excel file and script edits
[x] when we get it, process and use new transportation data
[x] when we get it, use new PUMS data
[x] run Export Category action for relevant categories on dev branch
[x] confirm new data is in edm-publishing/db-eddt/DEV_BRANCH_NAME/
[x] run Export Category action for relevant categories on main branch

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

EDDE_2023_Updates_transportation.xlsx
- incomplete data
- access_subway_and_access_ADA.py
- ADA_Subway_Qtr_Mile_Access sheet: missing total population counts by PUMA (total_pop_from_census_2020) and missing the population within 1/4 mile of ADA subway stations (pop_within_1/4_mile_of_ada_subway_stations) which we need in order to aggregate to larger geographies
- Subway_SBS_Qr_Mile_Access sheet: missing total population counts by PUMA (total_pop_from_census_2020) and missing population within 1/4 mile (pop_within_1/4_mile_of_subway_stations_and_sbs_stops)
- access_to_openspace.py
- Park_Access sheet: missing total population counts by PUMA (Total_Pop20) and missing population served (Pop_Served)
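
To illustrate why the raw counts matter: a percentage for a larger geography has to be recomputed from summed numerators and denominators, not averaged across PUMAs. Below is a minimal sketch in plain Python. The `aggregate_access` helper, the `puma`/`borough` keys, and the numbers are all made up for illustration (none of it comes from the repo); only `Pop_Served` and `Total_Pop20` are column names from the sheet above.

```python
from collections import defaultdict


def aggregate_access(rows, geo_key, served_col, total_col):
    """Sum counts per geography, then compute percent of population served."""
    sums = defaultdict(lambda: [0, 0])  # geography -> [served, total]
    for row in rows:
        sums[row[geo_key]][0] += row[served_col]
        sums[row[geo_key]][1] += row[total_col]
    return {
        geo: round(100 * served / total, 1)
        for geo, (served, total) in sums.items()
    }


# Illustrative numbers only, not real EDDE data.
puma_rows = [
    {"puma": "A", "borough": "X", "Pop_Served": 90_000, "Total_Pop20": 100_000},
    {"puma": "B", "borough": "X", "Pop_Served": 20_000, "Total_Pop20": 200_000},
]
borough_pct = aggregate_access(puma_rows, "borough", "Pop_Served", "Total_Pop20")
```

With only the PUMA-level percentages (90% and 10%), a naive average would give 50%, while the count-based figure for the borough is 110,000 / 300,000, or about 36.7%.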

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

Equitable Development Data Tool - NYCHA - 1-27-2023 Final
- complete data and aligns with old schema
- nycha_tenants.py

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

EDDE - Math and ELA & Grad - 2022.csv
- complete data
- education_outcome.py
- need to add Student Performance codes and change NTACode values to include borough code

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

DOHMH_diabetes and self reported health
- complete data
- diabetes_self_report.py
- DCHP_Diabetes_SelfRepHealth sheet: boroughs and citywide aggregations have moved to first rows, combines Diabetes and Self Report Health into one sheet

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

DOHMH_death rate and overdose.xlsx
- health_mortality.py
- complete data with expected new columns
- formerly separate files for PUMA and Borough/City geographies are now combined in one file

damonmcc commented 1 year ago

reviewing source data in Sharepoint for completeness

COVID Deaths by race and PUMA_20230109.xlsx
- complete data
- covid_death.py
- Sheet 1: different number of pre-header rows with text
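
One way to make the script tolerant of a varying number of pre-header rows is to locate the header by its contents instead of hard-coding a row offset. This is only a sketch: `find_header_row` is a hypothetical helper, and the column names and rows are placeholders, not the actual layout of the COVID file.

```python
def find_header_row(rows, expected_columns):
    """Return the index of the first row that contains every expected column name."""
    expected = {c.strip().lower() for c in expected_columns}
    for i, row in enumerate(rows):
        cells = {str(cell).strip().lower() for cell in row if cell is not None}
        if expected <= cells:
            return i
    raise ValueError(f"no header row containing {sorted(expected)}")


# Placeholder rows standing in for a sheet read without a header.
rows = [
    ["Some descriptive note from the data provider", None, None],
    [None, None, None],
    ["PUMA", "Race", "Deaths"],
    ["0000", "All", 1],
]
header_idx = find_header_row(rows, ["PUMA", "Deaths"])
```

The resulting index could then be passed as the `header` argument of `pandas.read_excel` when re-reading the sheet, so a change in the provider's preamble text does not require a script edit.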

mbh329 commented 1 year ago

I think we can rename the new source data files to match the name of the indicator. This would help avoid confusion between the source data name and the name of the ingestion script for that indicator, e.g. NTA_data_prepared_for_ArcMap_wCodebook.xlsx -> education_outcome_source_data.xlsx

mbh329 commented 1 year ago

Notes on improvements:

One improvement we could make is to create some sort of data-library ingestion process for the disparate source data we receive from the various data providers. The data for this product is super specific and might benefit from living in a subfolder within data library while following all data-library naming conventions, e.g. the source data for the indicator nycha_tenants would be named nycha_tenants in the new subfolder. We could then properly archive data via data-library instead of archiving manually, use the more standard 01_dataloading.sh process we often use in our other data products, and have a proper ETL pipeline.
We should be very specific with Housing and our data providers about what we need from them. This is especially important for the column schema in our data, text left in Excel spreadsheets that our ETL doesn't need, naming conventions for each year of new data, etc. Although this is difficult to establish, it would be beneficial to send them the column schema we need and tell them whether we need aggregated numbers.

damonmcc commented 1 year ago

update

using the PR https://github.com/NYCPlanning/db-equitable-development-tool/pull/318
awaiting valid Transportation data from Winnie
all 5 other new source data files have been added to the resources/ directory and have either:
- not required any processing to align with previous file structure
- been manually processed to align with previous file structure
- not been manually processed to force new more ideal approaches in the ingest/aggregate script
Winnie has been informed of the concern with Education Outcomes data (the Other racial group is now split into 3 new groups). Other isn't shown in the EDDE application and hopefully OSE doesn't ingest it
while awaiting data, will ensure processed files are correctly ingested/aggregated by relevant scripts

damonmcc commented 1 year ago

exported categories to edm-publishing on dev branch. action runs:

damonmcc commented 1 year ago

first pass at using new source data was done in https://github.com/NYCPlanning/db-equitable-development-tool/pull/318
working on transportation and PUMS data

mbh329 commented 1 year ago

Housing Security Outputs:

units_affordable (eli, vli, li, mi, midi, hi) for 2017 - 2021 not being populated in the housing security outputs

units_occupied_renter_1721 not populated but are in the PUMS data sent by winnie

The raw data can be accessed here: https://nyco365.sharepoint.com/:x:/r/sites/NYCPLANNING/itd/edm/_layouts/15/Doc.aspx?sourcedoc=%7B79FC4BE4-71E8-4082-B066-4DEA5DECEAA1%7D&file=EDDE_UnitsAffordablebyAMI_2017-2021.xlsx&action=default&mobileredirect=true

fvankrieken commented 1 year ago

Housing Security Outputs:

units_affordable (eli, vli, li, mi, midi, hi) for 2017 - 2021 not being populated in the housing security outputs

units_occupied_renter_1721 not populated but are in the PUMS data sent by winnie

The raw data can be accessed here: https://nyco365.sharepoint.com/:x:/r/sites/NYCPLANNING/itd/edm/_layouts/15/Doc.aspx?sourcedoc=%7B79FC4BE4-71E8-4082-B066-4DEA5DECEAA1%7D&file=EDDE_UnitsAffordablebyAMI_2017-2021.xlsx&action=default&mobileredirect=true

Thanks, @mbh329 . There was one big section of logic specific to years in column names that I had missed, in the utils. See this commit

Latest build is now here - other than local testing for units_affordable and units_housing_tenure, I haven't checked any files since these latest changes, but will pick this up in the morning

fvankrieken commented 1 year ago

In QOL, "prematuremortality" columns are missing, and have 2019 in the header which seems like a pretty sure giveaway that something is off

AmandaDoyle commented 1 year ago

Do the versions of the datasets in db-equitable-development-tool/ingest/data_library/datasets.yml need to be updated? I can't quickly tell if these are used or not.

If they're not used, I don't see any reason not to move the tables over to OSE (except for quality_of_life_puma.csv). I've spot checked outputs against inputs here and see that the issues above are fixed or in the process of being fixed (namely the NTA issue for the 15 fields in quality_of_life_puma.csv).

damonmcc commented 1 year ago

@AmandaDoyle looks like anytime read_from_s3() is called, it's using the versions declared in the datasets.yml you mentioned

it's currently used in 9 places, all of them in housing_security/ and housing_production/. I think when I made changes to ingest the new transportation data, I replaced any use of s3 files with local files

fvankrieken commented 1 year ago

Putting this here for now, but while things are fresh just wanted to log pain points, things that have gone wrong

excel inputs obviously. Would love to get things into digital ocean, but then we also have to think a bit about
1. data cleaning of these inputs/validation, because they're pretty manually generated, and
2. making things easy on user's end (the user being me). It's not horribly difficult to just load a bunch of files to DO via data library but if we could batch by a whole folder of files that would be nice
related, but make sure all versions are getting into the output version files. These could also use a retool to align with other repos
There's one random web call that should be moved into data library, believe it's dhs shelters. Missed updating this on first go around
The reordering columns function simply drops columns not in the new generated ordering. This could 1. maybe use some logic to order based on the columns present, or 2. just fail if some column isn't included in the re-ordering. I think actually we have custom generated column ordering list fed into a standard pd function, so as I type this I realize that maybe that's just an option that needs to be fed to the function. This caused a couple issues, first during our review where there was missing data, and then a second issue flagged by WS where columns were simply missing - in this given indicator, we reordered columns and then dropped all empty columns, leading to a more silent failure. Could have been captured by actually just dumping out number of columns for all files and comparing, but also ideally should not have happened. See here, but just need to be a tad more thorough for changes.
Similar to above, maybe take some time to add explicit failures if data frames are ever empty
Final pain point was a little self-caused, I forgot to check in with Tyler/Winnie about inflation-adjusted numbers and it turns out that we were supposed to replace 08-12 acs data with the inflation adjusted. That's taken care of here, but just personally need to be a tad more thorough
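
For the column reordering and empty-data points above, here is a rough sketch of what explicit failures could look like. It is plain Python over column names, so it doesn't depend on how the repo's reorder helper is actually written; `reorder_columns_strict`, `require_rows`, and the column names are hypothetical. If the helper is built on something like `DataFrame.reindex(columns=...)`, note that reindexing drops columns missing from the list and fills listed-but-absent columns with NaN, and those are the two cases the check guards against.

```python
def reorder_columns_strict(columns, ordering):
    """Return `ordering` as the new column order, or fail loudly on any mismatch."""
    present, wanted = set(columns), set(ordering)
    dropped = sorted(present - wanted)  # in the data, absent from the ordering
    absent = sorted(wanted - present)   # in the ordering, absent from the data
    if dropped or absent:
        raise ValueError(
            f"column ordering mismatch: would drop {dropped}, missing {absent}"
        )
    return list(ordering)


def require_rows(name, rows):
    """Raise instead of letting an empty table flow through to the export."""
    if len(rows) == 0:
        raise ValueError(f"{name} is empty")
    return rows


# Hypothetical column names, for illustration only.
new_order = reorder_columns_strict(
    columns=["puma", "count_b", "count_a"],
    ordering=["puma", "count_a", "count_b"],
)
```

With a DataFrame, the same check could be run on `list(df.columns)` before selecting `df[new_order]`, and `require_rows` could be called on each frame just before it is written out.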

editing to add more thoughts

get rid of all extraneous files/functions in repo. Waiting on tagging and this until we're sure that the final version is out

NYCPlanning / db-equitable-development-tool

2023 EDDE Update #312

update