NYCPlanning / db-equitable-development-tool

Data Repo for the equitable development tool (EDDT)
MIT License
0 stars 0 forks source link

Process 2000 PUMS demographic data and 2000, 2010, 2020 Decennial Census demographic data from Erica #181

Closed mbh329 closed 2 years ago

mbh329 commented 2 years ago

census_2000_PUMS.py

This script processes and exports the xlsx file Erica provided to DE with demographic related data points. In the original, xlsx file she provided there were some duplicate columns and columns that will need to be processed for the economic security category so I filtered these out. Columns required a bit of cleaning and chose to use regex to do this as there were a lot of columns that needed to be cleaned that followed pretty standard patterns - there might be a better way to do this but I couldn't think of anything.

decennial_census_001020.py

This script processes and exports the xlsx file Erica provided that contained demographic population data at the aggregated geography level from the decennial census in 2000, 2010, 2020. As this data is from the decennial census, there aren't any corresponding statistics (i.e cv, moe, etc). This was more straight forward and I decided to just rename the columns instead of using regex (as I had done with the other script).

internal_review

6 csv's total (3 for each script)

I tried following the naming conventions as closely as I could but honestly I could use some comments on my naming here, I was having trouble naming stuff.

decennial_census_exploratory Not much to see here, kinda messy and was just using it to test the regex and test code.

@AmandaDoyle @SashaWeinstein Looking forward to comments and what I can improve.

SashaWeinstein commented 2 years ago

Did a first pass on the code, looks clean and good. some comments and questions but all minor The one big change is that it needs a main accessor function that takes a geography and returns a dataframe with an index of that geography. Can change the existing functions or write a new one, either is ok.

Do we need to merge in the exploratory python notebook and filtered columns .csv? might make more sense to keep them on your local machine Max

Next going to pull down the branch and check the internal review files for accuracy and to give feedback on column names

mbh329 commented 2 years ago

Thanks! I'll make those changes

SashaWeinstein commented 2 years ago

I took a look at the internal review files and have some more feedback. This isn't a conclusive list of what needs to be done but hopefully it will helpful for next steps

Race breakdown for 2000, 2010, 2020

All indicators 2000

AmandaDoyle commented 2 years ago

Here is my feedback on the outputs:

Decennial

00 PUMS

Below are notes on what the root of the field name for each indicator is (root names do not need to be changed for 00 PUMS, or 12 or 19 outputs):

The names and order for common fields should be the same across files. cc @SashaWeinstein since these field name conventions impact the 2012 and 2019 outputs.

mbh329 commented 2 years ago

Great - I am working on the decennial census script now.

@SashaWeinstein bchd is bachelors degree (educational related indicators) and the p25pl is the population over 25 that those indicators use as the denominator

mbh329 commented 2 years ago

@AmandaDoyle sorry about that, it is fixed now

SashaWeinstein commented 2 years ago

Sorry I think my directions should have been more clear. You did a good job changing the functions so that they take a geography parameter and pass back the correct dataframe. Like the heat vulnerability your geography-specific functions can be DRY'd out in the same way but they are ok for now.

What I should have been clearer on is that the main accessor will take a year and a geography. So decennial_census_data should take a third parameter of year and only pass back the data for that year. It shouldn't be to hard to implement this change but I should have told you when I told you about the geography parameter. You can generate all 3 dataframes for each call of decennial_census_data if you want and only pass back the ones for the corresponding year or change create_<geography>_level_df_by_year to take a year parameter and only pass back one dataframe. either is ok, code will run fast so redundancy doesn't matter

SashaWeinstein commented 2 years ago

This looks good to me! Column names look good, indices look good

Some minor things:

  1. instead of hardcoding borough abbreviations you can call the function from PUMA helpers
  2. instead of getting PUMAs like df.loc["3701":"4114"] I would filter by GeogType column.

Don't worry about these now if you don't have to. I'm creating a bunch of issues for simple refactors in issues with the enhancement label, we can document these potential upgrades there and forget about them for now

AmandaDoyle commented 2 years ago

approved! this seemed like such an annoying slog, thanks for parsing through all this Max

Cannot be merged in because I have not approved final field names for PUMS

SashaWeinstein commented 2 years ago

Oh ok sorry think I might be a step behind. Can we say that the code to get the total pop by race from the decennial census is ok? That was the only thing on my mind when I hit "approve", didn't think about the other survey we are looking at here