Process 2000 PUMS demographic data and 2000, 2010, 2020 Decennial Census demographic data from Erica

mbh329 commented 2 years ago

census_2000_PUMS.py

This script processes and exports the xlsx file Erica provided to DE with demographic related data points. In the original, xlsx file she provided there were some duplicate columns and columns that will need to be processed for the economic security category so I filtered these out. Columns required a bit of cleaning and chose to use regex to do this as there were a lot of columns that needed to be cleaned that followed pretty standard patterns - there might be a better way to do this but I couldn't think of anything.

decennial_census_001020.py

This script processes and exports the xlsx file Erica provided that contained demographic population data at the aggregated geography level from the decennial census in 2000, 2010, 2020. As this data is from the decennial census, there aren't any corresponding statistics (i.e cv, moe, etc). This was more straight forward and I decided to just rename the columns instead of using regex (as I had done with the other script).

internal_review

6 csv's total (3 for each script)

I tried following the naming conventions as closely as I could but honestly I could use some comments on my naming here, I was having trouble naming stuff.

decennial_census_exploratory Not much to see here, kinda messy and was just using it to test the regex and test code.

@AmandaDoyle @SashaWeinstein Looking forward to comments and what I can improve.

SashaWeinstein commented 2 years ago

Did a first pass on the code, looks clean and good. some comments and questions but all minor The one big change is that it needs a main accessor function that takes a geography and returns a dataframe with an index of that geography. Can change the existing functions or write a new one, either is ok.

Do we need to merge in the exploratory python notebook and filtered columns .csv? might make more sense to keep them on your local machine Max

Next going to pull down the branch and check the internal review files for accuracy and to give feedback on column names

mbh329 commented 2 years ago

Thanks! I'll make those changes

SashaWeinstein commented 2 years ago

I took a look at the internal review files and have some more feedback. This isn't a conclusive list of what needs to be done but hopefully it will helpful for next steps

Race breakdown for 2000, 2010, 2020

Geography should be the index so there isn't a blank leftmost column
There should be one file per year and no information about the year in the column headers

All indicators 2000

Don't need 00 in column headers
Don't need z-scores
Some columns I don't recognize such bchd or p25pl

AmandaDoyle commented 2 years ago

Here is my feedback on the outputs:

Decennial

[x] Update output tables. These are the tables we scoped out for Demographics, so unlike QOL years are broken out into separate tables for Demographic and Economic categories.
[x] Remove total_ from field names
[x] Remove the 'year' from the field name since they'll be in separate tables (i.e. would be pop_wnh_pct) Otherwise numbers look good. This would cover everything required for the Race and Hispanic origin indicator

00 PUMS

[ ] Remove total_ from field names
[ ] Remove the 'year' from the field name since they'll be in separate tables
[ ] Remove _est from field names
[ ] Update _zscore to _pct_moe
[ ] Limit the Demographic outputs to only include demographic indicators (see note below on root names). Less than HS and Pop 25 plus, for example, are part of the economic output. Otherwise numbers look good. This would cover the 00 data required for the Demographic category

Below are notes on what the root of the field name for each indicator is (root names do not need to be changed for 00 PUMS, or 12 or 19 outputs):

<16 = popu16
16-64 = p16t64
">=65" = p65pl
Median age = mdage
Limited English speaking population = lep
Total pop >5 = p5pl (add to 12 and 19 outputs if it's an easy lift)
Foreign born population = fb

The names and order for common fields should be the same across files. cc @SashaWeinstein since these field name conventions impact the 2012 and 2019 outputs.

mbh329 commented 2 years ago

Great - I am working on the decennial census script now.

@SashaWeinstein bchd is bachelors degree (educational related indicators) and the p25pl is the population over 25 that those indicators use as the denominator

mbh329 commented 2 years ago

@AmandaDoyle sorry about that, it is fixed now

SashaWeinstein commented 2 years ago

Sorry I think my directions should have been more clear. You did a good job changing the functions so that they take a geography parameter and pass back the correct dataframe. Like the heat vulnerability your geography-specific functions can be DRY'd out in the same way but they are ok for now.

What I should have been clearer on is that the main accessor will take a year and a geography. So decennial_census_data should take a third parameter of year and only pass back the data for that year. It shouldn't be to hard to implement this change but I should have told you when I told you about the geography parameter. You can generate all 3 dataframes for each call of decennial_census_data if you want and only pass back the ones for the corresponding year or change create_<geography>_level_df_by_year to take a year parameter and only pass back one dataframe. either is ok, code will run fast so redundancy doesn't matter

SashaWeinstein commented 2 years ago

This looks good to me! Column names look good, indices look good

Some minor things:

instead of hardcoding borough abbreviations you can call the function from PUMA helpers
instead of getting PUMAs like df.loc["3701":"4114"] I would filter by GeogType column.

Don't worry about these now if you don't have to. I'm creating a bunch of issues for simple refactors in issues with the enhancement label, we can document these potential upgrades there and forget about them for now

AmandaDoyle commented 2 years ago

approved! this seemed like such an annoying slog, thanks for parsing through all this Max

Cannot be merged in because I have not approved final field names for PUMS

SashaWeinstein commented 2 years ago

Oh ok sorry think I might be a step behind. Can we say that the code to get the total pop by race from the decennial census is ok? That was the only thing on my mind when I hit "approve", didn't think about the other survey we are looking at here

NYCPlanning / db-equitable-development-tool

Process 2000 PUMS demographic data and 2000, 2010, 2020 Decennial Census demographic data from Erica #181

Race breakdown for 2000, 2010, 2020

All indicators 2000