Closed mbh329 closed 2 years ago
Did a first pass on the code, looks clean and good. some comments and questions but all minor The one big change is that it needs a main accessor function that takes a geography and returns a dataframe with an index of that geography. Can change the existing functions or write a new one, either is ok.
Do we need to merge in the exploratory python notebook and filtered columns .csv? might make more sense to keep them on your local machine Max
Next going to pull down the branch and check the internal review files for accuracy and to give feedback on column names
Thanks! I'll make those changes
I took a look at the internal review files and have some more feedback. This isn't a conclusive list of what needs to be done but hopefully it will helpful for next steps
Here is my feedback on the outputs:
Decennial
total_
from field namespop_wnh_pct
)
Otherwise numbers look good. This would cover everything required for the Race and Hispanic origin indicator00 PUMS
total_
from field names_est
from field names_zscore
to _pct_moe
Below are notes on what the root of the field name for each indicator is (root names do not need to be changed for 00 PUMS, or 12 or 19 outputs):
popu16
p16t64
p65pl
mdage
lep
p5pl
(add to 12 and 19 outputs if it's an easy lift)fb
The names and order for common fields should be the same across files. cc @SashaWeinstein since these field name conventions impact the 2012 and 2019 outputs.
Great - I am working on the decennial census script now.
@SashaWeinstein bchd is bachelors degree (educational related indicators) and the p25pl is the population over 25 that those indicators use as the denominator
@AmandaDoyle sorry about that, it is fixed now
Sorry I think my directions should have been more clear. You did a good job changing the functions so that they take a geography parameter and pass back the correct dataframe. Like the heat vulnerability your geography-specific functions can be DRY'd out in the same way but they are ok for now.
What I should have been clearer on is that the main accessor will take a year and a geography. So decennial_census_data
should take a third parameter of year and only pass back the data for that year. It shouldn't be to hard to implement this change but I should have told you when I told you about the geography parameter. You can generate all 3 dataframes for each call of decennial_census_data
if you want and only pass back the ones for the corresponding year or change create_<geography>_level_df_by_year
to take a year parameter and only pass back one dataframe. either is ok, code will run fast so redundancy doesn't matter
This looks good to me! Column names look good, indices look good
Some minor things:
df.loc["3701":"4114"]
I would filter by GeogType
column. Don't worry about these now if you don't have to. I'm creating a bunch of issues for simple refactors in issues with the enhancement label, we can document these potential upgrades there and forget about them for now
approved! this seemed like such an annoying slog, thanks for parsing through all this Max
Cannot be merged in because I have not approved final field names for PUMS
Oh ok sorry think I might be a step behind. Can we say that the code to get the total pop by race from the decennial census is ok? That was the only thing on my mind when I hit "approve", didn't think about the other survey we are looking at here
census_2000_PUMS.py
This script processes and exports the xlsx file Erica provided to DE with demographic related data points. In the original, xlsx file she provided there were some duplicate columns and columns that will need to be processed for the economic security category so I filtered these out. Columns required a bit of cleaning and chose to use regex to do this as there were a lot of columns that needed to be cleaned that followed pretty standard patterns - there might be a better way to do this but I couldn't think of anything.
decennial_census_001020.py
This script processes and exports the xlsx file Erica provided that contained demographic population data at the aggregated geography level from the decennial census in 2000, 2010, 2020. As this data is from the decennial census, there aren't any corresponding statistics (i.e cv, moe, etc). This was more straight forward and I decided to just rename the columns instead of using regex (as I had done with the other script).
internal_review
6 csv's total (3 for each script)
I tried following the naming conventions as closely as I could but honestly I could use some comments on my naming here, I was having trouble naming stuff.
decennial_census_exploratory
Not much to see here, kinda messy and was just using it to test the regex and test code.@AmandaDoyle @SashaWeinstein Looking forward to comments and what I can improve.