NYCPlanning / db-factfinder

data ETL for population fact finder (decennial + acs)
https://nycplanning.github.io/db-factfinder/factfinder/
MIT License

#222: Excel to Output Schema Transformation Script #226

Closed · abrieff closed this 2 years ago

abrieff commented 2 years ago

Addresses #222

Adds a script to the pipelines folder that takes the given Excel files and transforms them into the output schema for the ACS CSV. It overwrites the existing ACS file, with the expectation that this will not be merged into the main branch.

The script uses regex logic to strip the field name from the Excel sheet, and then pivots the table on each of the individual fields.
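For illustration, here is a minimal sketch of that strip-and-pivot step in pandas. The column names (geoid, field, value), the suffix convention, and the regex are all assumptions made for the example; the actual script's logic and output schema may differ.

```python
import re
import pandas as pd

# Hypothetical suffix convention: a field like "pop_2020e" ends in a one-letter
# measure code (estimate, MOE, etc.); the real script's pattern may differ.
SUFFIX_PATTERN = re.compile(r"^(?P<variable>.+?)(?P<measure>[emcpz])$")


def split_field(field: str) -> tuple[str, str]:
    """Strip the measure suffix off a raw Excel field name."""
    match = SUFFIX_PATTERN.match(field)
    if match is None:
        return field, "e"  # fall back to treating it as an estimate
    return match.group("variable"), match.group("measure")


def to_output_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Melt the wide Excel sheet into long form, split the field names,
    then pivot so each measure becomes its own column."""
    long = df.melt(id_vars=["geoid"], var_name="field", value_name="value")
    long[["variable", "measure"]] = long["field"].apply(
        lambda f: pd.Series(split_field(f))
    )
    return long.pivot(
        index=["geoid", "variable"], columns="measure", values="value"
    ).reset_index()
```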

To run the script, first run poetry install to make sure the additional package is installed, then run:

poetry run python pipelines/acs2020_manual_update.py -i <file_name> -g <year of ACS geography, e.g. 2010_to_2020> -y <year of ACS5, e.g. 2019 (2014-2018)>
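For reference, a rough sketch of how those flags might be parsed with argparse; the short flag names match the command above, but the long names, help text, and validation are assumptions, not the script's actual code.

```python
import argparse


def parse_args() -> argparse.Namespace:
    # Flag names mirror the command shown above; everything else is illustrative.
    parser = argparse.ArgumentParser(description="Manual ACS 2020 update")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the source Excel file")
    parser.add_argument("-g", "--geography", required=True,
                        help="year of ACS geography, e.g. 2010_to_2020")
    parser.add_argument("-y", "--year", required=True,
                        help="year of ACS5, e.g. 2019 (2014-2018)")
    return parser.parse_args()
```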

SashaWeinstein commented 2 years ago

I think you're off to a good start! I'll go through your questions one by one:

1. I'm not sure where the files should live; I can dig into this after this initial review. I think we want to work within the existing file structure. Since we will never merge into main, we don't have to worry about overwriting the more robust process that was in place.
2. The geo/variable mappings shouldn't be part of this PR; they are their own ticket.
3. I think it's smart to add poetry, and it's always fine to add new packages.
4. Don't worry too much about speed.

mbh329 commented 2 years ago

The general idea here is that there will essentially be a script to handle each ticket (?). Does it make sense for the data to live in one folder with different subfolders that indicate its progress through the "one-off" ETL pipeline?

abrieff commented 2 years ago

@mbh329 - I couldn't decide whether there should be a script for each ticket or whether they should all be combined into one script. I think your suggestion about the folder/subfolder structure makes sense, though.

SashaWeinstein commented 2 years ago

I'm not sure that "one script per ticket" is the right heuristic, but we definitely don't want all the code for this build to be in one script. I also think it would be a mistake to add a new folder for our new files to live in. The existing file structure contains a lot of implicit cues to push our design in the right direction.

SashaWeinstein commented 2 years ago

Ok, so I really misunderstood how the existing ACS pipeline worked and gave bad advice this morning; my apologies. I think your approach of outputting the acs.csv from this one .py file was correct. I think it should actually overwrite acs.py, and we can use GitLens to view the previous version as a reference. Then, as next steps, we can borrow from the column mapping in acs_community_profiles.py and add that code to the new acs.py. Does this seem right to you, Andrew?
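As a rough illustration of the column-mapping step being proposed here, with made-up variable names (the real mapping lives in acs_community_profiles.py and the output schema):

```python
import pandas as pd

# Made-up mapping for illustration only; the real names come from
# acs_community_profiles.py and the output schema.
COLUMN_MAP = {
    "pop_total": "Pop_1E",
    "mdhhinc": "MdHHIncE",
}


def apply_column_mapping(df: pd.DataFrame) -> pd.DataFrame:
    """Rename pipeline variables to the names the output schema expects."""
    return df.rename(columns=COLUMN_MAP)
```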

abrieff commented 2 years ago

@SashaWeinstein that seems right to me.

SashaWeinstein commented 2 years ago

For reasons I don't understand, I always need to run Python from the command line with the -m flag. So to get the code to run, I ran:

poetry run python3 -m pipelines.acs_2020_manual_update -i ACS1620Data_06-23-22.xlsx -g 2010_to_2020 -y 2019

I'm not really sure why I need the -m flag.
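One likely explanation, offered as an assumption rather than something confirmed in this thread: when a file is run by path, Python puts the script's own directory (pipelines/) at the front of sys.path, so absolute imports of sibling modules or the top-level factfinder package can fail; running with -m from the repo root keeps the working directory on sys.path and resolves pipelines as a package.

```python
# Quick way to see the difference (run from the repo root):
#
#   poetry run python3 pipelines/acs_2020_manual_update.py ...
#       -> sys.path[0] is .../db-factfinder/pipelines
#   poetry run python3 -m pipelines.acs_2020_manual_update ...
#       -> sys.path[0] is the current working directory, .../db-factfinder
import sys

print(sys.path[0])
```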