deployment-gap-model-education-fund / deployment-gap-model

ETL code for the Deployment Gap Model Education Fund
https://www.deploymentgap.fund/
MIT License

Pull v2022.11.30 PUDL data from AWS #279

Closed bendnorman closed 1 year ago

bendnorman commented 1 year ago

This PR updates the MCOE table to use data from 2021.

pudl.sqlite is now pulled from AWS instead of Zenodo because doing so 1) is faster, 2) makes it easier to specify the version, and 3) avoids downloading all of the PUDL code and data.
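For illustration only, here is a minimal sketch of what pinning the download to a version could look like. The base URL is a placeholder and the `<base>/<version>/pudl.sqlite` layout is an assumption, not the actual bucket structure; `pudl_sqlite_url` and `download_pudl_sqlite` are hypothetical helper names, not functions from this repo.

```python
from pathlib import Path
from urllib.request import urlretrieve

# The version tag comes from this PR's title. The base URL is a placeholder,
# not the real bucket address.
PUDL_VERSION = "v2022.11.30"
BASE_URL = "https://example.com/pudl"


def pudl_sqlite_url(version: str = PUDL_VERSION, base_url: str = BASE_URL) -> str:
    """Build the download URL for one pinned PUDL release (assumed layout)."""
    return f"{base_url.rstrip('/')}/{version}/pudl.sqlite"


def download_pudl_sqlite(
    dest_dir: str, version: str = PUDL_VERSION, base_url: str = BASE_URL
) -> Path:
    """Download pudl.sqlite once per version and return the local path."""
    dest = Path(dest_dir) / version / "pudl.sqlite"
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Only the single SQLite file is fetched, not the full archive.
        urlretrieve(pudl_sqlite_url(version, base_url), str(dest))
    return dest
```

Keeping the version in one constant means a future data update should only need that string changed.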

@TrentonBush I'm getting a data validation error on line 251 in dbcp.data_mart.projects. Looks like there are more than 2 locations found for 2 projects. I'm not sure why this changed because this PR doesn't touch the project data. Maybe the pandas update changed the behavior of a function?

bendnorman commented 1 year ago

I'm a little confused by the assert statement on line 251. It's checking to make sure a project doesn't have multiple locations, but projects is used to create two location columns for a project.

Also, when I remove the assert statement, the ETL finishes without any db constraint failures.

TrentonBush commented 1 year ago

That assert is checking that there aren't more than 2 locations for any project, because the wide format data creates county_1 and county_2 columns. I can check what project is turning up with more than 2 locations and try to work backwards to figure out why that happened.
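Roughly, the check amounts to something like the sketch below. This is not the actual code at line 251; the function name and the `project_id` and `county_id_fips` column names are assumptions for illustration.

```python
import pandas as pd


def check_max_locations(locations: pd.DataFrame, max_locations: int = 2) -> pd.Series:
    """Count location rows per project and fail if any project exceeds the limit.

    Assumes a long-format frame with one row per (project, location) pair and
    hypothetical columns `project_id` and `county_id_fips`.
    """
    # size() counts every row, including rows whose location is null.
    counts = locations.groupby("project_id")["county_id_fips"].size()
    offenders = counts[counts > max_locations]
    assert offenders.empty, (
        f"Projects with more than {max_locations} locations: {offenders.to_dict()}"
    )
    return counts
```

Printing the offending project IDs in the assertion message makes it quicker to work backwards from a failure.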

TrentonBush commented 1 year ago

I don't know why this only happened with the PUDL update, but there was one project that got split into 4 entries because of bad location information -> geocoding failure -> null locations that didn't get dropped. I never got to the root cause, but it's only one project so I just fixed it manually.
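A more general guard than the manual fix could be to drop null locations before pivoting to the wide format. The sketch below is not what was committed; the function name and the `project_id` and `county` column names are assumptions for illustration.

```python
import pandas as pd


def locations_to_wide(locations: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format project locations into county_1, county_2, ... columns.

    Assumes hypothetical columns `project_id` and `county`. Rows with a null
    county (e.g. from a geocoding failure) are dropped before numbering.
    """
    cleaned = locations.dropna(subset=["county"]).drop_duplicates(
        ["project_id", "county"]
    )
    # Number each project's remaining locations 1, 2, ... in row order.
    cleaned = cleaned.assign(slot=cleaned.groupby("project_id").cumcount() + 1)
    wide = cleaned.pivot(index="project_id", columns="slot", values="county")
    wide.columns = [f"county_{i}" for i in wide.columns]
    return wide
```

Note that a project whose locations are all null drops out of the result entirely, so the caller would need to decide whether to merge those projects back in.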

bendnorman commented 1 year ago

Thanks for fixing it! It could have been a pandas 1.4 change that broke something. Should we merge it in?