NYCPlanning / db-developments

🏠 🏘️ 🏗️ Developments Database
https://nycplanning.github.io/db-developments
8 stars 2 forks source link

Replace extra whitespace in address fields #559

Closed SashaWeinstein closed 2 years ago

SashaWeinstein commented 2 years ago

First round of updates in for the 22Q2 enhancements 🏠 ! I started with the easiest and it was still a little hard

Addresses issue #557 which was slightly misdiagnosed. The extra whitespace is within the street name, not between the house number and the street name.

I chose to fix this in the geocode.py file as this is the farthest upstream, being called in the data sync step before the build starts. Thanks to @td928 who helped me figure out some address values in the source data had four spaces. I think it makes sense for Amanda or Max to review this as Te and I worked together

mbh329 commented 2 years ago

Sam mentioned this problem exists when he views the data in excel correct? Does it make sense to run a build on this branch and check the data in excel or is that overkill?

mbh329 commented 2 years ago

Nvm I just downloaded final_devdb from postico and looked at it in excel. I am still seeing extra spaces in address_st and address columns. If its an issue with the final tables, why are we focusing on an upstream process? Can we just trim the whitespace of all address columns right before we export the tables - I also imagine we could probably trim extra spaces across all the columns

SashaWeinstein commented 2 years ago

This upgrade changes the code run in the data sync step which is different from the build. So running the build from this branch won't test the new code. I know Te was working on running the data sync itself

SashaWeinstein commented 2 years ago

We had a conversation in teams that probably should have happened here

SashaWeinstein commented 2 years ago

When I try to re-download dob_geocode_results I get this error

psql:/Users/alexanderweinstein/Documents/Data_Products/db-developments/.library/datasets/dob_geocode_results/20220831/dob_geocode_results.sql:4: ERROR:  relation "dob_geocode_results_pk" already exists

So I ran

DROP INDEX dob_geocode_results_pk

And got

ERROR:  cannot drop index dob_geocode_results_pk because constraint dob_geocode_results_pk on table _geo_devdb requires it
HINT:  You can drop constraint dob_geocode_results_pk on table _geo_devdb instead.

So finally I ran

DROP TABLE _geo_devdb

which allowed the dataloading to work. Do we want to add this code to the dataloading step? If so I can open an issue and implement a fix

SashaWeinstein commented 2 years ago

I'm trying to figure out how the addresses in dob now have this issue w/ extra spaces that we filter out and then a very similar error gets re-introduced when we geocode. I'm realizing now that the addresses in the source data could have been from the same geosupport bug. Does this seem plausible?

AmandaDoyle commented 2 years ago

AD feedback - Moving this to the data loading / geocoding step should have no negative downstream impact and removal of white spaces should flow downstream and appear in final product.
My one question is can we simplify the logic to use strip() instead of REGEXP_REPLACE? I think it's worth talking through why you were getting the dob_geocode_results error before landing on a solution. Re th last comment you left about geocoding errors, I'm not following. Should we hop on a call? I'm free until 2

SashaWeinstein commented 2 years ago

Yea I think us two and (@td928 if he wants, he figured out this error) should talk this through. I just started lunch, can we chat at 1:30?

td928 commented 2 years ago

When I try to re-download dob_geocode_results I get this error

psql:/Users/alexanderweinstein/Documents/Data_Products/db-developments/.library/datasets/dob_geocode_results/20220831/dob_geocode_results.sql:4: ERROR:  relation "dob_geocode_results_pk" already exists

So I ran

DROP INDEX dob_geocode_results_pk

And got

ERROR:  cannot drop index dob_geocode_results_pk because constraint dob_geocode_results_pk on table _geo_devdb requires it
HINT:  You can drop constraint dob_geocode_results_pk on table _geo_devdb instead.

So finally I ran

DROP TABLE _geo_devdb

which allowed the dataloading to work. Do we want to add this code to the dataloading step? If so I can open an issue and implement a fix

yep I think this fix should work and maybe don't even need another PR just for this

SashaWeinstein commented 2 years ago

I think we do, it's separate from fixing the whitespace. I'll open an issue and a PR today

mbh329 commented 2 years ago

Sorry hold off on merging, I thought my changes requested was holding up the merge but I didn't fully review

SashaWeinstein commented 2 years ago

No need to apologize, think it's good to flag a PR when you think there is an issue. Just wanted to make sure your request for changes wouldn't block the merge

mbh329 commented 2 years ago

@td928 @SashaWeinstein What's the final table that needs to be reviewed? Does the data sync still have to be completed to properly review the output?

SashaWeinstein commented 2 years ago

Good question! The way to figure that out is to use the "view commits since your last review" option under the files changed tab. There you can see what has changed most recently which will help answer your question I think

mbh329 commented 2 years ago

Got it. Was confused because yesterday I was told final_devdb was not the table to look at it and that a data sync step needed to be completed but the latest code you wrote makes it easier to follow. Lgtm, no more extra spaces